TP-MT-instructions
Practical work on Statistical MT with Moses

EPFL HLT Course, November 2014 – Andrei Popescu-Belis

0. Overview

The goal is to install (and test) Moses, then to build a small translation model in a language pair of your choice, and to run Moses in order to translate new sentences. Firstly this should be done with the default options, then gradually you can try to change things in order to improve the output.

For documentation, see http://www.statmt.org/moses/; an offline PDF version of the manual is available at http://www.statmt.org/moses/manual/manual.pdf.

For installation and simple testing, follow the instructions of "Getting Started with Moses" in the PDF manual or look at http://www.statmt.org/moses/?n=Development.GetStarted.

For training a translation model and running Moses, follow the instructions of "Baseline System" in the PDF manual or at http://www.statmt.org/moses/?n=Moses.Baseline.

1. Obtain Moses and install required software

Follow the instructions at http://www.statmt.org/moses/?n=Development.GetStarted, but look also at http://www.statmt.org/moses/?n=Moses.Baseline for getting GIZA++.

Download Moses using the git version control system (which should be available on Mac and Linux, but needs installation for Cygwin): git clone git://github.com/moses-smt/mosesdecoder.git -- this creates a folder called 'mosesdecoder'.

There are three requirements:

• Boost C++ libraries (see http://www.boost.org)

• GIZA++ word alignment software (http://code.google.com/p/giza-pp/)

• a language modeling toolkit (KenLM is provided, but SRILM or IRSTLM can be used too)

Boost should already be installed on Linux; for Mac, you need to get the MacPorts application; for Cygwin, follow the Linux instructions at http://www.boost.org. The installation via Cygwin's setup.exe did not work well for me, so I used the Linux instructions, which can be found locally at /usr/local/boost_1_57_0/more/getting_started/unix-variants.html. Check also the suggestions from the Moses manual under "Manually installing Boost" (basically, running 'bootstrap.sh' and then 'b2 install'). As a result, there should be many files under /usr/local/lib/libboost_* and even more in /usr/local/include/boost/. The Boost source files are in /usr/local/boost_1_52_0/. All these paths can be adjusted, so if you have problems, make sure that GIZA++ and Moses do find the Boost libraries.

For GIZA++, read "Installing GIZA++" at http://www.statmt.org/moses/?n=Moses.Baseline. Get the code from the web with: wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz, unpack it with: tar xzvf giza-pp-v1.0.7.tar.gz, then go to the giza-pp directory and run make. This will compile the code (no errors should appear) and generate binaries called GIZA++, snt2cooc.out, and mkcls (.exe files on Cygwin). You might need to copy them into a folder visible to Moses (such as mosesdecoder/bin), or tell Moses where they are (it will complain if it doesn't find them).

Language model: Moses comes with its own language modeling tool, KenLM, which can be used both for creating and for querying the LM. You tell Moses which LM to use in the moses.ini file (see below). But you can also choose between SRILM and IRSTLM. Fortunately, all three produce the same output format, which can be binarized or not.
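Conceptually, an n-gram LM just stores the probabilities of target-language words given their preceding history, estimated from the training corpus. The following toy sketch computes unsmoothed maximum-likelihood bigram probabilities; real toolkits (SRILM, IRSTLM, KenLM) additionally apply smoothing such as Kneser-Ney, so the actual values differ:

```python
from collections import Counter

def bigram_mle(sentences):
    """Maximum-likelihood bigram probabilities from tokenized sentences.
    Real LM toolkits add smoothing; this is only an illustration."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]  # sentence boundary markers
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return {bg: bigrams[bg] / unigrams[bg[0]] for bg in bigrams}

probs = bigram_mle([["the", "house"], ["the", "car"]])
```

With this toy corpus, P(the | <s>) is 1.0 and P(house | the) is 0.5, since "the" is followed by "house" and "car" equally often.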

2. Compile Moses

Follow the instructions at http://www.statmt.org/moses/?n=Development.GetStarted, but read also the file mosesdecoder/BUILD-INSTRUCTIONS.txt. Note: the web document called "Moses on Windows 7" which is linked from the Moses website seems deprecated.

In principle, compiling Moses can be as easy as running "cd mosesdecoder" and "./bjam" with some options (bjam is the Boost build tool; try "./bjam --help" for instance). In any case, compilation takes quite some time.

The most relevant options to "bjam" are (see also Moses GetStarted):

• indicate the number of CPU cores to be used on your computer (e.g., "./bjam -j2" or "-j8")

• indicate which LM toolkit to use (e.g., "./bjam --with-srilm=/path/to/srilm"). Note: KenLM is always compiled, and there is no absolute need to compile with SRILM or IRSTLM. If you get compilation errors, you might try to compile first without an LM. In fact, on Cygwin, it is not possible to compile with SRILM or IRSTLM. Even if they are not compiled with Moses, these tools can still be used to create LMs.

• stop on the first error with '-q' (not obligatory, but may help upon the first attempts; in fact, some compilation errors are not blocking for later use; also, expect lots of warnings; so in the end it is better not to use '-q').

• clean everything with '--clean' (useful if you want to restart after some errors and changes).

As a result, the compilation should terminate correctly, and you should see mosesdecoder/bin/moses among many other executables.

3. Test Moses with the provided TM and LM

Keep following the instructions at http://www.statmt.org/moses/?n=Development.GetStarted, in the section "Run it for the first time".

Get the (small) sample models from http://www.statmt.org/moses/download/sample-models.tgz and unpack them.

Modify if needed the phrase-model/moses.ini file to use the appropriate language model toolkit that you compiled with Moses, or just KenLM, which is always available. To use KenLM, write under [lmodel-file]: 8 0 3 lm/europarl.srilm.gz

Translate the example provided in the file 'phrase-model/in' by running this command: mosesdecoder/bin/moses -f phrase-model/moses.ini < phrase-model/in > out

Alternatively, run Moses without 'in' and 'out' files, and type a sentence in German and press Enter: how well does it translate? Note that for this test TM and LM, the German and English vocabulary is extremely small. Take a look at phrase-model/phrase-table (it’s a readable text file) and count how many different words and phrases are stored. Try to vary your input sentences and see what happens.
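The phrase table can also be inspected programmatically. Below is a minimal parsing sketch, assuming the common " ||| "-separated layout (source phrase, target phrase, then scores; the exact number and meaning of fields varies across Moses versions, and the sample lines here are made up):

```python
def parse_phrase_table(lines):
    """Parse Moses phrase-table lines of the form
    'source phrase ||| target phrase ||| score1 score2 ...'."""
    entries = []
    for line in lines:
        fields = line.rstrip("\n").split(" ||| ")
        source, target = fields[0], fields[1]
        scores = [float(s) for s in fields[2].split()]
        entries.append((source, target, scores))
    return entries

# Hypothetical sample lines, not taken from the actual file
sample = ["das Haus ||| the house ||| 0.8 0.7",
          "das ||| the ||| 0.9 0.6"]
entries = parse_phrase_table(sample)
# Count the distinct source-language words covered by the table
source_words = {w for s, _, _ in entries for w in s.split()}
```

Running the same kind of count over phrase-model/phrase-table shows how tiny the vocabulary of the sample model really is.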

The following two steps can also be performed in Section 4, with a new (larger) TM and LM:

• try to understand the messages and some options of the Moses decoder (bin/moses) by following the explanations given at http://www.statmt.org/moses/?n=Moses.Tutorial (named "Phrase-based Tutorial"); for instance, try the "trace" and the "verbose" options.

• try to manually tune the weights of the factors with the following parameters: weight-t, weight-l, weight-d, and weight-w (in moses.ini) or -t / -l / -d / -w in the command-line options. See "Tuning for quality" in the above tutorial for some suggestions.
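These weights combine the model components log-linearly: the decoder ranks each candidate translation by a weighted sum of its log feature scores, so changing one weight re-ranks the hypotheses. A sketch of that combination, with entirely hypothetical feature values and weights:

```python
def loglinear_score(features, weights):
    """Moses ranks each translation hypothesis by a weighted sum of its
    log feature scores (translation model, LM, distortion, word penalty)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical log feature values for one candidate translation
features = {"tm": -2.0, "lm": -3.0, "distortion": -1.0, "word_penalty": -4.0}
# Weights playing the role of -t / -l / -d / -w (values are made up)
weights = {"tm": 0.2, "lm": 0.5, "distortion": 0.3, "word_penalty": -1.0}
score = loglinear_score(features, weights)
```

Raising the LM weight, for instance, makes fluent but less literal candidates win more often; that is what manual (and later automatic) tuning exploits.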

4. Learn a new TM and LM, use them to translate new sentences

Create a folder for your new experiments and follow http://www.statmt.org/moses/?n=Moses.Baseline. Get parallel data from http://opus.lingfil.uu.se/.

Suggestion: start with the smallest possible corpus (for instance http://opus.lingfil.uu.se/RF.php which has only 151 sentences) in order to check the entire processing chain without waiting too much. Make sure you download the right format ("download plain text files (MOSES/GIZA++)") for the language pair of your choice. For instance, try EN/FR or EN/DE.

4.1. Follow carefully the instructions at http://www.statmt.org/moses/?n=Moses.Baseline for corpus preparation. Namely, use the Moses tools to perform tokenization, truecasing, and cleaning (removal of overly long sentences).
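The cleaning step matters because very long or badly length-mismatched sentence pairs hurt word alignment. The sketch below filters pairs in the spirit of Moses' clean-corpus-n.perl script (the thresholds are illustrative defaults, not the script's exact behavior):

```python
def clean_corpus(src_sents, tgt_sents, min_len=1, max_len=80, max_ratio=9.0):
    """Filter a sentence-aligned corpus: drop pairs that are empty,
    overly long, or badly mismatched in length."""
    kept = []
    for src, tgt in zip(src_sents, tgt_sents):
        ls, lt = len(src.split()), len(tgt.split())
        if not (min_len <= ls <= max_len and min_len <= lt <= max_len):
            continue  # empty or overly long sentence on either side
        if ls / lt > max_ratio or lt / ls > max_ratio:
            continue  # suspicious length ratio: probably a misalignment
        kept.append((src, tgt))
    return kept

# Toy corpus: the second pair is dropped because its target is too long
kept = clean_corpus(["das Haus ist groß", "a"],
                    ["the house is big", " ".join(["x"] * 100)])
```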

4.2. Learn a language model for the target language data, for instance with SRILM (see Lesson 5).

/home/srilm/bin/ngram-count -text TRAINDATAFILE -lm LMFILE

Move the LM file (binarized or not) to your working directory. Test the LM with the "query" command as explained in the tutorial. Again, SRILM, IRSTLM and KenLM share the same format, so you can create the LM with SRILM even if you compiled Moses with another tool (KenLM on Cygwin).
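The perplexity reported by the query tools is derived directly from the per-word log-probabilities. Assuming base-10 log-probabilities (SRILM's output convention), it is the geometric mean of the inverse word probabilities:

```python
def perplexity(logprobs):
    """Perplexity from per-word base-10 log-probabilities, as the LM
    'query' tools report it: 10 ** (-average log10 probability)."""
    return 10 ** (-sum(logprobs) / len(logprobs))

ppl = perplexity([-1.0, -2.0, -3.0])  # average log10 prob = -2, so ppl = 100
```

A lower perplexity on held-out text means the LM predicts that text better; comparing perplexities is a quick sanity check before plugging the LM into Moses.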

4.3. Learn a translation model with "mosesdecoder/scripts/training/train-model.perl" as explained in the tutorial. This is very fast for 150 sentences, but gives a very poor model. The larger the corpus, the longer the training time, but the better the model.

Look at the newly created moses.ini file and check its settings for the LM. To use KenLM, write:

[lmodel-file]
8 0 3 lm/europarl.srilm.gz

4.4. Translate some new sentences. Look at the training corpus so that you only use words that are known to the model (pay attention to their capitals too). How good are the translations?

If you didn't do it in Section 3, try the "trace" and "verbose" options of Moses. Try to manually tune the weights in moses.ini. Note that some factors now have weights for each subcomponent.

5. Tuning the weights of the factors of the Moses MT decoder

Get a new small parallel corpus in the same language pair to perform tuning of the weights in moses.ini, as explained at http://www.statmt.org/moses/?n=Moses.Baseline, section on Tuning. If point (4) above was fast, you can use a larger corpus. Tuning is typically the longest training stage because it involves translating the tuning set multiple times. Hence a tuning set much smaller than the training set is typically used.
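The following toy sketch shows why tuning is expensive: for every candidate weight setting, the whole tuning set must be re-translated and re-scored. Here `decode` and `score` are hypothetical stand-ins for the Moses decoder and the evaluation metric, and a one-dimensional grid search stands in for the real MERT optimization:

```python
def tune_weight(sources, references, decode, score, grid):
    """Toy analogue of weight tuning: re-translate the tuning set for
    every candidate weight value and keep the best-scoring one."""
    best_w, best_score = None, float("-inf")
    for w in grid:
        hypotheses = [decode(src, w) for src in sources]  # full re-decode
        s = score(hypotheses, references)
        if s > best_score:
            best_w, best_score = w, s
    return best_w, best_score

# Hypothetical stand-ins: 'decoding' uppercases the input if w >= 0.5,
# and the 'metric' is the fraction of exact matches with the references
decode = lambda src, w: src.upper() if w >= 0.5 else src
score = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)
best_w, best_score = tune_weight(["a", "b"], ["A", "B"],
                                 decode, score, [0.0, 0.5, 1.0])
```

The cost grows with both the tuning-set size and the number of weight settings tried, which is why the tuning set is kept small.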

6. Evaluating the results

Provide a test corpus (one sentence per line) to Moses and evaluate the quality of the automatic translations by measuring the BLEU score of the translated corpus. Use the implementation of the BLEU metric provided as mosesdecoder/scripts/generic/multi-bleu.perl – its arguments are one or more reference translations, one sentence per line, and the candidate translation (to be evaluated) is provided as an incoming stream (with ‘< filename’). The closer the domains of the training and test corpora are, the higher the BLEU score should be (typically in the 10-30% range). Testing on the training corpus should provide an unusually high BLEU score (e.g. 80-90%).
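multi-bleu.perl computes BLEU over the whole corpus with n-grams up to length 4. The sketch below shows the core idea on a single sentence pair, with n-grams only up to length 2: modified n-gram precision combined with a brevity penalty. It is an illustration of the metric, not the script's exact arithmetic:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions (here up to bigrams) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        # 'modified' precision: clip each n-gram count by its reference count
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * geo_mean

score = bleu("the house is big", "the house is big")  # identical: score 1.0
```

The count clipping explains why degenerate outputs like "the the the" score near zero even though every word appears in the reference.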

7. Extensions

7.1. New language pair with more data. You can use a language pair of your choice for which you find parallel data at http://opus.lingfil.uu.se/. Try now to learn a better translation model (using more data and tuning) and check your BLEU score. Don't start too large: increase the data size by a factor of 10 or 100 each time, to find out how long the entire process takes. Repeat the process with improved parameters (TM, LM, Moses options) and try to improve your BLEU score on your test data.

7.2 Streamlining the training process. Use the Experiment Management System (described in Sections 2.6.8 and 3.4 of the Moses manual) to simplify pre-processing and training (not tested on Cygwin). You will need to train the LM separately.

7.3 Possible topic for a course project. Build an MT system and test it on WMT 2011 data. Use training and tuning data as provided, but try to test only once (do not optimize on test data). Try to modify some advanced decoding parameters as described in Section 3.3 of the manual or online at Optimizing Moses. Training data (amount and domain) is still essential, as is computing power to speed up tuning and then decoding (consider the multi-threaded version). How does your BLEU score compare to the published results?

7.4 Possible topic for a course project. Combine the SMT decoder with Lucene to perform cross-lingual just-in-time information retrieval in another language. For instance, while you write in your native language, get results from English Wikipedia.
