In homework 2 you learned to search for probable translations, but saw that this is only useful with a good probability model. In homework 3 you designed a metric that correlated (at least somewhat) with human assessments of machine translation. Armed with such a metric, you now have an objective way to measure a model’s usefulness. In this assignment we will give you some sentences from Russian news articles, and you will use the metric to improve a model that chooses from a set of possible English translations generated by a state-of-the-art machine translation system. Your challenge is to choose the best translations.
Get the latest changes from the homework repo.
git pull origin master
Or, get a fresh copy.
git clone https://github.com/alopez/en600.468.git
Under the reranker directory, you have a program that chooses a translation for each sentence from a list of candidates.
python rerank > english.out
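The reranker reads candidate translations from the file data/dev+test.100best. Every candidate translation $e$ of an input sentence $f$ has an associated feature vector $h(e,f) = \langle h_1(e,f), \ldots, h_n(e,f) \rangle$. The reranker takes a parameter vector $w$ whose length is equal to that of $h(e,f)$. By default, $w$ takes the default values of rerank's command-line arguments. For each $f$, the reranker returns $\hat{e}$ according to the following decision function:

$$\hat{e} = \arg\max_{e} \; w \cdot h(e,f)$$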
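As a rough illustration of this decision function, here is a minimal Python sketch of a linear reranker; it is not the provided rerank script. The parse_line and load_kbest helpers are hypothetical and assume that each line of data/dev+test.100best has the form "sent_id ||| hypothesis ||| name1=v1 name2=v2 ...", so check the actual file before relying on them.

```python
from typing import Dict, List, Sequence, Tuple

# A candidate is (translation string, feature vector h(e, f)).
Candidate = Tuple[str, List[float]]


def parse_line(line: str) -> Tuple[int, str, List[float]]:
    """Parse one k-best line.

    ASSUMED format: "sent_id ||| hypothesis ||| name1=v1 name2=v2 ...".
    """
    sent_id, hyp, feats = (part.strip() for part in line.split("|||"))
    values = [float(tok.split("=", 1)[1]) for tok in feats.split()]
    return int(sent_id), hyp, values


def load_kbest(path: str) -> Dict[int, List[Candidate]]:
    """Group the candidates of each source sentence by sentence id."""
    kbest: Dict[int, List[Candidate]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sent_id, hyp, h = parse_line(line)
                kbest.setdefault(sent_id, []).append((hyp, h))
    return kbest


def score(w: Sequence[float], h: Sequence[float]) -> float:
    """Dot product w . h(e, f)."""
    if len(w) != len(h):
        raise ValueError("w and h(e, f) must have the same length")
    return sum(wi * hi for wi, hi in zip(w, h))


def rerank(kbest: Dict[int, List[Candidate]], w: Sequence[float]) -> Dict[int, str]:
    """For each source sentence f, return the candidate e maximizing w . h(e, f)."""
    return {
        sent_id: max(cands, key=lambda c: score(w, c[1]))[0]
        for sent_id, cands in kbest.items()
    }


if __name__ == "__main__":
    toy = {0: [("the cat sat", [-10.0, -5.0]), ("cat the sat", [-14.0, -3.0])]}
    print(rerank(toy, [1.0, 1.0]))  # {0: 'the cat sat'}
    print(rerank(toy, [0.0, 1.0]))  # {0: 'cat the sat'}
```

The toy example shows that changing $w$ alone can change which candidate wins, which is exactly the lever you will be tuning.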
To evaluate translations on the development set, compute BLEU score against their reference translations.
python compute-bleu < english.out
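If you want BLEU inside your own scripts (for example, to compare weight settings without calling compute-bleu each time), the sketch below computes corpus-level BLEU as the brevity penalty times the geometric mean of clipped n-gram precisions for n = 1..4. It assumes whitespace tokenization and a single reference per sentence, so its numbers may not match compute-bleu exactly.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_stats(hyp: Sequence[str], ref: Sequence[str], max_n: int = 4) -> List[int]:
    """[len(hyp), len(ref), match_1, total_1, ..., match_N, total_N]."""
    stats = [len(hyp), len(ref)]
    for n in range(1, max_n + 1):
        clipped = ngrams(hyp, n) & ngrams(ref, n)  # min of the two counts per n-gram
        stats.append(sum(clipped.values()))
        stats.append(max(len(hyp) + 1 - n, 0))     # number of n-grams in hyp
    return stats


def corpus_bleu(pairs: List[Tuple[str, str]], max_n: int = 4) -> float:
    """Corpus BLEU over (hypothesis, reference) pairs, summing statistics first."""
    totals = [0] * (2 + 2 * max_n)
    for hyp, ref in pairs:
        for k, s in enumerate(bleu_stats(hyp.split(), ref.split(), max_n)):
            totals[k] += s
    c, r = totals[0], totals[1]
    matches, counts = totals[2::2], totals[3::2]
    if c == 0 or any(m == 0 for m in matches):
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, counts)) / max_n
    log_bp = min(0.0, 1.0 - r / c)  # brevity penalty only applies when c < r
    return math.exp(log_bp + log_prec)


if __name__ == "__main__":
    print(corpus_bleu([("the cat sat on the mat", "the cat sat on the mat")]))  # 1.0
```

Note that the statistics are summed over the whole corpus before the precisions are computed; averaging per-sentence BLEU scores gives a different number.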
What’s the best you could do by picking other sentences from the list? To give you an idea, we’ve given you an oracle for the development data. Using knowledge of the reference translations, it chooses candidate sentences that maximize the BLEU score.
python oracle | python compute-bleu
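One way an oracle like this could work is sketched below: for each sentence, pick the candidate with the highest sentence-level BLEU against its reference. Unsmoothed sentence BLEU is often zero, so this version adds one to the matched and total counts for n ≥ 2; that smoothing is an assumption, and the provided oracle may select candidates differently.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp: Sequence[str], ref: Sequence[str], max_n: int = 4) -> float:
    """Sentence BLEU with add-one smoothing on the n >= 2 precisions (an assumption)."""
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        matches = sum((ngrams(hyp, n) & ngrams(ref, n)).values())
        total = max(len(hyp) + 1 - n, 0)
        if n > 1:
            matches, total = matches + 1, total + 1
        if matches == 0:
            return 0.0
        log_prec += math.log(matches / total) / max_n
    log_bp = min(0.0, 1.0 - len(ref) / len(hyp))
    return math.exp(log_bp + log_prec)


def oracle(kbest: Dict[int, List[str]], refs: Dict[int, str]) -> Dict[int, str]:
    """For each sentence, choose the candidate with the best sentence BLEU."""
    return {
        sent_id: max(cands, key=lambda e: sentence_bleu(e.split(), refs[sent_id].split()))
        for sent_id, cands in kbest.items()
    }
```

Maximizing each sentence's BLEU independently is only a greedy approximation to maximizing corpus BLEU, but it is cheap and usually gets close.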
The oracle should convince you that it is possible to do much better than the default reranker. Maybe you can improve it by changing the parameter vector $w$. Do this using command-line arguments to rerank. Try a few different settings. How close can you get to the oracle BLEU score?
You can improve the parameter vector by trial and error, but that won’t be very efficient. To really improve the system you need automation. There are two components you can add: informative features that correlate with BLEU, and effective learning algorithms that optimize for BLEU. Your task is to improve translation quality on the blind test set as much as possible by improving these components.
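As one illustration of feature engineering, the sketch below computes two hypothetical extra features for a candidate: its length in words, and the number of tokens containing Cyrillic characters. The reasoning is that BLEU's brevity penalty punishes short output, and a Russian word copied through untranslated will never match an English reference. Neither feature is part of the provided data; if you append them to $h(e,f)$, $w$ must grow by two entries as well.

```python
import re
from typing import List, Sequence

# Matches any character in the basic Cyrillic block (U+0400 to U+04FF).
CYRILLIC = re.compile("[\u0400-\u04FF]")


def extra_features(hypothesis: str) -> List[float]:
    """Hypothetical features: [word count, number of untranslated (Cyrillic) tokens]."""
    tokens = hypothesis.split()
    untranslated = sum(1 for tok in tokens if CYRILLIC.search(tok))
    return [float(len(tokens)), float(untranslated)]


def extend(h: Sequence[float], hypothesis: str) -> List[float]:
    """Append the extra features to an existing feature vector h(e, f)."""
    return list(h) + extra_features(hypothesis)


if __name__ == "__main__":
    print(extra_features("the дом is big"))  # [4.0, 1.0]
```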
Implementing a version of MERT or PRO along with some simple feature engineering will be enough to beat our baseline and earn full credit. However, there will still be substantial room for improvement. Here are some ideas:
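A minimal sketch of PRO (pairwise ranking optimization): for each sentence, sample pairs of candidates, keep the pairs whose metric scores differ the most, and train a linear classifier to separate "better minus worse" feature differences from "worse minus better" ones. The classifier's weights then serve as $w$. This sketch uses a simple perceptron as the classifier; a logistic regression or similar linear model would also fit. The metric argument (for example, a smoothed sentence BLEU against the dev references) and the default values of tau, alpha, and xi are assumptions you should tune.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# A candidate is (translation string, feature vector h(e, f)).
Candidate = Tuple[str, Sequence[float]]
# A training example is (feature difference, +1 or -1).
Example = Tuple[List[float], int]


def pro_pairs(
    kbest: Dict[int, List[Candidate]],
    metric: Callable[[int, str], float],  # score of candidate e for sentence sent_id
    tau: int = 5000,      # candidate pairs sampled per sentence
    alpha: float = 0.05,  # minimum metric difference for a pair to be kept
    xi: int = 50,         # pairs kept per sentence (largest differences first)
    seed: int = 0,
) -> List[Example]:
    rng = random.Random(seed)
    examples: List[Example] = []
    for sent_id, cands in kbest.items():
        if len(cands) < 2:
            continue
        scores = [metric(sent_id, e) for e, _ in cands]
        sampled = []
        for _ in range(tau):
            i, j = rng.randrange(len(cands)), rng.randrange(len(cands))
            if abs(scores[i] - scores[j]) > alpha:
                sampled.append((abs(scores[i] - scores[j]), i, j))
        sampled.sort(reverse=True)
        for _, i, j in sampled[:xi]:
            if scores[i] < scores[j]:  # orient so that candidate i is the better one
                i, j = j, i
            delta = [a - b for a, b in zip(cands[i][1], cands[j][1])]
            examples.append((delta, 1))                 # better minus worse -> +1
            examples.append(([-d for d in delta], -1))  # worse minus better -> -1
    return examples


def perceptron(examples: List[Example], dim: int, epochs: int = 10,
               lr: float = 1.0, seed: int = 0) -> List[float]:
    """Train a linear classifier on PRO examples; its weights are used as w."""
    rng = random.Random(seed)
    data = list(examples)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            if y * sum(wk * xk for wk, xk in zip(w, x)) <= 0:  # misclassified
                w = [wk + lr * y * xk for wk, xk in zip(w, x)]
    return w


if __name__ == "__main__":
    toy = {0: [("good", [1.0, 0.0]), ("bad", [0.0, 1.0])],
           1: [("good", [0.9, 0.1]), ("bad", [0.2, 0.8])]}
    w = perceptron(pro_pairs(toy, lambda i, e: 1.0 if e == "good" else 0.0), dim=2)
    print(w)  # the first weight ends up larger than the second
```

Because the classifier is trained on feature differences, a $w$ that ranks the better candidate of each pair higher is exactly what the argmax decision function needs.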
But the sky’s the limit! You can try anything you want, as long as you follow the ground rules.
dev+test.src, selected from dev+test.100best.
Upload your results to the leaderboard submission site. You can upload new output as often as you like, up until the assignment deadline.
You will be able to see your results on test data after the deadline.
Credits: Chris Dyer made many improvements to this assignment.