In homework 2 you learned to search for probable translations, but saw that this is only useful with a good probability model. In homework 3 you designed a metric that correlated (at least somewhat) with human assessments of machine translation. Armed with such a metric, you now have an objective way to measure a model’s usefulness. In this assignment we will give you some sentences from Russian news articles, and you will use the metric to improve a model that chooses from a set of possible English translations generated by a state-of-the-art machine translation system. Your challenge is to choose the best translations.
To begin, download the Homework 4 starter kit.
You may choose to develop either locally or on Penn’s servers. For the latter, we
recommend using the Biglab machines, whose memory and runtime restrictions are much
less stringent than those on Eniac. The Biglab servers can be accessed directly
using the command ssh PENNKEY@biglab.seas.upenn.edu, or from Eniac using the
command ssh biglab.
In the downloaded directory, you have a program that chooses a translation for each sentence from a list of candidates.
python rerank > english.out
The reranker reads candidate translations from the file
data/dev+test.100best. Every candidate translation $e$ of
an input sentence $f$ has an associated feature vector
$h(e, f)$. The reranker takes a parameter vector $\theta$ whose length is equal to
that of $h(e, f)$; unless you specify otherwise, it uses a default value of $\theta$. For
each $f$, the reranker returns $\hat{e}$ according to the
following decision function.
$$\hat{e} = \arg\max_{e} \; \theta \cdot h(e, f)$$
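A linear reranker of this kind scores each candidate by the dot product of the parameter vector with the candidate's feature vector and returns the highest-scoring one. The sketch below illustrates that idea only. It is not the starter kit's rerank program, and the function name, the in-memory candidate format, and the feature values are all made up for the example.

```python
# Hypothetical sketch of a linear reranker; NOT the starter kit's rerank code.

def rerank(candidates, theta):
    """Return the translation whose feature vector scores highest under theta.

    candidates: list of (translation, feature_vector) pairs for ONE source sentence.
    theta: weight vector with the same length as every feature vector.
    """
    def score(features):
        if len(features) != len(theta):
            raise ValueError("feature vector and theta must have equal length")
        return sum(w * h for w, h in zip(theta, features))

    best_translation, _ = max(candidates, key=lambda pair: score(pair[1]))
    return best_translation


# Made-up candidates with made-up three-dimensional feature vectors.
example = [
    ("the cat sat", [-2.0, -1.0, -3.0]),
    ("a cat sat", [-1.5, -2.5, -1.0]),
]
choice_uniform = rerank(example, [1.0, 1.0, 1.0])  # weights all features equally
choice_second = rerank(example, [0.0, 1.0, 0.0])   # looks only at the second feature
```

Changing the weights changes which candidate wins, which is exactly the lever the rest of the assignment asks you to tune.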
To evaluate translations on the development set, compute their BLEU score against the reference translations.
python compute-bleu < english.out
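For intuition about what this score measures, here is a minimal sketch of corpus-level BLEU: the geometric mean of clipped n-gram precisions (n = 1 to 4), multiplied by a brevity penalty. It assumes one reference per sentence, pre-tokenized input, and no smoothing. The compute-bleu script in the starter kit may differ in these details, so treat the function names and behavior here as illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU for parallel lists of token lists.

    Assumes exactly one reference per hypothesis (an assumption of this sketch).
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection clips each n-gram count at its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return brevity_penalty * math.exp(log_precision)
```

Because the counts are pooled over the whole corpus before the precisions are computed, the score of a set of translations is not simply the average of per-sentence scores.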
What’s the best you could do by picking other sentences from the list? To give you an idea, we’ve given you an oracle for the development data. Using knowledge of the reference translation, it chooses candidate sentences that maximize the BLEU score.
python oracle | python compute-bleu
The oracle should convince you that it is possible to do much
better than the default reranker. Maybe you can improve it by changing
the parameter vector. Do this using command-line arguments to rerank.
Try a few different settings. How close can you get to the oracle BLEU score?
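Trying settings by hand can be automated with a small grid search: enumerate weight vectors, rerank with each, and keep the one that scores best on development data. The sketch below is a toy version under stated assumptions. The helper names and the data are hypothetical, and exact-match accuracy stands in for BLEU so that the example stays self-contained.

```python
import itertools


def pick(candidates, theta):
    """Return the translation with the highest dot-product score."""
    return max(candidates, key=lambda c: sum(w * h for w, h in zip(theta, c[1])))[0]


def exact_match(outputs, references):
    """Toy stand-in for BLEU: fraction of outputs identical to the reference."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)


def grid_search(nbest_lists, references, metric, grid, dims):
    """Try every weight vector in grid^dims and keep the first best-scoring one."""
    best_theta, best_score = None, float("-inf")
    for theta in itertools.product(grid, repeat=dims):
        outputs = [pick(cands, theta) for cands in nbest_lists]
        score = metric(outputs, references)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


# Made-up development data: two sentences, two candidates each, two features.
nbest_lists = [
    [("bad one", [0.0, 1.0]), ("good one", [1.0, 0.0])],
    [("bad two", [0.5, 2.0]), ("good two", [2.0, 0.5])],
]
references = ["good one", "good two"]
theta, score = grid_search(nbest_lists, references, exact_match, [-1.0, 0.0, 1.0], dims=2)
```

The number of weight vectors grows exponentially with the number of features, so a search like this is only practical for a handful of dimensions.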
You can improve the parameter vector by trial and error, but that won’t be very efficient. To really improve the system you need automation. There are two components you can add: informative features that correlate with BLEU, and effective learning algorithms that optimize for BLEU. Your task is to improve translation quality on the blind test set as much as possible by improving these components.
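One family of learning algorithms treats tuning as pairwise ranking: sample pairs of candidates for the same sentence, and adjust the weights whenever the model ranks the pair differently from the metric. The sketch below is a simplified version in the spirit of PRO, using perceptron updates in place of an off-the-shelf binary classifier. It assumes each candidate already carries a sentence-level quality score, and the function name, hyperparameters, and data are hypothetical.

```python
import random


def pairwise_rank_train(nbest_lists, theta, epochs=5, samples=50,
                        min_gap=0.05, lr=0.1, seed=0):
    """Learn weights from sampled candidate pairs with perceptron updates.

    nbest_lists: one list per sentence of (feature_vector, metric_score) pairs,
    where metric_score is an assumed sentence-level quality score.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(epochs):
        for cands in nbest_lists:
            if len(cands) < 2:
                continue
            for _ in range(samples):
                (h1, g1), (h2, g2) = rng.sample(cands, 2)
                if abs(g1 - g2) < min_gap:
                    continue  # skip pairs the metric barely distinguishes
                if g1 < g2:
                    h1, h2 = h2, h1  # make h1 the better candidate
                diff = [a - b for a, b in zip(h1, h2)]
                # Update only when the model fails to prefer the better candidate.
                if sum(w * d for w, d in zip(theta, diff)) <= 0:
                    theta = [w + lr * d for w, d in zip(theta, diff)]
    return theta


# Made-up data: quality rises with feature 0 and falls with feature 1.
train = [
    [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1), ([0.5, 0.5], 0.5)],
    [([2.0, 1.0], 0.8), ([1.0, 2.0], 0.2)],
]
learned = pairwise_rank_train(train, [0.0, 0.0])
```

On real n-best lists the pairs are rarely this cleanly separable, so the choice of sampling scheme, gap threshold, and classifier matters much more than it does here.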
Implementing a version of MERT or PRO along with some simple feature engineering should be enough to beat our baseline. However, there will still be substantial room for improvement. Here are some ideas:
But the sky’s the limit! You can try anything you want, as long as you follow the ground rules.
Your translations of dev+test.src, selected from dev+test.100best.
Upload your results with the command turnin -c cis526 -p hw4 hw4.txt from
any Eniac or Biglab machine. You can upload new output as often as you like,
up until the assignment deadline.
You will only be able to see your results on test data after the deadline.
Your code, uploaded using the command turnin -c cis526 -p hw4-code file1 file2 ...
This is due 24 hours after the leaderboard closes.
You are free to extend the code we provide or write your own in whatever
language you like, but the code should be self-contained,
self-documenting, and easy to use.
A report describing the models you designed and experimented with, uploaded
using the command turnin -c cis526 -p hw4-report hw4-report.pdf. This is
due 24 hours after the leaderboard closes. Your report does not need to be
long, but it should at minimum address the following points:
Motivation: Why did you choose the models you experimented with?
Description of models or algorithms: Describe mathematically or algorithmically what you did. Your descriptions should be clear enough that someone else in the class could implement them.
Results: You most likely experimented with various settings of any models you implemented. We want to know how you decided on the final model that you submitted for us to grade. What parameters did you try, and what were the results? Most importantly: what did you learn?
Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.
Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.
Credits: This assignment was developed by Adam Lopez, Matt Post, and Chris Callison-Burch. Chris Dyer made many improvements to this assignment.