Rosetta stone (credit: calotype46)

Leaderboard submission due 11:59 pm, Thursday, March 26th. Code and report due 11:59 pm, Friday, March 27th.

Reranking: Homework 4

In homework 2 you learned to search for probable translations, but saw that this is only useful with a good probability model. In homework 3 you designed a metric that correlated (at least somewhat) with human assessments of machine translation. Armed with such a metric, you now have an objective way to measure a model’s usefulness. In this assignment we will give you some sentences from Russian news articles, and you will use the metric to improve a model that chooses from a set of possible English translations generated by a state-of-the-art machine translation system. Your challenge is to choose the best translations.

Getting Started

To begin, download the Homework 4 starter kit. You may develop either locally or on Penn’s servers. For the latter, we recommend the Biglab machines, whose memory and runtime restrictions are much less stringent than those on Eniac. The Biglab servers can be accessed directly using the command ssh PENNKEY@biglab.seas.upenn.edu, or from Eniac using the command ssh biglab.

In the downloaded directory, you have a program that chooses a translation for each sentence from a list of candidates.

python rerank > english.out

The reranker reads candidate translations from the file data/dev+test.100best. Every candidate translation $e\in E(f)$ of an input sentence $f$ has an associated feature vector $h(e,f) = [-\log p_{LM}(e),\ -\log p_{TM}(f|e),\ -\log p_{TM_{Lex}}(f|e)]$. The reranker takes a parameter vector $\theta$ whose length is equal to that of $h(e,f)$. By default, $\theta = [-1, -\frac{1}{2}, -\frac{1}{2}]$. For each $f$, the reranker returns $\hat{e}$ according to the following decision function.

$$\hat{e} = \arg\max_{e\in E(f)} \theta \cdot h(e,f)$$
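The decision function can be illustrated with a minimal sketch. This is not the starter kit's rerank program: the names `score` and `rerank`, the in-memory candidate representation, and the toy feature values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed representation: (English hypothesis, feature vector h(e, f)).
Candidate = Tuple[str, Sequence[float]]


def score(theta: Sequence[float], h: Sequence[float]) -> float:
    """Dot product theta . h(e, f)."""
    return sum(t * x for t, x in zip(theta, h))


def rerank(candidates: List[Candidate], theta: Sequence[float]) -> str:
    """Return the hypothesis e in E(f) that maximizes theta . h(e, f)."""
    if not candidates:
        raise ValueError("empty candidate list")
    best_text, _ = max(candidates, key=lambda c: score(theta, c[1]))
    return best_text


# Toy n-best list for one source sentence; the numbers are made up.
theta = [-1.0, -0.5, -0.5]
nbest = [
    ("the cat sat", [10.0, 4.0, 6.0]),  # score = -10 - 2 - 3 = -15
    ("a cat sat", [8.0, 5.0, 5.0]),     # score = -8 - 2.5 - 2.5 = -13
    ("cat the sat", [12.0, 3.0, 3.0]),  # score = -12 - 1.5 - 1.5 = -15
]
print(rerank(nbest, theta))  # prints "a cat sat"
```

Because the features are negative log probabilities, the negative default weights make the argmax prefer candidates with higher model probability.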

To evaluate translations on the development set, compute BLEU score against their reference translations.

python compute-bleu < english.out
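For intuition about what is being measured, here is a minimal sketch of standard corpus-level BLEU. It does not reproduce compute-bleu: it assumes one reference per sentence, whitespace tokenization, n-grams up to length 4, and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with a single reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Counter intersection keeps the minimum count (clipped matches).
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())
            totals[n - 1] += sum(h_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Log brevity penalty: zero unless the output is shorter than the reference.
    log_bp = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_bp + log_precision)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # 1.0
```

Counts are accumulated over the whole corpus before the precisions are computed, so the score is not an average of per-sentence scores.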

What’s the best you could do by picking other sentences from the list? To give you an idea, we’ve given you an oracle for the development data. Using knowledge of the reference translation, it chooses candidate sentences that maximize the BLEU score.

python oracle | python compute-bleu

The oracle should convince you that it is possible to do much better than the default reranker. Maybe you can improve it by changing the parameter vector $\theta$. Do this using command-line arguments to rerank. Try a few different settings. How close can you get to the oracle BLEU score?

The Challenge

You can improve the parameter vector by trial and error, but that won’t be very efficient. To really improve the system you need automation. There are two components you can add: informative features that correlate with BLEU, and effective learning algorithms that optimize $\theta$ for BLEU. Your task is to improve translation quality on the blind test set as much as possible by improving these components.
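As one illustration of the learning component, the sketch below follows the pairwise-ranking idea behind PRO: sample pairs of candidates for the same sentence, and adjust $\theta$ until the candidate with the better metric score also gets the higher model score. It makes several simplifying assumptions: each candidate already carries a sentence-level quality score (for example, a smoothed sentence-level BLEU), a plain perceptron update stands in for a full classifier, and pairs are filtered by a fixed gap threshold. All names and hyperparameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Assumed representation: (feature vector h(e, f), sentence-level metric score).
Scored = Tuple[Sequence[float], float]


def dot(theta: Sequence[float], h: Sequence[float]) -> float:
    return sum(t * x for t, x in zip(theta, h))


def pro_train(nbest_lists: List[List[Scored]], dim: int, epochs: int = 5,
              samples: int = 100, min_gap: float = 0.05,
              lr: float = 0.1, seed: int = 0) -> List[float]:
    """Learn theta from sampled candidate pairs with perceptron updates."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(epochs):
        for cands in nbest_lists:
            if len(cands) < 2:
                continue
            for _ in range(samples):
                (h1, g1), (h2, g2) = rng.sample(cands, 2)
                if abs(g1 - g2) < min_gap:
                    continue  # skip pairs whose metric scores are too close
                if g1 < g2:
                    h1, h2 = h2, h1  # make h1 the better candidate
                diff = [a - b for a, b in zip(h1, h2)]
                # Update only when the model fails to rank the pair correctly.
                if dot(theta, diff) <= 0:
                    theta = [t + lr * d for t, d in zip(theta, diff)]
    return theta


# Toy data with made-up numbers: two n-best lists, two features each.
data = [
    [([10.0, 4.0], 0.2), ([8.0, 5.0], 0.6), ([12.0, 3.0], 0.1)],
    [([5.0, 2.0], 0.3), ([3.0, 3.0], 0.7)],
]
theta = pro_train(data, dim=2)
for cands in data:
    best = max(cands, key=lambda c: dot(theta, c[0]))
    print(best[1])  # metric score of the candidate the learned theta selects
```

Training on feature differences means the learned weights plug directly into the same argmax decision function the reranker already uses.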

Implementing a version of MERT or PRO along with some simple feature engineering should be enough to beat our baseline. However, there will still be substantial room for improvement. Here are some ideas:

Add a feature that counts the number of words.
Add a feature to count words that appear to be untranslated.
Learn new features from this word-aligned Russian-English text.
Develop a syntactic language model.
Add a neural language model.
Find a consensus translation using minimum Bayes risk.
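The first two ideas can be sketched as follows. The untranslated-word heuristic is an assumption: since the source sentences are Russian, it treats any token containing a Cyrillic character as untranslated. The function names are hypothetical.

```python
import re
from typing import List, Sequence

# Characters in the basic Cyrillic Unicode block (U+0400 to U+04FF).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def word_count(hypothesis: str) -> float:
    """Number of whitespace-separated tokens in the hypothesis."""
    return float(len(hypothesis.split()))


def untranslated_count(hypothesis: str) -> float:
    """Number of tokens containing Cyrillic characters (a heuristic)."""
    return float(sum(1 for tok in hypothesis.split() if CYRILLIC.search(tok)))


def extend_features(hypothesis: str, h: Sequence[float]) -> List[float]:
    """Append the two new feature values to an existing feature vector."""
    return list(h) + [word_count(hypothesis), untranslated_count(hypothesis)]


print(extend_features("the президент spoke", [1.0, 2.0, 3.0]))
# [1.0, 2.0, 3.0, 3.0, 1.0]
```

Each added feature needs a matching weight, so $\theta$ must grow to the same length as the extended $h(e,f)$.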

But the sky’s the limit! You can try anything you want, as long as you follow the ground rules.

Ground Rules

You must work independently on this assignment.
You must turn in three things:
1. Your translations of dev+test.src, selected from dev+test.100best. Upload your results with the command turnin -c cis526 -p hw4 hw4.txt from any Eniac or Biglab machine. You can upload new output as often as you like, up until the assignment deadline. You will only be able to see your results on test data after the deadline.
2. Your code, uploaded using the command turnin -c cis526 -p hw4-code file1 file2 .... This is due 24 hours after the leaderboard closes. You are free to extend the code we provide or write your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
3. A report describing the models you designed and experimented with, uploaded using the command turnin -c cis526 -p hw4-report hw4-report.pdf. This is due 24 hours after the leaderboard closes. Your report does not need to be long, but it should at minimum address the following points:
  - Motivation: Why did you choose the models you experimented with?
  - Description of models or algorithms: Describe mathematically or algorithmically what you did. Your descriptions should be clear enough that someone else in the class could implement them.
  - Results: You most likely experimented with various settings of any models you implemented. We want to know how you decided on the final model that you submitted for us to grade. What parameters did you try, and what were the results? Most importantly: what did you learn?
  Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.
  
  Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.
You do not need any data other than what we provide. You are free to use any code or software you like, except for tools expressly intended to generate or rerank translation output. You must write your own reranker. If you want to use machine learning libraries, taggers, parsers, or any other off-the-shelf resources, feel free to do so. If you aren’t sure whether something is permitted, ask us. If you want to do system combination, join forces with your classmates.

Credits: This assignment was developed by Adam Lopez, Matt Post, and Chris Callison-Burch. Chris Dyer made many improvements to this assignment.