Image by foxtongue

Reranking Challenge Problem 4

In homework 2 you learned to search for probable translations, but saw that this is only useful with a good probability model. In homework 3 you designed a metric that correlated (at least somewhat) with human assessments of machine translation. Armed with such a metric, you now have an objective way to measure a model’s usefulness. In this assignment we will give you some sentences from Russian news articles, and you will use the metric to improve a model that chooses from a set of possible English translations generated by a state-of-the-art machine translation system. Your challenge is to choose the best translations.

Getting Started

Get the latest changes from the homework repo.

git pull origin master

Or, get a fresh copy.

git clone https://github.com/alopez/en600.468.git

Under the reranker directory, you have a program that chooses a translation for each sentence from a list of candidates.

python rerank > english.out

The reranker reads candidate translations from the file data/dev+test.100best. Every candidate translation $e\in E(f)$ of an input sentence $f$ has an associated feature vector $h(e,f) = [ -\log p_{LM}(e)$ , $-\log p_{TM}(f|e)$ , $-\log p_{TM_{Lex}}(f|e)]$ . The reranker takes a parameter vector $\theta$ whose length is equal to that of $h(e,f)$ . By default, $\theta = [-1, -\frac{1}{2}, -\frac{1}{2}]$ . For each $f$ , the reranker returns $\hat{e}$ according to the following decision function.

$$\hat{e} = \arg\max_{e\in E(f)} \theta \cdot h(e,f)$$

To evaluate translations on the development set, compute BLEU score against their reference translations.

python compute-bleu < english.out

What’s the best you could you do by picking other sentences from the list? To give you an idea, we’ve given you an oracle for the devlopment data. Using knowledge of the reference translation, it chooses candidate sentences that maximize the BLEU score.

python oracle | python compute-bleu

The oracle should convince you that it is possible to do much better than the default reranker. Maybe you can improve it by changing the parameter vector $\theta$ . Do this using command-line arguments to rerank. Try a few different settings. How close can you get to the oracle BLEU score?

The Challenge

You can improve the parameter vector by trial and error, but that won’t be very efficient. To really improve the system you need automation. There are two components you can add: informative features that correlate with BLEU, and effective learning algorithms that optimize $\theta$ for BLEU. Your task is to improve translation quality on the blind test set as much as possible by improving these components.

Implementing a version of MERT or PRO along with some simple feature engineering will be enough to beat our baseline and earn full credit. However, there will still be substantial room for improvement. Here are some ideas:

Add a feature that counts the number of words.
Add a feature to count words that appear to be untranslated.
Learn new features from this word-aligned Russian-English text.
Develop a syntactic language model.
Add a neural language model.
Find a consensus translation using minimum Bayes risk.

But the sky’s the limit! You can try anything you want, as long as you follow the ground rules.

Ground Rules

You can work in independently or in groups of up to three, under these conditions:
1. You must announce the group publicly on piazza.
2. You agree that everyone in the group will receive the same grade on the assignment.
3. You can add people or merge groups at any time before the assignment is due. You cannot drop people from your group once you’ve added them. We encourage collaboration, but we will not adjudicate Rashomon-style stories about who did or did not contribute.
You must turn in three things:
1. Your translations of dev+test.src, selected from dev+test.100best. Upload your results to the leaderboard submission site. You can upload new output as often as you like, up until the assignment deadline. You will be able to see your results on test data after the deadline.
2. Your code. Send us a URL from which we can get the code and git revision history (a link to a tarball will suffice, but you’re free to send us a github link if you don’t mind making your code public). This is due at the deadline: when you upload your final answer, send us the code. You are free to extend the code we provide or roll your own in whatever langugage you like, but the code should be self-contained, self-documenting, and easy to use.
3. A clear, mathematical description of your algorithm and its motivation written in scientific style. This needn’t be long, but it should be clear enough that one of your fellow students could re-implement it exactly.
You do not need any other data than what we provide. You are free to use any code or software you like, except for those expressly intended to generate or rerank translation output. You must write your own reranker. If you want to use machine learning libraries, taggers, parsers, or any other off-the-shelf resources, go nuts. If you aren’t sure whether something is permitted, ask us. If you want to do system combination, join forces with your classmates.

Credits: This assignment was developed by Adam Lopez, Matt Post, and Chris Callison-Burch. Chris Dyer made many improvements to this assignment.