Rosetta stone (credit; calotype46)

You should try to get an AER score below 35 for Homework 1

Leaderboard setup

For this course, we use an automated homework submission system built on Google App Engine. One component of this system is the leaderboard, which introduces a competitive element into the assignments with realtime display of student submissions according to their scores on assignment metrics. Participation in the leaderboard is optional, and submissions are anonymized with a handle of your choosing.

You should follow the steps in this document to make sure that everything is working for you.

Visit the course submission page at jhu-mt-2015.appspot.com. You will need a Google account, if you don’t already have one; use it to sign in.
This system is used to upload all homework assignments, but participation in the leaderboard is optional. To remove yourself from the leaderboard, uncheck the checkbox and click “Update”. Otherwise, choose a handle, or keep the default (your Google nickname).
Assignment 0 is a way for you to test the leaderboard functionality. Do this right away.
Assignment 0 expects a file containing a single number. Use the file submission dialog to upload your submissions. The leaderboard displays a simple nonmonotonic function of this number, and sorts the entries according to the highest value.
Now you are ready to work on the actual assignment - which is assignment 1.

Alignment Challenge Problem 1

Aligning words is a key task in machine translation. We start with a large parallel corpus of aligned sentences. For example, we might have the following sentence pair from the proceedings of the bilingual Canadian parliament:

le droit de permis passe donc de $ 25 à $ 500.

we see the licence fee going up from $ 25 to $ 500.

Getting documents aligned at the sentence level like this is relatively easy: we can use paragraph boundaries and cues like the length and order of each sentence. But to learn a translation model we need alignments at the word level. That’s where you come in. Your challenge is to write a program that aligns words automatically. For example, given the sentence above, your program would ideally output these pairs:

le – the, droit – fee, permis – license, passe – going, passe – up, donc – from, $ – $, 25 – 25, à – to, $ – $, 50 – 50

Your program can leave words unaligned (e.g. we and see) or multiply aligned (e.g. passe aligned to going up). It will be faced with difficult choices. Suppose it sees this sentence pair:

I want to make it clear that we have to let this issue come to a vote today.

il est donc essentiel que cette question fasse le objet de un vote aujourd’ hui .

Your program must make a choice about how to align the words of the non-literal translations I want to make it clear and il est donc essentiel. Even experienced bilinguals will disagree on examples like this. So word alignment does not capture every nuance, but it is still very useful.

Getting Started

You must have git and python (2.7) on your system to run the assignments. Once you’ve confirmed this, run this command:

git clone https://github.com/alopez/en600.468.git

In the aligner directory you will find a python program called align, which contains a complete but very simple alignment algorithm. For every word, it computes the set of sentences that the word appears in. Intuititvely, word pairs that appear in similar sets of sentences are likely to be translations. Our aligner first computes the similarity of these sets with Dice’s coefficient. Given sets $X$ and $Y$ , Dice’s coefficient is:

$$\delta(X,Y) = \frac{2 \times |X \cap Y|}{|X| + |Y|}$$

For any two sets $X$ and $Y$ , $\delta(X,Y)$ will be a number between 0 and 1. The baseline aligner will align any word pair with a coefficient over 0.5. Run it on 1000 sentences:

python align -n 1000 > dice.a

This command stores the output in dice.a. To compute accuracy, run:

python score-alignments < dice.a

This compares the alignments against human-produced alignments, computing alignment error rate, which balances precision and recall. It will also show you the comparison in a grid. Look at the terrible output of this heuristic method – it’s better than chance, but not any good. Try training on 10,000 sentences:

python align -n 10000 | python score-alignments

Performance should improve, but only slightly! Try changing the threshold for alignment. How does this affect alignment error rate?

The Challenge

Your task is to improve the alignment error rate as much as possible. It shouldn’t be hard: you’ve probably noticed that thresholding a Dice coefficient is a bad idea because alignments don’t compete against one another. A good way to correct this is with a probabilistic model like IBM Model 1. It forces all of the English words in a sentence to compete as the explanation for each foreign word.

Formally, IBM Model 1 is a probabilistic model that generates each word of the foreign sentence ${\bf f}$ independently, conditioned on some word in the English sentence ${\bf e}$ . Given ${\bf f}$ , the joint probability of an alignment ${\bf a}$ and translation ${\bf e}$ factors across words: $P({\bf f}, {\bf a} | {\bf e}) = \prod_i P(a_i = j | |{\bf e}|) \times P(f_i | e_j)$ . In Model 1, we fix $P(a_i = j | |{\bf e}|)$ to be uniform (i.e. equal to $\frac{1}{|{\bf e}|}$ ), so this probability depends only on the word translation parameters $P(f | e)$ . But where do thes parameters come from? You will first learn them from the data using expectation maximization (EM), and then use them to align. EM attempts to maximize the observed data likelihood $P({\bf e}|{\bf f})$ , which does not contain alignments. To do this, we marginalize over the alignment variable:

$$P({\bf e}|{\bf f}) = \prod_i \sum_j P(a_i = j | |{\bf e}|)$$

This problem can’t be solved in closed form, but we can iteratively hill-climb on the likelihood by first fixing some parameters, computing expectations under those parameters, and maximizing the likelihood as treating expected counts as observed. To compute the iterative update, for every pair of an English word type $e$ and a French word type $f$ , count up the expected number of times $f$ aligns to $e$ and normalize over values of $e$ . That will give you a new estimate of the translation probabilities $P (f |e)$ , which leads to new expectations, and so on. For more detail, read this note. We recommend developing on a small data set (1000 sentences) with a few iterations of EM. When you see improvements on this small set, try it out on the complete data.

Developing a Model 1 aligner should be enough to beat our baseline system and earn a passing grade. But alignment isn’t a solved problem, and the goal of this assignment isn’t for you to just implement a well-known algorithm. To get full credit you must experiment with at least one additional model of your choice and document your work. Here are some ideas:

Implement a model that prefers to align words close to the diagonal.
Implement an HMM alignment model.
Implement a morphologically-aware alignment model.
Use maximum a posteriori inference under a Bayesian prior.
Train a French-English model and an English-French model and combine their predictions.
Train a supervised discriminative alignment model on the annotated development set.
Train an unsupervised discriminative alignment model.
Seek out additional inspiration.

But the sky’s the limit! You are welcome to design your own model, as long as you follow the ground rules:

Ground Rules

You can work in independently or in groups of up to three, under these conditions:
1. You must announce the group by email to phi@jhu.edu.
2. You agree that everyone in the group will receive the same grade on the assignment.
3. You can add people or merge groups at any time before the assignment is due. You cannot drop people from your group once you’ve added them. We encourage collaboration, but we will not adjudicate Rashomon-style stories about who did or did not contribute.
You must turn in three things:
1. An alignment of the entire dataset, uploaded to the leaderboard submission site according to the Assignment 0 instructions. You can upload new output as often as you like, up until the assignment deadline. The output will be evaluated using a secret subset of sentence and you can use the check program to see whether your output is formatted correctly. Whoever has the highest score at the deadline will receive the most bonus points.
  
  Note. The upload site will reject files larger than 1 MB, so please reduce your file to only the first 1,000 lines before uploading, e.g.,
```
 python align | head -n1000 > output.txt
```
2. Your code. Send us a URL from which we can get the code and git revision history (a link to a tarball will suffice, but you’re free to send us a github link if you don’t mind making your code public). This is due at the deadline: when you upload your final answer, send us the code. You are free to extend the code we provide or roll your own in whatever langugage you like, but the code should be self-contained, self-documenting, and easy to use.
3. A clear, mathematical description of your algorithm and its motivation written in scientific style. This needn’t be long, but it should be clear enough that one of your fellow students could re-implement it exactly.
You may only use data or code resources other than the ones we provide with advance permission. We will ask you to make your resources available to everyone. If you have a cool idea using the Berkeley parser, or a French-English dictionary, that’s great. But we want everyone to have access to the same resources, so we’ll ask you to share the parses. This kind of constrained data condition is common in real-world evaluations of AI systems, to make evaluations fair. A few things are off-limits: Giza++, the Berkeley Aligner, or anything else that already does the alignment for you. You must write your own code. If you want to do system combination, join forces with your classmates.

If you have any questions or you’re confused about anything, just ask.

Credits: This assignment was adapted by Adam Lopez, Matt Post, and Chris Callison-Burch from one originally developed by Philipp Koehn and later modified by John DeNero. It incorporates some ideas from Chris Dyer.