Aligning words is a key task in machine translation. We start with a large parallel corpus of aligned sentences. For example, we might have the following sentence pair from the proceedings of the bilingual Canadian parliament:
le droit de permis passe donc de $ 25 à $ 500.
we see the licence fee going up from $ 25 to $ 500.
Getting documents aligned at the sentence level like this is relatively easy: we can use paragraph boundaries and cues like the length and order of each sentence. But to learn a translation model we need alignments at the word level. That’s where you come in. Your challenge is to write a program that aligns words automatically. For example, given the sentence above, your program would ideally output these pairs:
le – the, droit – fee, permis – license, passe – going, passe – up, donc – from, $ – $, 25 – 25, à – to, $ – $, 50 – 50
Your program can leave words unaligned (e.g. we and see) or multiply aligned (e.g. passe aligned to going up). It will be faced with difficult choices. Suppose it sees this sentence pair:
I want to make it clear that we have to let this issue come to a vote today.
il est donc essentiel que cette question fasse le objet de un vote aujourd’ hui .
Your program must make a choice about how to align the words of the non-literal translations I want to make it clear and il est donc essentiel. Even experienced bilinguals will disagree on examples like this. So word alignment does not capture every nuance, but it is still very useful.
You must have git and python (2.7) on your system to run the assignments. Once you’ve confirmed this, run this command:
git clone https://github.com/alopez/en600.468.git
In the aligner
directory you will find a python program called
align
, which contains a complete but very simple alignment algorithm.
For every word, it computes the set of sentences that the word appears in.
Intuititvely, word pairs that appear in similar sets of sentences are likely
to be translations. Our aligner first computes the similarity of these sets with
Dice’s coefficient. Given
sets and , Dice’s coefficient is:
$$\delta(X,Y) = \frac{2 \times |X \cap Y|}{|X| + |Y|}$$
For any two sets and , will be a number between 0 and 1. The baseline aligner will align any word pair with a coefficient over 0.5. Run it on 1000 sentences:
python align -n 1000 > dice.a
This command stores the output in dice.a
. To compute accuracy, run:
python score-alignments < dice.a
This compares the alignments against human-produced alignments, computing alignment error rate, which balances precision and recall. It will also show you the comparison in a grid. Look at the terrible output of this heuristic method – it’s better than chance, but not any good. Try training on 10,000 sentences:
python align -n 10000 | python score-alignments
Performance should improve, but only slightly! Try changing the threshold for alignment. How does this affect alignment error rate?
Your task is to improve the alignment error rate as much as possible. It shouldn’t be hard: you’ve probably noticed that thresholding a Dice coefficient is a bad idea because alignments don’t compete against one another. A good way to correct this is with a probabilistic model like IBM Model 1. It forces all of the English words in a sentence to compete as the explanation for each foreign word.
Formally, IBM Model 1 is a probabilistic model that generates each word of the foreign sentence independently, conditioned on some word in the English sentence . Given , the joint probability of an alignment and translation factors across words: . In Model 1, we fix to be uniform (i.e. equal to ), so this probability depends only on the word translation parameters . But where do thes parameters come from? You will first learn them from the data using expectation maximization (EM), and then use them to align. EM attempts to maximize the observed data likelihood , which does not contain alignments. To do this, we marginalize over the alignment variable:
$$P({\bf e}|{\bf f}) = \prod_i \sum_j P(a_i = j | |{\bf e}|)$$
This problem can’t be solved in closed form, but we can iteratively hill-climb on the likelihood by first fixing some parameters, computing expectations under those parameters, and maximizing the likelihood as treating expected counts as observed. To compute the iterative update, for every pair of an English word type and a French word type , count up the expected number of times aligns to and normalize over values of . That will give you a new estimate of the translation probabilities , which leads to new expectations, and so on. For more detail, read this note. We recommend developing on a small data set (1000 sentences) with a few iterations of EM. When you see improvements on this small set, try it out on the complete data.
Developing a Model 1 aligner should be enough to beat our baseline system and earn a passing grade. But alignment isn’t a solved problem, and the goal of this assignment isn’t for you to just implement a well-known algorithm. To get full credit you must experiment with at least one additional model of your choice and document your work. Here are some ideas:
But the sky’s the limit! You are welcome to design your own model, as long as you follow the ground rules:
An alignment of the entire dataset, uploaded to the leaderboard submission site according to the Assignment 0 instructions. You can upload new output as often
as you like, up until the assignment deadline. The output will be evaluated
using a secret metric, but the grade
program will give you a good
idea of how well you’re doing, and you can use the check
program
to see whether your output is formatted correctly. Whoever has
the highest score at the deadline will receive the most bonus points.
Note. The upload site will reject files larger than 1 MB, so please reduce your file to only the first 1,000 lines before uploading, e.g.,
python align | head -n1000 > output.txt
If you have any questions or you’re confused about anything, just ask.
Credits: This assignment is adapted from one originally developed by Philipp Koehn and later modified by John DeNero. It incorporates some ideas from Chris Dyer.