Alignment: Challenge Problem 1
Aligning words is a key task in data-driven machine translation. We start with a large parallel corpus where the translation of each sentence is known. So the input consists of sentence pairs like this one:
le droit de permis passe donc de $ 25 à $ 500 . — we see the licence fee going up from $ 25 to $ 500 .
It is relatively easy to obtain data in this form, though not completely trivial. This particular example is from the Canadian Hansards, proceedings of government meetings that are required to be published in both English and French. To align the sentences we can exploit document boundaries and structural cues that arise because the documents are translations of each other: translated sentences appear in the same order, have similar lengths, and so forth. The problem is that we don't know the word-to-word correspondences in each sentence. That's where you come in: your challenge is to write a program that aligns the words automatically. For the sentence above, you might output the following correspondences:
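le → the   droit → fee   permis → licence   passe → going up   de → from   $ 25 → $ 25   à → to   $ 500 → $ 500   . → .

(This is one plausible set of links; the remaining words on both sides are left unaligned.)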
Some words might be unaligned (e.g. we and see), while some words might be aligned to multiple words in the corresponding sentence (e.g. passe is aligned to going up). Sometimes words aren't exact translations of each other. That's ok: even experienced bilinguals will sometimes disagree on the best alignment of a sentence. But while word alignment doesn't capture every nuance, it is very useful for learning translation models or bilingual dictionaries.
Getting Started
Run this command:
git clone https://github.com/alopez/dreamt.git
We have provided you with a very simple aligner written in Python and around two million words of the Canadian Hansards. The source code of the aligner is in a file called align under the aligner directory. The baseline algorithm works in two phases. First, it trains a model: the aligner observes the training sentences and assigns a score between 0 and 1 to each pair of English and foreign words. The baseline simply computes Dice's coefficient for each pair. Second, it aligns all pairs of words whose score is over 0.5. Run the baseline heuristic model on 1,000 sentences using the command:
align -n 1000 > dice.a
This runs the aligner and stores the output in dice.a. To view and score the alignments, run this command:
grade < dice.a
This command scores the alignment quality by comparing the output alignments against a set of human alignment annotations, using a metric called the alignment error rate (AER), which balances the precision and recall of the guessed alignments (paper). Look at the terrible output of this heuristic model -- it's better than chance, but not any good. Try training on 10,000 sentences instead of 1,000 by specifying the change on the command line:
align -n 10000 | grade
Performance should improve, but only slightly! Another experiment that you can do is change the threshold criterion used by the aligner. How does this affect alignment error rate?
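For reference, the alignment error rate defined in the paper linked above is computed from your link set \( A \), the gold-standard sure links \( S \), and the gold-standard possible links \( P \) as \( \mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|} \); lower is better.

If it helps to see the baseline spelled out, here is a minimal sketch of Dice-based alignment. This is our own re-implementation for illustration, not the exact code of the provided align script, and it assumes the one-line-per-sentence "i-j" link format, with i indexing the foreign word and j the English word; check the provided code for the exact conventions.

from collections import Counter
from itertools import product

def dice_align(bitext, threshold=0.5):
    # bitext: list of (foreign_tokens, english_tokens) sentence pairs.
    f_count, e_count, fe_count = Counter(), Counter(), Counter()
    for f_sent, e_sent in bitext:
        f_types, e_types = set(f_sent), set(e_sent)
        f_count.update(f_types)
        e_count.update(e_types)
        fe_count.update(product(f_types, e_types))
    # Dice's coefficient, computed from sentence-pair co-occurrence counts.
    def dice(f, e):
        return 2.0 * fe_count[(f, e)] / (f_count[f] + e_count[e])
    # Print one line per sentence pair: "i-j" for every pair over threshold.
    for f_sent, e_sent in bitext:
        links = ["%d-%d" % (i, j)
                 for i, f in enumerate(f_sent)
                 for j, e in enumerate(e_sent)
                 if dice(f, e) > threshold]
        print(" ".join(links))

Note how each word pair is scored in isolation: nothing stops a single foreign word from being linked to many English words at once, which is exactly the weakness the next section asks you to fix.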
The Challenge
Your task for this assignment is to improve the alignment error rate as much as possible. It shouldn't be hard to improve it at least some: you've probably noticed that thresholding a Dice coefficient is a bad idea because alignments don't compete against one another. A good way to correct this is with a probabilistic model like IBM Model 1. It forces all of the English words in a sentence to compete as the explanation for each foreign word.
Formally, IBM Model 1 is a probabilistic model that generates each word of the foreign sentence \( {\bf f} \) independently, conditioned on some word in the English sentence \( {\bf e} \). The likelihood of a particular alignment \( {\bf a} \) of the foreign sentence therefore factors across words: \( P({\bf f}, {\bf a} | {\bf e}) = \prod_i P(a_i | |{\bf e}|) \times P(f_i | e_{a_i}) \). In Model 1, we fix \( P(a_i = j | |{\bf e}|) \) to be uniform (i.e. equal to \( \frac{1}{|{\bf e}|} \)), so the likelihood only depends upon the conditional word translation parameters \( P(f | e) \).
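Because the alignment prior is uniform, it cancels out of the posterior, and the probability that \( f_i \) was generated by \( e_j \) takes a particularly simple form: \( P(a_i = j | {\bf f}, {\bf e}) = \frac{P(f_i | e_j)}{\sum_{j'} P(f_i | e_{j'})} \). These posteriors are exactly the fractional counts collected in the EM update described next.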
The iterative EM update for this model is straightforward. For every pair of an English word type \( e \) and a French word type \( f \), you count up the expected (fractional) number of times tokens of \( f \) are aligned to tokens of \( e \), then normalize these counts for each \( e \). That gives you a new estimate of the translation probabilities \( P(f|e) \), which leads to new alignment posteriors, and so on. We recommend developing on a small data set (1,000 sentences) and running only a few iterations of EM. When you are finished, you should see some improvement in your alignments.
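To make the procedure concrete, here is a minimal sketch of Model 1 EM. It is our own illustration under simplifying assumptions (no null word, no smoothing, no attention to efficiency), not the required implementation:

from collections import defaultdict

def train_model1(bitext, iterations=5):
    # bitext: list of (foreign_tokens, english_tokens) sentence pairs.
    # Initialize P(f|e) to a constant for every co-occurring word pair;
    # the first M-step will turn these into proper probabilities.
    t = defaultdict(float)
    for f_sent, e_sent in bitext:
        for f in f_sent:
            for e in e_sent:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f linked to e
        total = defaultdict(float)  # expected count of e linked to anything
        for f_sent, e_sent in bitext:
            for f in f_sent:
                # E-step: posterior over which English word generated f.
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize the expected counts for each e.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

def align_sentence(f_sent, e_sent, t):
    # With a uniform alignment prior, the best guess for each foreign
    # word is just its most probable English generator.
    return [(i, max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])]))
            for i, f in enumerate(f_sent)]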
Developing a Model 1 aligner should be enough to beat our baseline system and earn seven points. But alignment isn't a solved problem, and the goal of this assignment isn't for you to just implement a well-known algorithm (in fact, you are not required to implement Model 1 at all, as long as you can beat it). To get full credit you'll need to experiment. Here are some ideas:
- Implement an HMM aligner (paper).
- Add a sparse prior to word alignment probabilities and learn the model using Bayesian inference (paper).
- Train a French-English model and an English-French model, then combine their predictions in some way (paper); one simple combination is sketched after this list.
- Train a supervised discriminative feature-based aligner using the annotated development set (paper).
- Train an unsupervised feature-based aligner using the annotated development set (paper).
- Seek out additional inspiration.
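For example, the simplest way to combine two directional models (the third idea above) is to keep only the links both models agree on. A hypothetical sketch, assuming each model's output is a set of index pairs:

def intersect(fe_links, ef_links):
    # fe_links: set of (i, j) links from a foreign-to-English model;
    # ef_links: set of (j, i) links from an English-to-foreign model.
    # The intersection trades recall for precision.
    return fe_links & {(i, j) for (j, i) in ef_links}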
But the sky's the limit! You can try anything you want as long as you follow the ground rules:
Ground Rules
- You may work independently or in groups of any size, under these conditions:
- You must notify us by posting a public note to Piazza.
- Everyone in the group will receive the same grade on the assignment.
- You can add people or merge groups at any time before the assignment is due. HOWEVER, you cannot drop people from your group once you've added them. Collaboration is fine with us, but adjudicating Rashomon-style stories about who did or did not contribute is not.
- You must turn in three things:
- An alignment of the entire dataset, uploaded to BASE_URL/assignment1.txt following the Assignment 0 instructions. You can upload new output as often as you like, up until the assignment deadline. The output will be evaluated using a secret metric, but the grade program will give you a good idea of how well you're doing, and you can use the check program to see whether your output is formatted correctly. Whoever has the highest score at the deadline will receive the most bonus points.
- Your code. Send us a URL from which we can get the code and git revision history (a link to a tarball will suffice, but you're free to send us a github link if you don't mind making your code public). This is due at the deadline: when you upload your final answer, send us the code. You are free to extend the code we provide or roll your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
- A brief description of your algorithm, posted to Piazza. This can be of any length, as long as it is clear what you did. You have an additional two days to do this.
- You may use data or code resources other than the ones we provide only with advance permission, and we'll probably ask you to make those resources available to everyone. So if, say, you have a cool idea for using the Berkeley parser, or a French-English dictionary, that's great. But to keep things fair, everyone should have access to the same set of resources, so we'll ask you to share the parses. This kind of constrained data condition is common in real-world evaluations of AI systems. In keeping with this philosophy, we're much more likely to approve your request if you ask early; do not ask the night before the assignment is due. A few things are off-limits: Giza++, the Berkeley Aligner, or anything else that already does the alignment for you. You must write your own code for alignment. If you want to do system combination, join forces with your classmates.
If you have any questions or you're confused about anything, just ask.
Credits: This assignment is adapted from one originally developed by Philipp Koehn, and later modified by John DeNero.