Writing is an essential part of science. How will anyone know or care about your results if you don’t communicate them? As part of your homework assignments, we expect you to turn in a report describing the models you designed and experimented with. Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. So, you should focus on an accurate technical description of these things:
To help calibrate your expectations, we ask you to first read and evaluate the example writeups below. They are real examples from others who have implemented good solutions to Assignment 1, and if you can reimplement one of these, you are likely to get very good results. You should read them critically, as if you were reviewing for a scientific conference. Ask yourself the following questions:
Focus on clarity rather than novelty. It is fine if the reports describe models and algorithms that first appeared somewhere else in the scientific literature. In fact, we expect that to be the case for most of the reports. Your sole objective is to decide how clear each report is. Once you start to internalize the idea of clear scientific writing, you can start to apply it to your own writing.
I implemented IBM Model 2 (seeded with IBM Model 1 probabilities) and trained it in both directions on the data. I kept all links on which the two directions matched, kept links that appeared reversed in the other direction, and kept the top 40% of the remaining alignments (not counting those already kept).

I also tried other things, but the only one that helped the score was training a word-stem distribution between the IBM Model 1 and IBM Model 2 stages; however, I kept running out of RAM when running it on the full data set.
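A rough sketch of one way to implement this symmetrization heuristic, assuming each direction produces a set of (i, j) links per sentence and a score per link (the set representation, `scores` table, and `keep_fraction` parameter are illustrative assumptions, not details from the writeup):

```python
def symmetrize(fwd, rev, scores, keep_fraction=0.4):
    """Combine one sentence's f->e and e->f alignments.

    fwd: set of (i, j) links from the f->e model.
    rev: set of (i, j) links from the e->f model, already flipped
         into (French index, English index) order.
    scores: dict mapping (i, j) -> model score, used to rank leftovers.
    """
    # Keep every link that both directions agree on.
    kept = fwd & rev
    # Rank the remaining candidate links by score and keep the top 40%.
    rest = sorted((fwd | rev) - kept, key=scores.get, reverse=True)
    kept |= set(rest[: int(keep_fraction * len(rest))])
    return kept
```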
Other things I tried:
Our model combines three basic insights about word alignment:
To model these effects, we use ideas from:
The basic model follows the form of Model 1. Given a French sentence ${\bf f} = f_1,\ldots,f_n$ and an English sentence ${\bf e} = e_1,\ldots,e_m$, we model alignments of the form ${\bf a} = a_1,\ldots,a_n$, where each $a_i$ takes a value from 1 to $m$, denoting the index of the English word to which the $i$th French word is aligned. We do not model null alignment. The probability of a particular alignment and the French sentence, given the English, is then:
$$p({\bf f},{\bf a}|{\bf e}) = \prod_{i=1}^n p(a_i|i,m,n) \times p(f_i|e_{a_i})$$
We define $p(a_i=j|i,m,n)$ as $\frac{1}{Z}\exp(\lambda\, h(i,j,m,n))$, where $Z$ is a normalization term, $\lambda$ is a parameter that we tune on the development data, and $h(i,j,m,n) = -\left|\frac{i}{n} - \frac{j}{m}\right|$. We used a fixed value of $\lambda$ in our submission (Dyer et al. learn this parameter along with the others using a stochastic gradient step in EM). As $\lambda$ increases, the model prefers alignments that are closer to the diagonal. Under this model, the $a_i$'s are conditionally independent, and the posterior probability that $f_i$ aligns to $e_j$ is just:
$$p(a_i=j|{\bf f},{\bf e}) = \frac{p(a_i=j|i,m,n) \times p(f_i|e_j)}{\sum_{j'=1}^{m} p(a_i=j'|i,m,n) \times p(f_i|e_{j'})}$$
We can then use this posterior as usual in the expectation maximization algorithm.
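Concretely, the E step under this model might look like the following sketch for one sentence pair, assuming a translation table `t` mapping `(f, e)` pairs to the current estimate of $p(f|e)$ (the default `lam` value is illustrative, not the tuned one):

```python
import math
from collections import defaultdict

def diag_prior(i, j, m, n, lam=4.0):
    # Unnormalized p(a_i = j | i, m, n): favor links near the diagonal.
    # lam is illustrative; the writeup tunes it on the dev data.
    return math.exp(-lam * abs(i / n - j / m))

def e_step_counts(f_sent, e_sent, t, lam=4.0):
    """Expected-count accumulation for one (French, English) sentence pair.

    t maps (f, e) to the current estimate of p(f | e).
    Returns fractional counts[(f, e)] for the M step.
    """
    n, m = len(f_sent), len(e_sent)
    counts = defaultdict(float)
    for i, f in enumerate(f_sent, start=1):
        # Unnormalized posterior p(a_i = j | f, e) over English positions j.
        weights = [diag_prior(i, j, m, n, lam) * t[f, e]
                   for j, e in enumerate(e_sent, start=1)]
        z = sum(weights)
        for e, w in zip(e_sent, weights):
            counts[f, e] += w / z
    return counts
```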
This model enforces that each French word is aligned to exactly one English word, but there is no similar constraint on the English words, which can align to arbitrarily many French words. If we enforced a similar constraint on the English words, inference would be intractable. One approximation is to learn two translation models, $p_{\theta_1}({\bf f}|{\bf e})$ and $p_{\theta_2}({\bf e}|{\bf f})$, parameterized by $\theta_1$ and $\theta_2$ respectively, and combine their predictions. We do this at decoding time by taking the intersection of the most probable alignments given by each model. We can take it a step further by learning the two models together, modifying the posteriors so that, in expectation, they reflect a preference for one-to-one alignments. Denote the event that $f_i$ aligns to $e_j$ by $x_{ij}$. Then the posterior probability under both models is proportional to:
$$p(x_{ij}|{\bf f},{\bf e}) \propto p_{\theta_1}(a_i=j|{\bf f},{\bf e}) \times p_{\theta_2}(a_j=i|{\bf e},{\bf f})$$
Liang et al. (2006) show that this can be viewed as a heuristic approximation to a model that enforces one-to-one alignments on both the English and French sides. We simply use this value in the computation of the posterior probability for both alignment models.
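A sketch of the posterior combination for one sentence pair, assuming the directional posteriors are stored as dense matrices:

```python
def joint_posteriors(post_fe, post_ef):
    """Multiply the two directional posteriors and renormalize over j.

    post_fe[i][j]: p(a_i = j | f, e) from the f->e model (n x m).
    post_ef[j][i]: p(a_j = i | e, f) from the e->f model (m x n).
    Returns an n x m matrix usable as the posterior in either E step.
    """
    n, m = len(post_fe), len(post_ef)
    joint = [[post_fe[i][j] * post_ef[j][i] for j in range(m)]
             for i in range(n)]
    for row in joint:
        z = sum(row) or 1.0  # guard against all-zero rows
        row[:] = [w / z for w in row]
    return joint
```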
We implemented these models in 92 lines of Python code and tested several combinations on the development set, training on 1,000 sentences.
| Model | AER |
|---|---|
| Dice | 0.68 |
| Model 1 p(f\|e) | 0.51 |
| Model 1 p(e\|f) | 0.45 |
| Model 1 intersection | 0.38 |
| Model 1 joint train | 0.44 |
| Model 1 joint train + intersection | 0.32 |
| Model 2 p(f\|e) | 0.45 |
| Model 2 p(e\|f) | 0.36 |
| Model 2 intersection | 0.33 |
| Model 2 joint train | 0.37 |
| Model 2 joint train + intersection | 0.27 |
Our final AER on the development set with all training data is 0.177.
Here is my algorithm:
Note: I didn’t really think of the algorithm this way, and I actually do the counting as I create the dataset.
I implemented the model described in Dyer et al. 2013, in addition to Model 1.
A diagonal prior is imposed on the alignment links, and a mean-field approximation is used in the E step to estimate the posterior translation distribution with a Dirichlet prior.
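Under a symmetric Dirichlet prior of strength `alpha` over each translation distribution, the mean-field update replaces the usual count normalization in the M step with an exp-digamma form. A minimal sketch, assuming expected counts are stored as `counts[e][f]` and using only the observed `f` types in the denominator (a common sparse shortcut; the exact update sums over the full vocabulary):

```python
import math
from scipy.special import digamma

def vb_m_step(counts, alpha=0.01):
    """Mean-field update for the translation table under a Dirichlet prior.

    counts[e][f]: expected count of English word e generating f, from
    the E step. Returns t[(f, e)], the approximate mean of q(f | e).
    Note: len(row) counts only f types seen with e, a sparse shortcut;
    the exact update would use the full vocabulary size instead.
    """
    t = {}
    for e, row in counts.items():
        denom = digamma(sum(row.values()) + alpha * len(row))
        for f, c in row.items():
            t[f, e] = math.exp(digamma(c + alpha) - denom)
    return t
```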
Then, I applied several pre-processing steps to improve alignments:

- lowercase the data
- stem German and English with the Snowball stemmer (see stem-corpus.py; sketched below)
- split the compounds in German using cdec (see csplit.py and uncsplit.py)
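A hypothetical reconstruction of the stemming step (stem-corpus.py), assuming the corpus is piped through stdin one sentence pair per line in `german ||| english` format (an assumption about the data layout) and that NLTK's Snowball stemmers are available:

```python
import sys
from nltk.stem.snowball import SnowballStemmer

# Hypothetical stand-in for stem-corpus.py: read "german ||| english"
# lines from stdin and stem each side with the Snowball stemmers.
stem_de = SnowballStemmer("german").stem
stem_en = SnowballStemmer("english").stem

for line in sys.stdin:
    de, en = line.rstrip("\n").split(" ||| ")
    out_de = " ".join(stem_de(tok) for tok in de.split())
    out_en = " ".join(stem_en(tok) for tok in en.split())
    print(out_de, "|||", out_en)
```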
Finally, I tuned the hyperparameters of the systems (diagonal tension, null alignment probability and Dirichlet prior strength) on the dev set to minimize AER using the simplex method.
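The simplex search can be run with `scipy.optimize.minimize`; the objective below is a hypothetical stub standing in for the full train-and-grade pipeline, and the starting values are illustrative only:

```python
from scipy.optimize import minimize

def dev_aer(params):
    """Hypothetical objective: train with these hyperparameters and
    return the dev-set AER (to be filled in with the real pipeline)."""
    tension, p_null, alpha = params
    raise NotImplementedError  # train, align the dev set, compute AER

# Nelder-Mead is scipy's simplex method.
best = minimize(dev_aer, x0=[4.0, 0.08, 0.01], method="Nelder-Mead")
print(best.x, best.fun)
```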
| Model | AER | Δ |
|---|---|---|
| Baseline | 0.792556 | |
| Model 1 | 0.421544 | 0.37 |
| + diagonal prior | 0.290222 | 0.13 |
| + Dirichlet prior | 0.269823 | 0.02 |
| + compound split | 0.257015 | 0.01 |
| + stem | 0.243009 | 0.01 |
| + tune | 0.233541 | 0.01 |
Run the following command to replicate the experiments:
```
cat data/dev-test-train.de-en | python csplit.py | python stem-corpus.py | python modelc.py | python uncsplit.py | ./check | ./grade -n 0
```