Due October 7, 2021 11:59PM

Decoding is process of taking input in one language (e.g. French):

…And finding its best English translation under your model:

To decode, we need a model of English sentences conditioned on the
French sentence. You did most of the work of creating
such a model in word alignment. In this assignment,
we will give you some French sentences and a probabilistic model consisting of
a phrase-based translation model \(p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e})\)
and an n-gram language model \(p_{\textrm{LM}}(\textbf{e})\). **Your
challenge is to find the most probable English translation under
the model.** We assume a noisy channel decomposition.

If you have a clone of the repository from word alignment, you can update it from your working directory:

```
git pull origin master
```

Alternatively, get a fresh copy:

```
git clone https://github.com/xutaima/jhu-mt-hw
```

Under the `hw3`

directory, you now have simple decoder.
Test it out (Make sure to use python3)!

```
python decode > translations
```

This creates the translations of `data/input`

.
You can compute \(p(\textbf{e} \mid \textbf{f})\) using `compute-model-score`

.

```
python compute-model-score < translations
```

This command sums over all possible ways that the model could have generated the English from the French, including translations that permute the phrases. This sum is intractable, but the phrase dictionary is fixed and sparse, so we can compute it in a few minutes. It is still easier to do this than it is to find the optimal translation. But if you look at this command you may get some hints about how to do the assignment!

The decoder generates the most probable translations that it can find, using three common approximations.

First, it seeks the *Viterbi approximation* to the most probable
translation. Instead of computing the intractable sum over
all alignments for each sentence, we simply find the best
single alignment and use its translation.

Second, it translates French phrases into English without
changing their order. So, it only reorders words if
the reordering has been memorized as a phrase pair.
For example, in the first sentence, we see that
*mardi
dernier*
is correctly translated as
*last
Tuesday*.
If we consult `data/tm`

, we will find that the model
has memorized the phrase
pair `mardi dernier ||| last Tuesday`

. But in the
second sentence, we see that
*Comité
de sélection*
is translated as
*committee
selection*,
rather than the more correct
*selection
committee*.
To show that this is a search problem rather than
a modeling problem, we can generate the latter output
by hand and check that the model really prefers it.

```
head -2 data/input | tail -1 > example
python decode -i example | python compute-model-score -i example
echo a selection committee was achievement . | python compute-model-score -i example
```

The scores are reported as log-probabilities, and higher
scores (with lower absolute value) are better. We
see that the model prefers
*selection
committee*,
but the decoder does not consider this word order.

Finally, our decoder uses strict pruning. As it consumes the input
sentence from left to right, it keeps only the highest-scoring
output up to that point. You can vary the number of number
of outputs kept at each point in the translation using the
`-s`

parameter. See how this affects the resulting model score.

```
python decode | python compute-model-score
python decode -s 10000 | python compute-model-score
```

Your task is to **find the most probable English translation**.
Our model assumes that any segmentation of the French sentence into
phrases followed by a one-for-one substitution and permutation of
those phrases is a valid translation. We make the
simplifying assumption that segmentation and ordering
probabilities are uniform across all sentences, hence constant.
This means that \(p(\textbf{e},\textbf{a} \mid \textbf{f})\) is proportional to
the product of the n-gram probabilities in \(p_{\textrm{LM}}(\textbf{e})\)
and the phrase translation probabilities in \(p_{\textrm{TM}}(\textbf{f},\textbf{a} \mid \textbf{e})\). To
avoid numerical underflow we work in logspace, seeking

The baseline decoder works with log probabilities, so you can simply follow what it does.

To pass, you must implement a beam-search
decoder which is capable of *reordering*,
shown in page 16 from the slides.
Any permutation of phrases is a valid translation, so we strongly suggest
searching over all or some part of this larger space. This search is
NP-Hard, so it will not be easy. You can trade efficiency for search effectiveness
by implementing histogram pruning or threshold pruning,
****or**** by using reordering limits as described in the textbook (Chapter 6).
To get full credit, you **must** additionally experiment with another decoding algorithm.
You might consider implementing other approaches to solving this combinatorial
optimization problem:

- Implement a greedy decoder.
- Use chart parsing to search over many permutations in polynomial time.
- Use a traveling salesman problem (TSP) solver.
- Use finite-state algorithms.
- Use Lagrangian relaxation.
- Use integer linear programming.
- Use A* search.

These methods all attempt to approximate or solve the Viterbi approximation to decoding. You can also try to approximate \(p(\textbf{e} \mid \textbf{f})\) directly.

But the sky’s the limit! There are many ways to decode. You can try anything you want as long as you follow the ground rules:

- You can work in independently or in groups of up to three, under these
conditions:
- You must announce the group publicly on piazza.
- You agree that everyone in the group will receive the same grade on the assignment.
- You can add people or merge groups at any time before the assignment is
due.
**You cannot drop people from your group once you’ve added them.**We encourage collaboration, but we will not adjudicate Rashomon-style stories about who did or did not contribute. - You must submit the assignment once per group on Gradescope, and indicate your collaborators once you upload the files.

- You must turn in three things to Gradescope:
- Your best translations of the entire dataset.
**Your translated file must be named**`translations`

. - Your code, uploaded to Gradescope. You are free to extend the code we provide but the code should be self-contained, and easy to use.
**You must submit at least two files**Please include a`decode`

and`decode-ext`

. While`python decode`

generates the translations from the basic decoding algorithm,`python decode-ext`

generates the translations from your best extended algorithm.`README`

for your code. Please also include a`requirement.txt`

file if you use any additional package. - A clear, mathematical description of your algorithm and its motivation written in scientific style, uploaded to Gradescope. This needn’t be long, but it should be clear enough that one of your fellow students could re-implement it exactly. If you modified your algorithm or have more than 1 algorithm, explain each modification/algorithm clearly.

- Your best translations of the entire dataset.
- You do not need any other data than what we provide. You can
free to use any code or software you like,
**except for those expressly intended to decode machine translation models**. You must write your own decoder. If you want to use finite-state libraries, solvers for traveling salesman problems, or integer linear programming toolkits, that is fine. But machine translation software including (but not limited to) Moses, cdec, Joshua, or phrasal is off-limits. You may of course inspect these systems if it helps you understand how they work. But be warned: they are generally quite complicated because they provide a great deal of other functionality that is not the focus of this assignment. It is possible to complete the assignment with a modest amount of python code. If you aren’t sure whether something is permitted, ask us. If you want to do system combination, join forces with your classmates. - Don’t wait till the last minute, this assignment is longer than the previous.

*Credits: This assignment was developed by Adam Lopez,
Matt Post,
and Chris Callison-Burch. Chris Dyer made many improvements to this assignment.*