Evaluation: Challenge Problem 3

Machine translation systems are typically evaluated through relative ranking. For instance, given the following German sentences:

Die Prager Börse stürzt gegen Geschäftsschluss ins Minus.
Nach dem steilen Abfall am Morgen konnte die Prager Börse die Verluste korrigieren.

One machine translation system produces this output:

The Prague stock exchange throws into the minus against closing time.
After the steep drop in the morning the Prague stock exchange could correct the losses.

A second machine translation system produces this output:

The Prague stock exchange risks geschäftsschluss against the down.
After the steilen waste on the stock market Prague, was aez almost half of the normal tagesgeschäfts.

A plausible ranking of the translation systems would place the first system higher than the second. While neither translation is perfect, the first one is clearly easier to understand and conveys more of the original meaning. Your challenge is to write a program that ranks the systems in the same order that a human evaluator would.

There is a great need to evaluate translation systems: to decide whether to purchase one system or another, or to assess incremental changes to a system, since, as you saw in the first two assignments, there are many choices in the design of a system that naturally lead to different translations. Ideally such comparisons between systems should be done by humans, but human rankings are slow and costly to obtain, making them less feasible when comparisons must be made frequently or between large numbers of systems. Furthermore, as you saw in class, it is possible to use machine learning techniques to directly optimize machine translation towards an objective function. If we could devise a function that correctly ranked systems, we could, in principle, achieve better translation through optimization. Hence automatic evaluation is a topic of intense study.

Getting Started

If you already have a clone of the repository from the previous assignments you can update it from anywhere in your project directory by running the command:

git pull origin master

Alternatively, clone a fresh version of the repository by running:

git clone https://github.com/alopez/dreamt.git

Under the new evaluate directory, we have provided you with a very simple evaluation program written in Python. There is also a directory containing a development dataset and a test dataset. Each dataset consists of a human translation and many machine translations of some German documents. The evaluator compares each machine translation to a human reference translation sentence by sentence, computing how many words they have in common. It then ranks the machine translation systems according to the percentage of words that also appear in the reference. Note that while we collect statistics for each sentence, and the best system on each sentence will vary, the final ranking is at the system level. Run the evaluator on the development data using this command:

evaluate > output
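For intuition, the scoring inside the provided evaluator amounts to a simple word-overlap computation, roughly like the sketch below. This is not the actual evaluate script: the function names, the data structures, and the averaging over sentences are our own simplifications.

# Minimal sketch of a word-overlap metric in the spirit of the default evaluator.
# Hypothetical helper names; the real script differs in its I/O and details.

def overlap_fraction(hypothesis, reference):
    """Fraction of hypothesis tokens that also appear in the reference."""
    hyp = hypothesis.lower().split()
    ref = set(reference.lower().split())
    if not hyp:
        return 0.0
    return sum(1 for word in hyp if word in ref) / float(len(hyp))

def rank_systems(system_outputs, references):
    """system_outputs maps each system name to its list of output sentences;
    references is the parallel list of human reference sentences."""
    scores = {}
    for name, sentences in system_outputs.items():
        per_sentence = [overlap_fraction(h, r) for h, r in zip(sentences, references)]
        scores[name] = sum(per_sentence) / len(per_sentence)
    return sorted(scores, key=scores.get, reverse=True)  # best system first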

The evaluate command runs the evaluation and stores the final ranking in output. You can see the rank order of the systems simply by looking at the file: the first line is the best system, the second is second best, and so on. To calculate the correlation between this ranking and a human ranking of the same set of systems, run the command:

grade < output

This command computes Spearman's rank correlation coefficient (\(\rho\)) between the automatic ranking and a human ranking of the systems (see Section 4 of this paper for an explanation of how the human rankings were obtained). A \( \rho \) of 1 means that the rankings are identical, a \( \rho \) of zero means that they are uncorrelated, and a negative \( \rho \) means that they are inversely correlated.
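Because your ranking is a total ordering with no ties, Spearman's \(\rho\) reduces to a simple formula over rank differences, \( \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} \), where \(d_i\) is the difference between the two ranks assigned to system \(i\) and \(n\) is the number of systems. Here is a small sketch of that computation (not the grade script itself):

def spearman_rho(ranking_a, ranking_b):
    """Spearman's rank correlation between two total orderings (no ties).
    Each ranking is a list of system names, best first."""
    n = len(ranking_a)
    rank_a = {system: i for i, system in enumerate(ranking_a)}
    rank_b = {system: i for i, system in enumerate(ranking_b)}
    d_squared = sum((rank_a[s] - rank_b[s]) ** 2 for s in ranking_a)
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

If SciPy is available, scipy.stats.spearmanr computes the same quantity (and also handles ties).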

You should also rank the test data, for which we did not provide human rankings. To do this, run the command:

evaluate -d data/test > output

You can confirm that the output is a valid ranking of the test data using the check command:

check < output

Your ranking should be a total ordering of the systems — ties are not allowed.

The Challenge

Improving the evaluation algorithm should cause \(\rho\) to increase. Your task for this assignment is to obtain a Spearman's rank correlation coefficient that is as high as possible on the test data. Whoever obtains the highest \(\rho\) will receive the most points.

One way to improve over the default system is to implement the well-known BLEU metric. You may find it useful to experiment with BLEU's parameters, or to retokenize the data in some way. However, there are many, many alternatives to BLEU: the topic of evaluation is so popular that Yorick Wilks, a well-known researcher, once remarked that more has been written about machine translation evaluation than about machine translation itself. Some of the techniques people have tried result in stronger correlation with human judgement than BLEU does.
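If you do go the BLEU route, note that the ground rules below forbid off-the-shelf BLEU implementations, but the core of the metric is small enough to write yourself: clipped n-gram precision up to some maximum order (typically 4), combined by a geometric mean and multiplied by a brevity penalty. Below is a hedged sketch, assuming a single reference per sentence and no smoothing; it is one reasonable reading of the metric, not the reference implementation.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU sketch: clipped n-gram precision plus a brevity penalty.
    hypotheses and references are parallel lists of token lists."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngrams(hyp, n)
            ref_ngrams = ngrams(ref, n)
            match[n - 1] += sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())
    if 0 in match or 0 in total:
        return 0.0  # crude; real implementations smooth the precisions instead
    log_precision = sum(math.log(m / float(t)) for m, t in zip(match, total)) / max_n
    log_brevity = min(0.0, 1.0 - ref_len / float(hyp_len))
    return math.exp(log_brevity + log_precision)

Experimenting with the maximum n-gram order, the weights on each precision, smoothing, or tokenization is exactly the kind of parameter tuning suggested above.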

But the sky's the limit! There are many, many ways to automatically evaluate machine translation systems, and you can try anything you want as long as you follow the ground rules:

Ground Rules

  • You may work independently or in groups of any size, under these conditions:
    1. You must notify us by posting a public note to Piazza.
    2. Everyone in the group will receive the same grade on the assignment.
    3. You can add people or merge groups at any time before you post your final submission. HOWEVER, you cannot drop people from your group once you've added them. Collaboration is fine with us, but adjudicating Rashomon-style stories about who did or did not contribute is not.
  • You must turn in three things:
    1. Your ranking of the systems in the test dataset, uploaded to BASE_URL/assignment3.txt following the Assignment 0 instructions. You can upload new output as often as you like. Your rankings will be evaluated using a hidden metric. However, the grade program will give you a good indication of how you're doing. Whoever has the highest score on the leaderboard at the assignment deadline will receive the most bonus points.
    2. Your code. Send us a URL from which we can get the code and git revision history (a link to a tarball will suffice, but you're free to send us a GitHub link if you don't mind making your code public). This is due at the deadline: when you upload your final answer, send us the code. You are free to extend the code we provide or roll your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
    3. A description of your algorithm, posted to Piazza. Brevity is encouraged, as long as it is clear what you did; a paragraph or even bullet points is fine. You should tell us not only about your final algorithm, but also things that you tried that didn't work, experiments that you did, or other interesting things that you observed while working on it. What did you learn? Do you think that this is a reasonable way to evaluate machine translation systems? Please post your response within two days of submitting your final solution to the leaderboard; we will withhold your grade until we receive it.
  • You should feel free to use additional data resources such as thesauruses, WordNet, or parallel data. You are also free to use additional codebases and libraries except for those expressly intended to evaluate machine translation systems. You must write your own evaluation metric. However, if you want your evaluation to depend on lemmatizers, stemmers, automatic parsers, or part-of-speech taggers, or you would like to learn a metric using a general machine learning toolkit, that is fine. But translation metrics including (but not limited to) available implementations of BLEU, METEOR, TER, NIST, and others are not permitted. You may of course inspect these systems if you want to understand how they work, although they tend to include other functionality that is not the focus of this assignment. It is possible to complete the assignment with a very modest amount of Python code (a small sketch of the kind of resource use that is allowed appears after this list). If you aren't sure whether something is permitted, ask us.
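For example, here is a hedged sketch of looser word matching that uses two allowed resources, a stemmer and WordNet, via NLTK. It assumes NLTK and its WordNet data are installed; it is not part of the provided code, and how you fold such counts into a full metric is up to you.

# Sketch: count hypothesis words that match the reference exactly, by stem, or by synonym.
# Assumes NLTK is installed and nltk.download('wordnet') has been run.
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet

stemmer = PorterStemmer()

def synonyms(word):
    """All WordNet lemma names sharing a synset with `word`, plus the word itself."""
    names = {word}
    for synset in wordnet.synsets(word):
        names.update(lemma.name().lower() for lemma in synset.lemmas())
    return names

def loose_matches(hypothesis, reference):
    """Hypothesis tokens matching the reference exactly, by stem, or via a WordNet synonym."""
    ref_words = set(reference.lower().split())
    ref_stems = set(stemmer.stem(w) for w in ref_words)
    count = 0
    for word in hypothesis.lower().split():
        if (word in ref_words
                or stemmer.stem(word) in ref_stems
                or synonyms(word) & ref_words):
            count += 1
    return count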

If you have any questions or you're confused about anything, just ask.