
Evaluation Challenge Problem 3

Automatic evaluation is a key problem in machine translation. Suppose that we have two machine translation systems. On one sentence, system A outputs:

This type of zápisníku was very ceněn writers and cestovateli.

And system B outputs:

This type of notebook was very prized by writers and travellers.

We suspect that system B is better, though we don’t necessarily know that its translations of the words zápisníku, ceněn, and cestovateli are correct. But suppose that we also have access to the following reference translation.

This type of notebook is said to be highly prized by writers and travellers.

We can easily judge that system B is better. Your challenge is to write a program that makes this judgement automatically.

Getting started

If you have a clone of the repository from previous homeworks, you can update it from your working directory:

git pull origin master

Alternatively, get a fresh copy:

git clone

Under the evaluate directory, you now have a simple program that decides which of two machine translation outputs is better. Test it out!

python evaluate > eval.out

This assignment uses a very simple evaluation method. Given machine translations $h_1$ and $h_2$ and reference translation $e$, it computes $f(h_1,h_2,e)$ as follows, where $\ell(h,e)$ is the count of words in $h$ that are also in $e$.

$$f(h_1,h_2,e) = \left\{\begin{array}{lcc}1 & \textrm{if} & \ell(h_1,e) > \ell(h_2,e)\\ 0 & \textrm{if} & \ell(h_1,e) = \ell(h_2,e)\\-1 & \textrm{if} & \ell(h_1,e) < \ell(h_2,e) \end{array}\right.$$
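As a rough illustration of the definition above, here is a minimal Python sketch of the word-overlap count and the three-way comparison, applied to the example sentences from the introduction. The function names `overlap_count` and `compare` are hypothetical, and whitespace tokenization with per-token counting is an assumption; the provided program may tokenize and count differently.

```python
# -*- coding: utf-8 -*-
def overlap_count(h, e):
    """l(h, e): number of tokens in hypothesis h that also occur in reference e."""
    ref_words = set(e)
    return sum(1 for word in h if word in ref_words)


def compare(h1, h2, e):
    """f(h1, h2, e): 1 if h1 overlaps more with e, -1 if h2 does, 0 on a tie."""
    l1 = overlap_count(h1, e)
    l2 = overlap_count(h2, e)
    return (l1 > l2) - (l1 < l2)


# Assumption: plain whitespace tokenization, punctuation left attached.
ref = "This type of notebook is said to be highly prized by writers and travellers.".split()
sys_a = "This type of zápisníku was very ceněn writers and cestovateli.".split()
sys_b = "This type of notebook was very prized by writers and travellers.".split()

print(overlap_count(sys_a, ref))  # 5
print(overlap_count(sys_b, ref))  # 9
print(compare(sys_a, sys_b, ref))  # -1, i.e. system B is judged better
```

Under these assumptions system A matches 5 reference words and system B matches 9, so the function prefers system B, in line with the intuition from the introduction.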

We can compare the results of this function with those of a human annotator who rated the same translations.

python compare-with-human-evaluation < eval.out

The Challenge

Your challenge is to improve the accuracy of automatic evaluation as much as possible. Replacing $\ell(h,e)$ with the simple METEOR metric is sufficient to pass. Simple METEOR computes the weighted harmonic mean of precision and recall. That is:

$$\ell(h,e) = \frac{P(h,e)\cdot R(h,e)}{(1-\alpha)R(h,e)+\alpha P(h,e)}$$

where $P$ and $R$ are precision and recall, defined as:

$$\begin{array}{c} R(h,e) = \frac{|h\cap e|}{|e|}\\ P(h,e) = \frac{|h\cap e|}{|h|} \end{array}$$
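The formulas above might be sketched in Python as follows. This is only one possible reading: the name `simple_meteor` is hypothetical, $|h\cap e|$ is interpreted here as a multiset intersection of tokens (so repeated words are clipped to their count in the reference), and `alpha=0.5` is a placeholder rather than a tuned value.

```python
from __future__ import division  # true division if run under Python 2

from collections import Counter


def simple_meteor(h, e, alpha=0.5):
    """Weighted harmonic mean of unigram precision and recall.

    h and e are lists of tokens. Returns 0.0 when nothing matches,
    which also avoids dividing by zero.
    """
    # Assumption: |h ∩ e| is the size of the multiset intersection.
    matches = sum((Counter(h) & Counter(e)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(h)
    recall = matches / len(e)
    return (precision * recall) / ((1 - alpha) * recall + alpha * precision)


# Example sentences from the introduction, split on whitespace (an assumption).
ref = "This type of notebook is said to be highly prized by writers and travellers.".split()
sys_b = "This type of notebook was very prized by writers and travellers.".split()

score = simple_meteor(sys_b, ref, alpha=0.5)
print(round(score, 2))  # 0.72
```

Note that with this formula `alpha = 1` reduces the score to recall and `alpha = 0` reduces it to precision, so values in between trade one off against the other.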

Be sure to tune the parameter $\alpha$, which balances precision and recall. This is a very simple baseline to implement. However, evaluation is not solved, and the goal of this assignment is for you to experiment with methods that yield improved predictions of relative translation accuracy. Some things that you might try:

But the sky’s the limit! Automatic evaluation is far from solved, and there are many different solutions you might invent. You can try anything you want as long as you follow the ground rules:

Ground Rules

  • You can work independently or in groups of up to three, under these conditions:
    1. You must announce the group publicly on Piazza.
    2. You agree that everyone in the group will receive the same grade on the assignment.
    3. You can add people or merge groups at any time before the assignment is due. You cannot drop people from your group once you’ve added them. We encourage collaboration, but we will not adjudicate Rashomon-style stories about who did or did not contribute.
  • You must turn in three things:
    1. Your automatic judgements of the entire dataset, uploaded to the leaderboard submission site according to the Assignment 0 instructions. You can upload new output as often as you like, up until the assignment deadline.
    2. Your code. Send us a URL from which we can get the code and git revision history (a link to a tarball will suffice, but you’re free to send us a github link if you don’t mind making your code public). This is due at the deadline: when you upload your final answer, send us the code. You are free to extend the code we provide or roll your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
    3. A clear, mathematical description of your algorithm and its motivation written in scientific style. This needn’t be long, but it should be clear enough that one of your fellow students could re-implement it exactly.
  • You do not need any other data than what we provide. You are free to use any code or software you like, except for software expressly intended to evaluate machine translation output. You must write your own evaluation function. If you want to use part-of-speech taggers, syntactic or semantic parsers, machine learning libraries, thesauri, or any other off-the-shelf resources, go nuts. But evaluation software like BLEU, TER, METEOR, or any of their many variants is off-limits. You may of course inspect these systems if it helps you understand how they work. If you aren’t sure whether something is permitted, ask us. If you want to do system combination, join forces with your classmates.

Credits: This assignment was designed by Chris Dyer based on one we gave in 2012, which also inspired a whole series of papers.