Evaluation Challenge Problem 3
Automatic evaluation is a key problem in machine translation.
Suppose that we have two machine translation systems. On one
sentence, system A outputs:
This type of zápisníku was very ceněn writers and cestovateli.
And system B outputs:
This type of notebook was very prized by writers and travellers.
We suspect that system B is better, though we don’t necessarily
know that its translations of the words zápisníku, ceněn,
and cestovateli are correct. But suppose that we also have
access to the following reference translation.
This type of notebook is said to be highly prized by writers and travellers.
We can easily judge that system B is better. Your challenge is to
write a program that makes this judgement automatically.
Getting started
If you have a clone of the repository from
previous homeworks, you can update it
from your working directory:
git pull origin master
Alternatively, get a fresh copy:
git clone https://github.com/alopez/en600.468.git
Under the evaluate
directory, you now have a simple program
that decides which of two machine translation outputs is better.
Test it out!
python evaluate > eval.out
This assignment uses a very simple evaluation method. Given
machine translations $h_1$ and $h_2$ and a reference translation
$e$, it computes $f(h_1,h_2,e)$ as follows, where
$\ell(h,e)$ is the count of words in $h$ that are also in $e$.
$$f(h_1,h_2,e) = \left\{\begin{array}{lcc}1 & \textrm{if} & \ell(h_1,e) > \ell(h_2,e)\\ 0 & \textrm{if} & \ell(h_1,e) = \ell(h_2,e)\\-1 & \textrm{if} & \ell(h_1,e) < \ell(h_2,e) \end{array}\right.$$
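For concreteness, a minimal Python sketch of this baseline judgement might look like the following; the names word_matches, simple_judgement, h1, h2, and ref are illustrative and need not match the provided evaluate program. Each hypothesis and the reference are assumed to be lists of tokens (e.g. sentence.split()).

def word_matches(h, ref_words):
    # l(h, e): count the tokens of hypothesis h that also appear in the reference
    return sum(1 for w in h if w in ref_words)

def simple_judgement(h1, h2, ref):
    # f(h1, h2, e): 1 if h1 matches more reference words, -1 if h2 does, 0 on a tie
    ref_words = set(ref)
    l1 = word_matches(h1, ref_words)
    l2 = word_matches(h2, ref_words)
    if l1 > l2:
        return 1
    if l1 < l2:
        return -1
    return 0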
We can compare the results of this function with those of a human
annotator who rated the same translations.
python compare-with-human-evaluation < eval.out
The Challenge
Your challenge is to improve the accuracy of automatic evaluation as
much as possible. Improving the metric to use simple METEOR
in place of $\ell$ is sufficient to pass. Simple METEOR computes
the weighted harmonic mean of precision and recall. That is:
$$\ell(h,e) = \frac{P(h,e)\cdot R(h,e)}{(1-\alpha)R(h,e)+\alpha P(h,e)}$$
where $P(h,e)$ and $R(h,e)$ are precision and recall, defined as:
$$\begin{array}{c}
R(h,e) = \frac{|h\cap e|}{|e|}\\
P(h,e) = \frac{|h\cap e|}{|h|}
\end{array}$$
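A possible Python sketch of this metric is below. It matches word types via sets, so repeated words count once; that is one reasonable reading of $|h\cap e|$, and a multiset intersection is another.

def simple_meteor(h, e, alpha=0.5):
    # Precision and recall over the word overlap between hypothesis h and reference e
    h_words, e_words = set(h), set(e)
    overlap = len(h_words & e_words)
    if overlap == 0:
        return 0.0
    precision = float(overlap) / len(h_words)
    recall = float(overlap) / len(e_words)
    # Weighted harmonic mean: alpha trades off precision against recall
    return (precision * recall) / ((1 - alpha) * recall + alpha * precision)

With $\alpha=0.5$ this reduces to the ordinary (unweighted) harmonic mean of precision and recall.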
Be sure to tune the parameter $\alpha$ that balances precision and
recall; one simple way to do this against the human judgements is
sketched below. This is a very simple baseline to implement. However,
evaluation is not solved, and the goal of this assignment is for you to
experiment with methods that yield improved predictions of relative
translation accuracy.
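For example, a simple grid search over $\alpha$, scored against the human judgements on a development portion of the data, might look like the sketch below; the data layout and the helper simple_meteor are assumptions carried over from the sketch above, not part of the provided code.

def tune_alpha(pairs, human_labels, alphas=None):
    # pairs: list of (h1, h2, e) token lists; human_labels: corresponding 1/0/-1 judgements
    if alphas is None:
        alphas = [i / 10.0 for i in range(11)]
    best_alpha, best_accuracy = None, -1.0
    for alpha in alphas:
        correct = 0
        for (h1, h2, e), label in zip(pairs, human_labels):
            l1 = simple_meteor(h1, e, alpha)
            l2 = simple_meteor(h2, e, alpha)
            prediction = 1 if l1 > l2 else (-1 if l1 < l2 else 0)
            if prediction == label:
                correct += 1
        accuracy = correct / float(len(pairs))
        if accuracy > best_accuracy:
            best_alpha, best_accuracy = alpha, accuracy
    return best_alpha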
The sky’s the limit! Automatic evaluation is far from solved, and there
are many different solutions you might invent. You can try anything you want
as long as you follow the ground rules:
Ground Rules
- You can work independently or in groups of up to three, under these
conditions:
- You must announce the group publicly on Piazza.
- You agree that everyone in the group will receive the same grade on the assignment.
- You can add people or merge groups at any time before the assignment is
due. You cannot drop people from your group once you’ve added them.
We encourage collaboration, but we will not adjudicate Rashomon-style
stories about who did or did not contribute.
- You must turn in three things:
- Your automatic judgements of the entire dataset, uploaded to the leaderboard submission site according to the Assignment 0 instructions. You can upload new output as often
as you like, up until the assignment deadline.
- Your code. Send us a URL from which we can get the code and git revision
history (a link to a tarball will suffice, but you’re free to send us a
github link if you don’t mind making your code public). This is due at the
deadline: when you upload your final answer, send us the code.
You are free to extend the code we provide or roll your own in whatever
language you like, but the code should be self-contained,
self-documenting, and easy to use.
- A clear, mathematical description of your algorithm and its motivation
written in scientific style. This needn’t be long, but it should be
clear enough that one of your fellow students could re-implement it
exactly.
- You do not need any data other than what we provide. You are
free to use any code or software you like, except for those
expressly intended to evaluate machine translation output.
You must write your own evaluation function. If you want to use
part-of-speech taggers, syntactic or semantic parsers, machine
learning libraries, thesauri, or any other off-the-shelf resources,
go nuts. But evaluation software like BLEU, TER, or METEOR, or any of
their many variants, is off-limits. You may of course inspect these systems
if it helps you understand how they work. If you aren’t sure whether
something is permitted, ask us. If you want to do system combination,
join forces with your classmates.
Credits: This assignment was designed by Chris Dyer based on one we gave
in 2012, which also inspired a whole series of papers.