CIS 526: Machine Translation (Spring 2015)
Christian Barcenas, David Liao, Yesha Ouyang, Fifi Yeung
Given a set of crowd-sourced translations for a corpus, how do you automatically choose the best translations?
Crowdsourcing is the process of procuring services and content by subdividing a project and contracting it out to a large, diverse group of people, rather than relying on traditional employees or organizations. The rise of global online communities and the creation of platforms such as Amazon Mechanical Turk have played a major role in the increasing popularity of crowdsourcing for tasks that are economically infeasible or computationally intractable to automate.
Compared to hiring professional translators, crowdsourcing is a very economical way of producing a set of translations. Depending on the target language and the degree of redundancy required, generating parallel corpora with Mechanical Turk can often cost an order of magnitude less.
However, Amazon offers little opportunity to screen workers before they are selected for a task. Without external quality control, naive crowdsourcing yields disfluent, low-quality translations that do not differ significantly from machine output. Some workers will attempt to game the system (e.g. by using an existing machine translation system like Google Translate), and others will simply be insufficiently qualified to produce translations.
Because of these concerns, the reference data used for evaluation has traditionally come from human experts. Such references can be used for a variety of purposes, including training and evaluating SMT models. If good-quality crowd-sourced translations could be identified and extracted, this would provide a new avenue for cheaply producing reference data.
In Homework 3, you learned about metrics for objectively measuring the quality of machine-produced translations. Armed with this knowledge, you will be given sentences in Urdu and a set of Turker-translated English sentences, as well as feature data about the Turkers themselves. Your task is to create a system that automatically filters out the garbage (i.e. low-quality and/or bad-faith) translations and chooses the best individual sentences.
To begin, download the Homework 5 starter kit. You may choose to develop either
locally or on Penn’s servers. For the latter, we recommend using the Biglab
machines, whose memory and runtime restrictions are much less stringent than
those on Eniac. The Biglab servers can be accessed directly using the command
ssh PENNKEY@biglab.seas.upenn.edu
or from Eniac using the command
ssh biglab
You will also need the scikit-learn (SciKit) package as a dependency.
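If scikit-learn is not already available on the machine you are developing on, a user-level install usually works (this assumes pip is available on your path):
pip install --user scikit-learn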
In the starter kit, you will find three raw data files: reference.p, translations.p, and evaluator_features.p. These pickle files contain the crowdsourced and reference translations, the crowdsourced evaluations of these translations, and other identifying features about the evaluators, such as geographic data. You may use this data to generate features for your own models. Additionally, you will find a text file, translator_features.tsv, containing feature data about the translators. Please note that data-train and data-dev are folders containing exactly the same data.
Specifically, each pickle data file contains a dictionary keyed by seg_id, the unique string identifier for an Urdu sentence.

reference.p is a dictionary of:
'f': the foreign (Urdu) sentence
'e': a list of the four reference English translations

translations.p is a dictionary of:
'inputs': a 4-best list of the candidate translations from which you will be choosing the best single candidate
'input_translators': a list of the uids of the Turkers that translated the sentences in inputs
'votes': a list of (evaluator_uid, best_translator_uid) tuples, where evaluator_uid is the uid of the Turker that evaluated the given 4-best list of input sentences, and best_translator_uid is the uid that the evaluator decided was the best translator, i.e. one of the four uids in input_translators

evaluator_features.p is a dictionary of:
'time_taken': the time taken by the evaluator in seconds
'country': the evaluator’s country of origin
'display_language': the evaluator’s display language
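For concreteness, here is a minimal sketch of loading and inspecting these pickles. The file name inspect_data.py and the exact per-segment nesting are assumptions based on the description above; verify them against the actual starter-kit files.

# inspect_data.py -- minimal sketch of loading the starter-kit pickles.
# The nesting noted in the comments follows the description above and
# should be double-checked against the real files.
import pickle

def load(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

references = load('reference.p')                    # seg_id -> {'f': ..., 'e': [ref1, ref2, ref3, ref4]}
translations = load('translations.p')               # seg_id -> {'inputs': ..., 'input_translators': ..., 'votes': ...}
evaluator_features = load('evaluator_features.p')   # evaluator features: time_taken, country, display_language

# Peek at one segment to confirm the structure.
seg_id = sorted(translations)[0]
print(translations[seg_id]['inputs'])               # 4-best candidate list
print(translations[seg_id]['input_translators'])    # uids of the four translators
print(translations[seg_id]['votes'])                # (evaluator_uid, best_translator_uid) tuples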
We have provided a default program that chooses a translation for each sentence from a list of candidates.
./default > english.out
This ranker simply picks the first translation from each list of candidates. To evaluate these translations on the development set, you can compute their BLEU score against the reference translations. As you know, BLEU is a metric for evaluating the quality of text translated from one language to another, where quality is taken to be the correspondence between the candidate output and a human reference.
For this assignment, we include four human references in the training data, and the grade script that you will be scored on outputs the weighted average of the four BLEU scores.
./grade < english.out
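If you want to track BLEU locally while developing, a rough multi-reference check can be done with NLTK’s corpus_bleu; this is an assumption (NLTK is not part of the starter kit), the helper name local_bleu is hypothetical, and ./grade remains the authoritative score.

# local_bleu.py -- rough local BLEU check; the official number is whatever
# ./grade reports, which may differ in tokenization and weighting.
import pickle
from nltk.translate.bleu_score import corpus_bleu

with open('reference.p', 'rb') as f:
    references = pickle.load(f)

def local_bleu(chosen):
    # chosen: dict mapping seg_id -> the English sentence you selected (a plain string)
    seg_ids = sorted(chosen)
    refs = [[r.split() for r in references[s]['e']] for s in seg_ids]  # four references per segment
    hyps = [chosen[s].split() for s in seg_ids]
    return corpus_bleu(refs, hyps)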
What’s the best you could do by picking other sentences from the list? To give you an idea, we’ve given you an oracle for the development data. Using knowledge of the reference translations, it chooses candidate sentences that maximize the BLEU score.
./oracle | ./grade
The oracle should convince you that it is possible to do much better than the default reranker. Maybe you can improve it by learning some features from the crowdsourced evaluations of the translations. Try a few different settings. How close can you get to the oracle BLEU score?
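Conceptually, the oracle does something like the following greedy, per-sentence approximation. This is a sketch, not the provided ./oracle script; sentence_bleu, the smoothing function, and the sorted output order are assumptions.

# oracle_sketch.py -- greedy per-sentence approximation of an oracle selector.
import pickle
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

with open('reference.p', 'rb') as f:
    references = pickle.load(f)
with open('translations.p', 'rb') as f:
    translations = pickle.load(f)

smooth = SmoothingFunction().method1  # sentence-level BLEU needs smoothing for short sentences

# Output order here is sorted seg_id; match whatever ordering ./grade expects.
for seg_id in sorted(translations):
    refs = [r.split() for r in references[seg_id]['e']]
    # Pick the candidate with the highest sentence-level BLEU against the references.
    best = max(translations[seg_id]['inputs'],
               key=lambda c: sentence_bleu(refs, c.split(), smoothing_function=smooth))
    print(best)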
Your task is to select, for each Urdu sentence, the candidate translation that improves the corpus-level BLEU score as much as possible (higher is better).
For the baseline, you can implement a weighted majority vote over the crowdsourced evaluations to choose the best translation. You could set the weight vector by hand, or you could learn it with machine learning techniques you have studied in class. Features on the translators should be enough to help learn the weight vector.
Consider using features such as the time it took the evaluator to complete the task, the evaluator’s display language, and region. Implementing this learning algorithm should be enough to beat the baseline; however, there is certainly room for improvement.
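As a concrete starting point, here is a minimal sketch of a (weighted) majority vote over the votes lists. The evaluator_weight function is a placeholder assumption: replace it with weights derived from evaluator_features.p and translator_features.tsv, or with learned weights.

# vote_baseline.py -- minimal sketch of a (weighted) majority-vote selector.
import pickle
from collections import defaultdict

with open('translations.p', 'rb') as f:
    translations = pickle.load(f)

def evaluator_weight(evaluator_uid):
    # Placeholder: uniform weights give a plain majority vote.
    return 1.0

# Output order here is sorted seg_id; match whatever ordering ./grade expects.
for seg_id in sorted(translations):
    entry = translations[seg_id]
    scores = defaultdict(float)
    for evaluator_uid, best_translator_uid in entry['votes']:
        scores[best_translator_uid] += evaluator_weight(evaluator_uid)
    if scores:
        # The winning uid is one of the four in input_translators per the data description.
        winner = max(scores, key=scores.get)
        idx = entry['input_translators'].index(winner)
    else:
        idx = 0  # fall back to the first candidate if a segment has no votes
    print(entry['inputs'][idx])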
Here are some ideas:
Use sentence-level features we have studied throughout the class to discriminate good translations from bad ones (1)
For each sentence-level feature, compute a corresponding feature over all of the respective Turker’s translations (1)
Train weights for your feature vector using a binary classifier or a direct optimization method such as Powell’s method, hill climbing, or gradient ascent, but feel free to use any other method you are familiar with (a sketch of one simple option appears after this list).
Remove or penalize translations that may have originated from statistical translation systems like Google Translate (2)
Prioritize translators whose translations generally correlate with expert translations (3)
Incorporate ranking data from a separate Mechanical Turk task (1)
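To make the weight-tuning idea above concrete, here is a hedged sketch of random-restart hill climbing over a linear reranking model. The helpers it expects are hypothetical and left for you to implement: features is a dict mapping each seg_id to a list of per-candidate feature vectors, and dev_bleu scores a full set of choices (e.g. by writing them out and calling ./grade).

# tune_weights.py -- sketch of hill-climbing weight tuning for a linear reranker.
import random

def choose_all(weights, features):
    # features: dict seg_id -> list of per-candidate feature vectors (one per 4-best entry)
    choices = {}
    for seg_id, cand_feats in features.items():
        scores = [sum(w * f for w, f in zip(weights, feats)) for feats in cand_feats]
        choices[seg_id] = scores.index(max(scores))  # index of the chosen candidate
    return choices

def hill_climb(features, dev_bleu, dim, restarts=10, steps=200, step_size=0.1):
    # Random-restart hill climbing: accept a random one-coordinate perturbation
    # only when it improves dev BLEU.
    best_w, best_score = None, float('-inf')
    for _ in range(restarts):
        w = [random.uniform(-1, 1) for _ in range(dim)]
        score = dev_bleu(choose_all(w, features))
        for _ in range(steps):
            candidate = list(w)
            i = random.randrange(dim)
            candidate[i] += random.uniform(-step_size, step_size)
            new_score = dev_bleu(choose_all(candidate, features))
            if new_score > score:
                w, score = candidate, new_score
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score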
(1) Crowdsourcing Translation: Professional Quality from Non-Professionals
(2) The Language Demographics of Amazon Mechanical Turk
(3) Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk
Students will be given two data files, taken from Zaidan and CCB’s paper. The formatted.txt file stores the raw evaluation features and Mechanical Turk metadata that students will parse. The second file, translations_train.tsv, contains the reference translations from the LDC and the original sentences in Urdu. The LDC translations serve as the students’ references for the grade script and consist of 700 lines, covering only roughly half of the corpus.
On the grading side, we have translations_test.tsv
, which contains
all 1456 lines of LDC translations so we can evaluate the students’ performance
on the leaderboard. They will only be graded on the last 746 lines. This is to
prevent them from overfitting or cheating if they use machine learning methods
or train against the BLEU score.
turnin -c cis526 -p hw5 hw5.txt
from any Eniac or Biglab machine. You can upload new output as often as you like, up until the assignment deadline. You will only be able to see your results on the test data after the deadline.
Submit your code with
turnin -c cis526 -p hw5-code file1 file2 ...
This is due 24 hours after the leaderboard closes. You are free to extend the code we provide or write your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
Submit your report with
turnin -c cis526 -p hw5-report hw5-report.pdf
This is due 24 hours after the leaderboard closes. Your report does not need to be long, but it should at minimum address the following points: