For most statistical machine translation problems, one of the first difficult tasks researchers face is gathering appropriate data. While it is straightforward to gather a parallel corpus between two widely spoken languages, constructing a similar dataset for a rare language is very costly and time-consuming. The idea of crowdsourcing translation is to reduce this cost by hiring many non-professional translators. However, it is difficult to know the quality of a non-professional's translated sentences without manually checking them, which is itself costly.
For this assignment, you will be given a sentence corpus with multiple candidate translations generated by crowdsourcing. The goal is to choose a single sentence among the crowdsourced translations that is most similar to the reference sentences generated by professional translators. During the grading procedure, your choices will be evaluated using a BLEU score metric.
To begin, download the Homework 5 starter kit. You may either choose to develop locally or on Penn’s servers. For the latter, we recommend using the Biglab machines, whose memory and runtime restrictions are much less stringent than those on Eniac. The Biglab servers can be accessed directly using the command
ssh PENNKEY@biglab.seas.upenn.edu, or from Eniac using the command
Data. The data consist of four separate files:
data-train/train_translations.tsv: this file contains the source sentence, four LDC-translated (professional) sentences, four Turk-translated (non-professional) sentences, and four worker identification strings matching each non-professional sentence to the worker who produced it. The professional translations serve as four reference sentences which students may use to train their own models. This file covers 20% of the original data (358 sentences).
data-train/survey.tsv: this file contains information on Turk-translators (non-professional translators) in the following categories: Turk-translator id, native English speaker, native Urdu speaker, whether currently residing in India, whether currently residing in Pakistan, years of speaking English, and years of speaking Urdu.
data-train/train_postedited_translations.tsv: this file contains the post-edited versions of the translations, along with ranking judgments about their quality. The edits were made by U.S. residents who also speak Urdu, with the expectation that these editors would turn the Turkers' original sentences into natural English. For each source sentence, there are a total of 10 post-edited sentences, each with the editor's worker identification string.
data-test/test_translations.tsv: this file contains all 1792 sentences with the following fields: Urdu source sentence, four Turk-translated (non-professional) sentences, and four worker identification strings matching each non-professional sentence to the worker who produced it. Note that the first 358 sentences are the same as those in ‘data-train/train_translations.tsv’.
Summary: Of the 1792 sentences, only the first 20% (358 sentences) are provided to students as a training set, which also includes four reference sentences per source sentence. Students are encouraged to use this training set to generate plausible features that may improve their scores. The test set contains the entire 1792 sentences without any reference sentences. Specifically, this file contains the following fields: source (Urdu) sentence, Turk translation1, Turk translation2, Turk translation3, Turk translation4, WorkerID1, WorkerID2, WorkerID3, WorkerID4. Using the test set described above, students are required to predict the most probable candidate sentence for each source sentence.
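A minimal parser for the test file might be sketched as follows. The column order (source sentence, four translations, four worker IDs) is assumed from the description above, and `read_test_rows` is an illustrative name, not part of the starter kit:

```python
import csv

def read_test_rows(path):
    """Parse each test TSV row into (source, translations, worker_ids).

    Assumes the column order described above: one Urdu source sentence,
    four Turk translations, then four worker identification strings.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for fields in csv.reader(f, delimiter="\t"):
            source = fields[0]
            translations = fields[1:5]
            worker_ids = fields[5:9]
            rows.append((source, translations, worker_ids))
    return rows
```

Adjust the slicing if the actual file layout differs from the description.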
Objective Function. For each source sentence, you must choose the best translation (your ‘solution’) among the four non-professionally translated sentences. Using the scoring metric (smoothed BLEU), your solution will be evaluated against each of the four reference sentences available to the grader. Your score for each source sentence is the average of these four smoothed-BLEU scores, and your final score is the average of the sentence-level scores. Thus, your score falls on a 0.0 to 1.0 scale (higher scores are better).
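The scoring procedure can be sketched as below. This uses add-one smoothing on the n-gram precisions, which is one common smoothed-BLEU variant; the grader's exact formula may differ, so treat this as an illustration of the metric rather than its specification:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing on each n-gram precision."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        overlap = sum((_ngrams(cand, n) & _ngrams(ref, n)).values())
        total = max(len(cand) - n + 1, 0)
        # Add-one smoothing keeps the score nonzero when a precision is 0.
        log_prec += math.log((overlap + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_prec / max_n)

def sentence_score(solution, references):
    """Average smoothed BLEU of one chosen translation against the references."""
    return sum(smoothed_bleu(solution, r) for r in references) / len(references)
```

Your final score would then be the mean of `sentence_score` over all source sentences.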
Default Implementation. The default system simply selects the first translated sentence (the sentence translated by the first Turker on each task) as the most likely translation of each corresponding source sentence. This guarantees consistent output. To generate the default output, run:
python default > output.txt
To grade the output, run:
python grade < output.txt
The most pressing goal is to surpass the baseline. After you do this, you should aim to maximize the objective score on the test set. Clever feature engineering and application of machine learning techniques will be effective.
Doing some additional feature engineering (for example, adding sentence-level features) should be enough to beat our baseline. However, there will still be substantial room for improvement.
turnin -c cis526 -p hw5 hw5.txt from any Eniac or Biglab machine. You can upload new output as often as you like, up until the assignment deadline. You will only be able to see your results on the test data after the deadline.
turnin -c cis526 -p hw5-code file1 file2 .... This is due 24 hours after the leaderboard closes. You are free to extend the code we provide or write your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
turnin -c cis526 -p hw5-report hw5-report.pdf. This is due 24 hours after the leaderboard closes. Your report does not need to be long, but it should at minimum address the following points:
Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.
Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.
Credits: This assignment was developed by Woonki Jeon and Allen Sirolly, under the guidance of Chris Callison-Burch. The assignment instructions are adapted from those for Homework 4.