
Crowdsourcing Translation: Homework 5


For most statistical machine translation problems, one of the first difficult tasks that the researchers face is gathering the appropriate data. While it is straightforward to gather a parallel language corpus between two widely spoken languages, constructing a similar dataset for rare languages is very costly and time-consuming. The idea of crowdsourcing translation is to reduce the cost by hiring many non-professional translators. However, it is difficult to know the quality of non-professionals’ translated sentences before manually checking them, which is also costly.

For this assignment, you will be given a sentence corpus with multiple candidate translations generated by crowdsourcing. The goal is to choose a single sentence among the crowdsourced translations that is most similar to the reference sentences generated by professional translators. During the grading procedure, your choices will be evaluated using a BLEU score metric.


  • Ambati, Vogel, and Carbonell. “Active Learning and Crowd-Sourcing for Machine Translation”
  • Snow, O’Connor, Jurafsky, and Ng. “Cheap and Fast–But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks”
  • Omar Zaidan and Chris Callison-Burch. “Crowdsourcing Translation: Professional Quality from Non-Professionals.” 2011. In Proceedings ACL-2011.

Getting Started

To begin, download the Homework 5 starter kit. You may develop either locally or on Penn’s servers. For the latter, we recommend the Biglab machines, whose memory and runtime restrictions are much less stringent than those on Eniac. The Biglab servers can be accessed directly using the command ssh, or from Eniac using the command ssh biglab.

Data. The data consist of four separate files:

  • data-train/train_translations.tsv: this file contains the source sentence, four LDC-translated (professional) sentences, four Turk-translated (non-professional) sentences, and four worker identification strings, one for each non-professional sentence. The four professional translations serve as reference sentences that you may use to train your own model. This file covers 20% of the original data (358 sentences).
  • data-train/survey.tsv: this file contains information on Turk-translators (non-professional translators) in the following categories: Turk-translator id, native English speaker, native Urdu speaker, whether currently residing in India, whether currently residing in Pakistan, years of speaking English, and years of speaking Urdu.
  • data-train/train_postedited_translations.tsv: this file contains the post-edited versions of the translations, and the ranking judgement about their quality. The edits were made by U.S. residents who can also speak Urdu (with the expectation that these editors would turn the original sentences generated by Turkers into natural English). For each source sentence, there are a total of 10 post-edited sentences along with the editor’s worker identification string.
  • data-test/test_translations.tsv: this file contains all 1792 sentences with the following fields: Urdu source sentence, four Turk-translated (non-professional) sentences, and four worker identification strings, one for each non-professional sentence. Note that the first 358 sentences are the same as those in ‘data-train/train_translations.tsv’.

Summary: Of the 1792 sentences, only the first 20% (358 sentences) are provided as a training set, which also includes four reference sentences per source sentence. We suggest that you use this training set to develop features that may improve your score. The test set contains all 1792 sentences without any reference sentences. Specifically, the test file contains the following fields: source (Urdu) sentence, Turk translation1, Turk translation2, Turk translation3, Turk translation4, WorkerID1, WorkerID2, WorkerID3, WorkerID4. Using this test set, you are required to predict the best candidate sentence for each source sentence.
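
As a rough illustration of the test-file layout described above, the following Python sketch parses tab-separated rows into a source sentence, four candidates, and four worker IDs. The function name `read_test_rows`, the `has_header` flag, and the assumption that the file starts with a header line are all hypothetical; check the actual file in the starter kit before relying on them.

```python
import csv
from typing import Dict, Iterable, List


def read_test_rows(lines: Iterable[str], has_header: bool = True) -> List[Dict[str, object]]:
    """Parse rows of: source, four Turk translations, four worker IDs.

    Assumption: fields are tab-separated and quotes are ordinary characters.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for index, fields in enumerate(reader):
        if has_header and index == 0:
            continue  # skip the (assumed) header line
        if len(fields) != 9:
            raise ValueError(f"row {index}: expected 9 fields, got {len(fields)}")
        rows.append({
            "source": fields[0],          # Urdu source sentence
            "candidates": fields[1:5],    # Turk translation1..4
            "workers": fields[5:9],       # WorkerID1..4
        })
    return rows
```

A file opened with `open(path, encoding="utf-8", newline="")` can be passed directly as `lines`.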

Objective Function. For each source sentence, you must choose the best translation (the ‘solution’, $t^*_i$) among the four non-professionally translated sentences. Using the scoring metric (smoothed BLEU), your solution will be evaluated against each of the four reference sentences available to the grader ($e_{ref,i,j}$). Your score for each source sentence will be the average of these four smoothed-BLEU scores, and your final score will be the average of the sentence-level scores over all $n$ source sentences. Thus, your score will lie on a 0.0 to 1.0 scale (higher scores are better).

$$ \text{Score} = \frac{1}{4n}\sum_{i=1}^n \sum_{j=1}^4 \text{BLEU}(t^*_i, e_{ref,i,j}) $$
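
To make the formula concrete, here is a minimal Python sketch of a sentence-level smoothed BLEU and the averaging over references and sentences. The exact smoothing used by the grader is not specified here, so the add-one smoothing of n-gram precisions, the whitespace tokenization, and the function names are assumptions for illustration only; scores from the provided grade script may differ.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothing (an assumed smoothing scheme)."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often as in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        log_precision += math.log((matches + 1) / (total + 1))
    # Brevity penalty in log space: zero when the hypothesis is at least as long as the reference.
    brevity = min(0.0, 1.0 - len(ref) / len(hyp))
    return math.exp(log_precision / max_n + brevity)


def average_score(choices: List[str], references: List[List[str]]) -> float:
    """Average over sentences of the mean BLEU against each sentence's references."""
    if not choices:
        return 0.0
    total = 0.0
    for choice, refs in zip(choices, references):
        total += sum(smoothed_bleu(choice, ref) for ref in refs) / len(refs)
    return total / len(choices)
```

With four references per sentence, `average_score` computes the double sum in the formula divided by 4n.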

Default Implementation. The default system simply selects the first translated sentence (the sentence translated by the first Turker on each task) as the most likely translation of each corresponding source sentence. This guarantees consistent output. To generate the default output, run:

python default > output.txt

To grade the output, run:

python grade < output.txt
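
For reference, the selection rule of the default system described in this section can be sketched in a few lines of Python. This is not the provided default script; the function names and the one-sentence-per-line output format are assumptions.

```python
import sys
from typing import List, TextIO


def choose_first(candidate_sets: List[List[str]]) -> List[str]:
    """Pick the first Turker's translation for every source sentence."""
    return [candidates[0] for candidates in candidate_sets]


def write_choices(choices: List[str], out: TextIO = sys.stdout) -> None:
    """Write one chosen translation per line (assumed output format)."""
    for choice in choices:
        out.write(choice + "\n")
```

Replacing `choose_first` with your own selection function is the natural place to plug in a better ranker.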

The Challenge

The most pressing goal is to surpass the baseline. After you do this, you should aim to maximize the objective score on the test set. Clever feature engineering and application of machine learning techniques will be effective.

Doing some additional feature engineering (for example, adding sentence-level features) should be enough to beat our baseline. However, there will still be substantial room for improvement.
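
As one possible example of a sentence-level feature, the Python sketch below scores each candidate by how much its vocabulary agrees with the other three candidates and picks the candidate with the highest agreement. The Jaccard word overlap and all function names are illustrative assumptions, not a prescribed method.

```python
from typing import List


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def consensus_feature(candidates: List[str], index: int) -> float:
    """Mean overlap between one candidate and all the other candidates."""
    others = [c for i, c in enumerate(candidates) if i != index]
    if not others:
        return 0.0
    return sum(overlap(candidates[index], other) for other in others) / len(others)


def pick_by_consensus(candidates: List[str]) -> int:
    """Return the index of the candidate that agrees most with the rest."""
    scores = [consensus_feature(candidates, i) for i in range(len(candidates))]
    return max(range(len(candidates)), key=scores.__getitem__)
```

A feature like this needs no reference translations, so it can be computed on the test set as well as on the training set.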

Possible Extensions.

  • Use of Machine Learning along with new features from provided data
  • Use of Language Model
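
As an illustration of deriving a new feature from the provided data, the Python sketch below estimates each worker's average quality from scored training translations and then prefers the candidate written by the worker with the best average. How the training scores are obtained (for example, BLEU against the four references) is left open; the function names and the fallback value for unseen workers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def worker_averages(scored: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the training scores of each worker from (worker_id, score) pairs."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for worker, score in scored:
        totals[worker] += score
        counts[worker] += 1
    return {worker: totals[worker] / counts[worker] for worker in totals}


def pick_by_worker(workers: List[str], averages: Dict[str, float], default: float) -> int:
    """Return the index of the candidate whose worker has the highest average.

    Workers unseen in training receive the assumed fallback score `default`.
    """
    return max(range(len(workers)), key=lambda i: averages.get(workers[i], default))
```

In practice a feature like this would be one input among several to a learned model rather than a selection rule on its own.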

Ground Rules

  • You must work independently on this assignment.
  • You must turn in three things:
    1. Your output. Upload your results with the command turnin -c cis526 -p hw5 hw5.txt from any Eniac or Biglab machine. You can upload new output as often as you like, up until the assignment deadline. You will only be able to see your results on test data after the deadline.
    2. Your code, uploaded using the command turnin -c cis526 -p hw5-code file1 file2 .... This is due 24 hours after the leaderboard closes. You are free to extend the code we provide or write your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
    3. A report describing the models you designed and experimented with, uploaded using the command turnin -c cis526 -p hw5-report hw5-report.pdf. This is due 24 hours after the leaderboard closes. Your report does not need to be long, but it should at minimum address the following points:
      • Motivation: Why did you choose the models you experimented with?
      • Description of models or algorithms: Describe mathematically or algorithmically what you did. Your descriptions should be clear enough that someone else in the class could implement them.
      • Results: You most likely experimented with various settings of any models you implemented. We want to know how you decided on the final model that you submitted for us to grade. What parameters did you try, and what were the results? Most importantly: what did you learn?

      Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.

      Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.

  • You do not need any other data than what we provide. You are free to use any code or software you like, except for those expressly intended to rank translation output. You must write your own ranker. If you want to use machine learning libraries, taggers, parsers, or any other off-the-shelf resources, feel free to do so. If you aren’t sure whether something is permitted, ask us. If you want to do system combination, join forces with your classmates.

Credits: This assignment was developed by Woonki Jeon and Allen Sirolly, under the guidance of Chris Callison-Burch. The assignment instructions are adapted from those for Homework 4.