## Introduction

Statistical Machine Translation relies heavily on the availability of bilingual training data and high-quality reference translations. Large parallel corpora mean more data for estimating the parameters of the translation model, which in turn implies better translation probabilities and higher translation quality. However, finding such bilingual corpora is not always feasible, and building one from scratch can be a daunting and expensive task.
Similarly, one of the essential components of Statistical Machine Translation is a set of high-quality reference translations. An example of such a translation is *فرانس کی تجویز کی حمایت*, which translates to *France has supported the proposal*.
The traditional methods of obtaining high-quality translations are to employ professional translators or to crawl the web for bilingual sources, both of which are very expensive and time consuming.
It is possible to address both of the above problems at a much lower cost by outsourcing the translation effort to non-professionals. Below are the crowd-sourced translations for the sentence above.
However, naively accepting these translations would lead to lower-quality, inconsistent, and biased results.
The task is to use these crowdsourced translations and the metadata associated with them to obtain near-professional translation quality. Such crowdsourced translations can then serve both as bilingual training data and as high-quality reference translations.
## Data

The data has been split into 3 files:
### Test and train set

We plan to use 400 of the LDC translations as the test set, which will not be provided. These will be used for scoring on the leaderboard. The rest of the translations may be used as references for training.
## Scoring function

The scoring objective is the BLEU similarity of each chosen translation to the 4 reference (LDC) translations. We have used a modified BLEU scoring function that computes n-gram matches against all 4 reference sentences instead of just 1. The BLEU score is then accumulated over the entire translation set.
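The exact implementation lives in compute-bleu.py; the function below is only a minimal sketch of the multi-reference idea, assuming sentences are already tokenized into lists and glossing over the smoothing details of the real scorer.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def multi_ref_bleu(hypotheses, reference_sets, max_n=4):
    """Corpus-level BLEU against multiple references per sentence (sketch).

    `hypotheses` is a list of token lists; `reference_sets` is a list of
    lists of token lists (here, the 4 LDC references for each sentence).
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len, ref_len = 0, 0
    for hyp, refs in zip(hypotheses, reference_sets):
        hyp_len += len(hyp)
        # use the reference length closest to the hypothesis length
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            # clip each n-gram count by its maximum count over all references
            max_ref = Counter()
            for r in refs:
                for ng, c in Counter(ngrams(r, n)).items():
                    max_ref[ng] = max(max_ref[ng], c)
            matches[n - 1] += sum(min(c, max_ref[ng]) for ng, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    precisions = [float(m) / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - float(ref_len) / max(hyp_len, 1))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```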
The turker translations and their metadata are to be used to generate features, whose weights can then be optimized with methods like PRO and MERT against the reference translations given in the LDC data.
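One way to set this up is to score each crowdsourced candidate with a weighted linear combination of features and pick the highest-scoring one. The sketch below uses made-up feature names and a hypothetical `votes` metadata field purely for illustration; the real fields are whatever the provided data files contain.

```python
def extract_features(candidate, metadata):
    """Turn one crowdsourced candidate plus its metadata into a feature dict.
    The feature names here are illustrative guesses, not the actual fields.
    """
    tokens = candidate.split()
    return {
        "length": float(len(tokens)),                 # longer answers are often more complete
        "is_ascii": float(all(ord(c) < 128 for c in candidate)),
        "votes": float(metadata.get("votes", 0)),     # hypothetical worker-agreement signal
    }

def choose_translation(candidates, metadatas, weights):
    """Pick the candidate whose weighted feature score is highest."""
    def score(candidate, metadata):
        feats = extract_features(candidate, metadata)
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return max(zip(candidates, metadatas), key=lambda cm: score(*cm))[0]
```

The weights in this linear model are exactly what MERT or PRO would tune against the LDC references.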
## Default System

The starter kit includes a file default.py, which is an implementation of a simple crowd-sourced translator. The default system naively chooses the first translation and submits it for scoring (i.e. writes it to standard output).
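Roughly, the default system behaves like the sketch below. The column layout assumed here for turk_translations.tsv (a sentence id followed by the candidate translations) is a guess; consult the provided files and the real default.py for the actual format.

```python
#!/usr/bin/env python
# Sketch of a default system: for each source sentence, emit the first
# crowdsourced translation unchanged.
import sys

for line in open("turk_translations.tsv"):
    fields = line.rstrip("\n").split("\t")
    # fields[0] is assumed to be the sentence id; the rest are candidate
    # translations from different turkers.
    candidates = fields[1:]
    sys.stdout.write(candidates[0] + "\n")
```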
You now have a simple program that chooses a crowd-sourced translation as the final translation. Test it out using this command:
```
default.py > output
```
We can compute the BLEU score of the output using the command:
```
compute-bleu.py < output
```
Currently this system gives a BLEU score of 0.239.
## Baseline
To beat the baseline, implement your own optimization method with a combination of additional features such as language model probabilities and TER.
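For example, a crude language-model feature can be estimated from any English text you have available (such as the training-side references). The add-one smoothed bigram model below is only a toy sketch; a real system would use a proper LM toolkit.

```python
import math
from collections import Counter

class BigramLM(object):
    """Add-one smoothed bigram language model; a toy stand-in for a real LM."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def logprob(self, sentence):
        """Log probability of a sentence, usable as a reranking feature."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        lp = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            lp += math.log((self.bigrams[(prev, cur)] + 1.0) /
                           (self.unigrams[prev] + self.vocab_size))
        return lp
```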
## The Challenge
Your task for this assignment is to improve the accuracy of the crowdsourced translations as much as possible. Implementing MERT with simple features like bigram and trigram scores should be enough to cross the baseline; a simplified stand-in for such tuning is sketched below.
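True MERT performs an exact line search along feature directions; the random search below is only a simplified stand-in that tunes the weights of the reranker sketched earlier (`choose_translation`) against corpus BLEU on development references (`multi_ref_bleu`).

```python
import random

def tune_weights(dev_sentences, references, feature_names, trials=1000, seed=0):
    """Random search over weight vectors, keeping the one with the best BLEU.

    `dev_sentences` is a list of (candidates, metadatas) pairs per source
    sentence; `references` holds the corresponding LDC reference token lists.
    """
    rng = random.Random(seed)
    best_weights, best_bleu = None, -1.0
    for _ in range(trials):
        weights = {name: rng.uniform(-1.0, 1.0) for name in feature_names}
        outputs = [choose_translation(cands, metas, weights).split()
                   for cands, metas in dev_sentences]
        bleu = multi_ref_bleu(outputs, references)
        if bleu > best_bleu:
            best_weights, best_bleu = weights, bleu
    return best_weights, best_bleu
```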
However, you can implement the following extensions to further improve the accuracy:
- Your translations of turk_translations.tsv. Upload your results with the command `turnin -c cis526 -p hw5 hw5.txt` from any Eniac or Biglab machine. You can upload new output as often as you like, up until the assignment deadline. You will only be able to see your results on test data after the deadline.
- Your code, uploaded using the command `turnin -c cis526 -p hw5-code file1 file2 …`. This is due 24 hours after the leaderboard closes. You are free to extend the code we provide or write your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
- A report describing the models you designed and experimented with, uploaded using the command `turnin -c cis526 -p hw5-report hw5-report.pdf`. This is due 24 hours after the leaderboard closes. Your report does not need to be long, but it should at minimum address the following points:
  - Results: You most likely experimented with various settings of any models you implemented. We want to know how you decided on the final model that you submitted for us to grade. What parameters did you try, and what were the results? Most importantly: what did you learn?
Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.
Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.