The task is to translate from Urdu to English using existing grammar rules and language models.
##Getting Started
To begin, download the starter kit. You may either choose to develop locally or on Penn servers. For the latter, we recommend using the Biglab machines, whose memory and runtime restrictions are much less stringent than those on Eniac. The Biglab servers can be accessed directly using the command ssh PENNKEY@biglab.seas.upenn.edu, or from Eniac using the command ssh biglab.
You can run the default system using the command:
./default > 1-best.en
The default system uses a very simple method to generate a translation based on word-to-word translation and ignores complex grammars and language models.
word_e = argmax_P(word_e|word_f) word_e
To grade, use the command:
./grade-dev < 1-best.en
The grade program will evaluate the translation using BLEU score, and print the BLEU stats features as well as the BLEU score. the BLEU score is counted using the formula:
score = exp(min(0,(1 - refcount/hypcount))+ln(grammatch_1...grammatch_4/gramcount_1..gramcount_4)
Your task is to improve the score as much as possible.
To implement the chart decoder, you can use the ./baseline code, which provides the interface to grammar and language models.
To run the baseline, use
./baseline > hw5.txt
The data you may use:
the file to be translated - ‘./data-dev/test.ur’
grammar - ‘./data-train/grammar’
The format of grammar is as follows:
[S] | [X,1] | [X,1] | features | alignment |
for example
[X] | ” ” جالبین “ بھی | ” ” jalibeen “ “ also | 8.93287 6.92926 1 1.00000 0 0 | 0-0 1-1 2-2 4-5 |
The features are:
p(e | f) p(f | e) Lex(e | f) Lex(f | e) rarity phrase-penalty |
The feature value are the negative logarithm of the actual probabilit value, that is:
-log(p)
Alignment is:
source_word - target_word
lm contains 1-grams, 2-grams and 3-grams language model. lm values are the logarithm of language model probabilities. Note this is different from the grammar value.
tune data You may want to tune the parameters. You can use the data in ‘./data-dev/’ for tuning.
train data
You may want to train your grammar model by yourself. Then you can use the train data in ‘./data-train/train/’ to extract the SCFG rules.
Your work is to increase the BLEU score as much as you can. Implementing CKY algorithm and integrating language model with some simple feature engineering will help you beat the baseline. Specific algorithm can be found on textbook chapter 11. To improve:
You must work independently on this assignment.
You should submit each of the following:
An alignment of the entire dataset, uploaded from any Eniac or Biglab machine using the command turnin -c cis526 -p hw1 hw1.txt. You may submit new results as often as you like, up until the assignment deadline. The output will be evaluated using a secret metric, but the grade program will give you a good idea of how well you’re doing. The top few positions on the leaderboard will receive bonus points on this assignment.
Your code, uploaded using the command turnin -c cis526 -p hw1-code file1 file2 …. This is due 24 hours after the leaderboard closes. You are free to extend the code we provide or write your own in whatever langugage you like, but the code should be self-contained, self-documenting, and easy to use.
A report describing the models you designed and experimented with, uploaded using the command turnin -c cis526 -p hw1-report hw1-report.pdf. This is due 24 hours after the leaderboard closes. Your report does not need to be long, but it should at minimum address the following points:
Since we have already given you a concrete problem and dataset, you do not need describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.
Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.