Transliteration is the task of transcribing a word or phrase originally composed in one writing system into another in an understandable form. Transliteration is used to translate “out of vocabulary” (OOV) words in the source language, which have no actual translation in the target language. Proper nouns (people, places, and organizations) are the major source of OOVs, making transliteration a crucial problem. For example, proper nouns make up 40% of search queries and over 70% of most searches and are especially important in interlingual communication in the global media.
This problem manifests between any two languages, even those that employ the same writing system, as it is most desirable to represent differences in pronunciation. It is even more crucial when moving between writing systems, as the source and target languages may have sets of phonemes that do not map bijectively. That is, each language may contain phonemes that do not exist in the other, and may lack phonemes that the other has.
Side note: linguistically, transliteration actually refers to a map between graphemes (units of a writing system) rather than phonemes (units of a language’s sound base); transcription is a more accurate term for what you’re doing in this assignment. However, in the machine translation literature, transcription typically refers to converting speech to text representation rather than text to text.
The results of transcription and transliteration are often quite similar, as letters represent similar sounds in many languages. However, to better represent pronunciation in the target language, transcription does not try to maintain a bijective map between the letters of the two writing scripts. This introduces ambiguity and increases the difficulty of back-transliteration into the source language, but results in a clearer form in the target language that can be understood and pronounced without an understanding of the source language.
Arabic to English transliteration is a particularly interesting problem since the writing scripts and phoneme sets of the two languages are particularly disparate. Additionally, the Arabic writing system may omit vowels, optionally using diacritics to resolve ambiguities.
Some examples. Note that all Arabic words are the same length (4 characters) but the English transliterations are variable length. In these examples and the project, all Arabic text is ordered left-to-right.
Run the default transliteration system:
./default > default.out
The default system assumes a monotonic, one-to-one mapping between characters in the source and target languages. From the training data, it builds a probability map from each Arabic character to the English character it is most commonly aligned to. Then, for each character in a test word, it simply chooses the most probable English character.
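The idea behind the default system can be sketched in a few lines. This is a minimal illustration, not the code in ./default: it assumes that "aligned" means position-by-position pairing of the two words (extra characters in the longer word are ignored), and the three training pairs and the function names train_char_map and transliterate are made up for the example.

```python
from collections import Counter, defaultdict

def train_char_map(pairs):
    """Map each Arabic character to the English character it is most often paired with.

    Assumption: characters are aligned by position (monotonic, one-to-one);
    zip() silently drops the tail of the longer word.
    """
    counts = defaultdict(Counter)
    for english, arabic in pairs:
        for a, e in zip(arabic, english):
            counts[a][e] += 1
    # The most frequent partner is also the most probable one.
    return {a: c.most_common(1)[0][0] for a, c in counts.items()}

def transliterate(word, char_map, unknown=""):
    """Replace each character independently; unseen characters become `unknown`."""
    return "".join(char_map.get(ch, unknown) for ch in word)

# Toy (english, arabic) pairs, invented for illustration only.
toy_pairs = [("ktb", "كتب"), ("klb", "كلب"), ("bkr", "بكر")]
char_map = train_char_map(toy_pairs)
result = transliterate("ركب", char_map)
```

Because every Arabic character yields exactly one English character, a system like this can never insert the vowels that Arabic spelling leaves out.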
However, as discussed above, the one-to-one assumption of the default system is not always the best one to make when designing a transliteration system. To score the default output, run the grader:
./grade < default.out
The output of your transliteration system will be graded against a reference transliteration using the Levenshtein edit distance, which assigns a penalty for each single-character edit (insertion, deletion, or substitution) that must be made to the first string so that it matches the second string. We assign a smaller penalty (0.5) for substitution errors.
Alternatively, you can pipe the output directly into the grader:
./default | ./grade
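To get a feel for the metric, here is a rough sketch of a weighted Levenshtein distance. It is not the grade program: it assumes that insertions and deletions cost 1.0 each and substitutions cost 0.5, and it does not attempt any normalization or averaging that the real grader may apply. The function name weighted_edit_distance is hypothetical.

```python
def weighted_edit_distance(hyp, ref, indel_cost=1.0, sub_cost=0.5):
    """Dynamic-programming edit distance with a cheaper substitution.

    Assumed costs (not taken from the grader's source): insertion and
    deletion cost `indel_cost`, substitution costs `sub_cost`.
    """
    # prev[j] = cost of turning the processed prefix of hyp into ref[:j]
    prev = [j * indel_cost for j in range(len(ref) + 1)]
    for i, h in enumerate(hyp, start=1):
        curr = [i * indel_cost]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + indel_cost,                        # delete h
                curr[j - 1] + indel_cost,                    # insert r
                prev[j - 1] + (0.0 if h == r else sub_cost)  # match or substitute
            ))
        prev = curr
    return prev[-1]

# Two substitutions (k->s, e->i) plus one insertion (g).
example = weighted_edit_distance("kitten", "sitting")
```

Under these assumed costs, a wrong letter is penalized less than a missing or extra one, so getting the output length right matters.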
The training data is provided in data-train/arabic.train, consisting of about 14000 pairs of English and Arabic transliterations. These pairs were taken from Chris Callison-Burch’s data set (from Transliterating From All Languages), and were originally generated by scraping Wikipedia article names. Each line contains an English word and an Arabic word, separated by a tab.
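Reading the training file might look like the following sketch. The UTF-8 encoding and the function name parse_pairs are assumptions; the column order (English first, then Arabic) follows the description above.

```python
def parse_pairs(lines):
    """Turn tab-separated lines into a list of (english, arabic) tuples."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        english, arabic = line.split("\t", 1)
        pairs.append((english, arabic))
    return pairs

# Typical use (assumes the file is UTF-8 encoded):
#   with open("data-train/arabic.train", encoding="utf-8") as f:
#       pairs = parse_pairs(f)
sample = parse_pairs(["ktb\tكتب\n", "\n", "klb\tكلب\n"])
```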
You have 1600 lines of Arabic testing data in data-dev/ar-pub.test. The corresponding English transliterations have been provided for the first 800 lines in data-dev/en-pub.test, so you may score yourself. The remainder will be used to calculate an official score for your submission. For your submission, you should include all 1600 lines so that we can grade appropriately.
Your task is to minimize the edit distance of the English transliteration.
To reach baseline, extend the default system by allowing characters from the source language to map to multiple characters in the destination language, and vice versa. One method is to identify certain digraphs in one language which will map to a single grapheme in the other language.
To improve beyond this, there are many different approaches that one may take, from using a phonetic intermediary representation to spelling- or phrase-based algorithms.
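One possible way to relax the one-to-one assumption is a greedy longest-match lookup over a table whose keys and values may each be more than one character long. The table below is hand-written and purely illustrative; in practice the entries would be learned from the training pairs. The names TABLE and transliterate are hypothetical.

```python
# Hypothetical mapping table: keys are Arabic sequences of length 1 or 2.
TABLE = {
    "ش": "sh",   # one Arabic character -> two English characters
    "خ": "kh",
    "كس": "x",   # two Arabic characters -> one English character
    "ك": "k",
    "س": "s",
    "ب": "b",
    "ر": "r",
}

def transliterate(word, table=TABLE, max_len=2, unknown="?"):
    """Greedy left-to-right decoding that prefers the longest matching key."""
    out = []
    i = 0
    while i < len(word):
        for n in range(max_len, 0, -1):
            chunk = word[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            # No key matched at this position: emit a placeholder and move on.
            out.append(unknown)
            i += 1
    return "".join(out)

one_to_many = transliterate("شرب")
many_to_one = transliterate("بكس")
```

Greedy matching is only one choice; it always prefers the digraph reading even where the two characters should be read separately.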
A strong extension is to apply Hidden Markov Models (HMMs) and alignments of Arabic characters to multiple English letters, as in “Automatic Transliteration of Proper Nouns from Arabic to English.” Here is a simplified (and also somewhat slow) implementation of Hidden Markov Models, which you can choose to use in your project (in particular, the AppliedHMM class will suffice for this project). All that you would need to do then is create correctly formatted training data from the Arabic and English word pairs that we have provided and feed it to the HMM.
An overview of machine transliteration work can be found in “A Comparison of Different Transliteration Models” (2006), which compares hybrid, grapheme-, phoneme-, and correspondence-based systems. Some ideas are:
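To make the HMM idea concrete, here is a small, self-contained Viterbi decoding sketch. It does not use the provided AppliedHMM class and makes no claim about that class's interface. It assumes that hidden states are English letter chunks, that observations are Arabic characters, and that all probabilities are hand-picked toy numbers rather than estimates from the training data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for t in range(1, len(observations)):
        column = {}
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p].get(s, 0.0), p) for p in states
            )
            column[s] = (prob * emit_p[s].get(observations[t], 0.0), prev)
        best.append(column)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path

# Toy model (all numbers are illustrative assumptions).
STATES = ["k", "ki", "t", "ta", "b"]
START_P = {"k": 0.4, "ki": 0.6}
TRANS_P = {"k": {"t": 0.5, "ta": 0.5}, "ki": {"t": 0.2, "ta": 0.8},
           "t": {"b": 1.0}, "ta": {"b": 1.0}, "b": {}}
EMIT_P = {"k": {"ك": 1.0}, "ki": {"ك": 1.0}, "t": {"ت": 1.0},
          "ta": {"ت": 1.0}, "b": {"ب": 1.0}}

decoded = "".join(viterbi(list("كتب"), STATES, START_P, TRANS_P, EMIT_P))
```

Letting a state stand for a chunk such as "ki" is what allows the model to supply vowels that the Arabic spelling omits. For longer words, summing log probabilities instead of multiplying raw probabilities avoids numerical underflow.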
You must work independently on this assignment.
You should submit each of the following:
A transliteration of the entire test set, uploaded from any Eniac or Biglab machine using the command turnin -c cis526 -p trans-hw5 hw5.txt.
You may submit new results as often as you like, up until the assignment deadline.
The output will be evaluated using a secret metric, but the grade program will give you a good idea of how well you’re doing.
The top few positions on the leaderboard will receive bonus points on this assignment.
Your code, uploaded using the command turnin -c cis526 -p trans-hw5-code file1 file2 ...
This is due 24 hours after your leaderboard submission.
You are free to extend the code we provide or write your own in whatever language you like, but the code should be self-contained, self-documenting, and easy to use.
A report describing the models you designed and experimented with, uploaded using the command turnin -c cis526 -p trans-hw5-report hw5-report.pdf. This is due 24 hours after your leaderboard submission. Your report does not need to be long, but it should at minimum address the following points:
Motivation: Why did you choose the models you experimented with?
Description of models or algorithms: Describe mathematically or algorithmically what you did. Your descriptions should be clear enough that someone else in the class could implement them. If you used any papers, you should reference them and (potentially) describe any differences in your implementation; if you used any outside MT systems, you should describe them.
Results: You most likely experimented with various settings of any models you implemented. We want to know how you decided on the final model that you submitted for us to grade. Most importantly: what did you learn?
Since we have already given you a concrete problem and dataset, you do not need to describe these as if you were writing a full scientific paper. Instead, you should focus on an accurate technical description of the above items.
Note: These reports will be made available via hyperlinks on the leaderboard. Therefore, you are not required to include your real name if you would prefer not to do so.
You may only use data or code outside of what is provided with advance permission. We will ask you to make your resources available to everyone. You should not use any pre-existing transliteration systems. However, you are free to use external alignment or translation systems.
Any questions should be posted on the course Piazza page.