The language research project will be due on the last day of class, Tuesday April 28th at 11:59pm. Note that the final HW is due at the same time, so we recommend that you do the language research project early.
So far in the class we have mainly focused on the mathematics and algorithms that underlie
machine translation systems. In this project, we will examine properties of languages.
Some of these properties make machine translation more challenging.
You will form a team and research a language. You will:
- collect information about the syntax and morphology of the language
- investigate where it is spoken, and what other languages its speakers are exposed to
- describe its writing system
- gather monolingual and bilingual data for the language
- create your own natural language processing tools for the language
For this exercise, we will target low-resource languages: languages that statistical
machine translation handles poorly, usually because too little training data is
available for them. Here is a
list of languages that you can choose from.
If you would like to pick a different language, please ask for permission from your instructor.
This project is inspired by the upcoming
DARPA Low Resource Languages for Emergent Incidents (LORELEI) Program.
This is a group project, and you can work in teams of 2-6 people (the amount of work that
we expect from you will vary depending on the size of your team). Your write-up should be
completely in Markdown.
You can clone the repo to see how we use Markdown and MathJax.
We suggest using Jekyll to render your write-up before submission.
Core Requirements
These are the core requirements for all teams regardless of size:
- Write a one-page description of the syntax and morphology of the language that includes
information like the following:
- What is the canonical order of the elements of the sentence (like Subject/Object/Verb,
and Adjective/Noun)?
- Does the language have free word order, or does the order tend to be fixed?
- What sort of inflectional morphology does the language use? Do verbs inflect for person,
number, tense?
- How many cases and/or genders does the language have?
- What pronouns does the language have? Are they differentiated by levels of politeness?
- The World Atlas of Language Structures online is a great source of
this information. If possible, your writeup should give examples of sentences in the
language that illustrate these properties, along with
interlinear glosses into English.
You can often find these in the grammar books that WALS cites as evidence for the
features. Grammar books are often available in the Penn library, and are sometimes
available for purchase from Google.
- Give about a half page description of who speaks the language.
How many speakers of the language are there? Where is the language spoken?
What other languages are people in that area exposed to? Are most people bilingual?
Include a map showing where the language is spoken.
- If the language uses a script other than the Roman alphabet, what script does it use?
Is it written left-to-right like English or right-to-left like Arabic?
Are there any major publications that are written in the language?
What fraction of the speakers of the language can write in it?
Are there any impediments (e.g. no keyboards for the script)?
- Do any online machine translation systems exist for the language?
If so, give an example of an English translation of a paragraph from a Wikipedia
article written in that language.
Have there been any machine translation research papers that examine the language?
The Machine Translation Archive is a good place to
look for research papers.
Can you find any other existing NLP tools (for instance, parsers, treebanks,
or morphological analyzers)?
Additional Requirements
The number of mandatory additional requirements varies by group size.
Please pick one or more of the projects below, where the point total equals the size of your group. You can get extra credit if your point total exceeds your group size. Some of these projects involve implementing a natural language system. When you are building an NLP system for this language research assignment, you can do something very simple (like the default programs in the HW assignments).
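As a quick way to check a plan against this rule, the arithmetic can be sketched as follows. The point values are copied from the project list in this section, indexed in list order; the function name `point_summary` and the returned field names are hypothetical, and the sketch only reports how many points exceed the group size, since the amount of extra credit is not specified here.

```python
# Point values for the twelve projects in this section, in list order
# (1 = monolingual data, 2 = bilingual data, ..., 12 = inflectional paradigms).
PROJECT_POINTS = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2,
                  7: 3, 8: 1, 9: 2, 10: 2, 11: 2, 12: 1}


def point_summary(group_size, chosen_projects):
    """Summarize whether the chosen project indices cover the group size."""
    total = sum(PROJECT_POINTS[i] for i in chosen_projects)
    return {
        "total": total,
        "meets_requirement": total >= group_size,
        # Points beyond the group size; how these convert to extra credit
        # is not stated in the assignment, so this is only a count.
        "points_over": max(0, total - group_size),
    }
```

For instance, `point_summary(3, [1, 3, 7])` describes a three-person team that picks the monolingual data, interlinear gloss, and named entity tagger projects.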
- Harvest monolingual data written in that language. (1 point). Good sources for this might be:
- Wikipedia database dumps
- Web crawling international news services like the Voice of America, the BBC World Service, Deutsche Welle, Global Voices
- After you have found some samples of the language’s writing, try searching for some of its words using Google. Can you find web pages written in the language that way?
- Save the text in Unicode format. Include all the raw text in a single file or in a tarball with multiple files. Include a section in your writeup that gives the sources of your data, and describes how much data you were able to collect. If you had a clever way of finding more data, describe your method.
- Try to find bilingual data for the language. (1 point). Potential sources for this are:
- The Bible, The Universal Declaration of Human Rights, Global Voices
- Inter-linear glosses from grammar books about the language
- Finding a bilingual speaker of the language and asking him or her to translate some sentences for you
- Save the bitext in Unicode format. Either include two files (one in English and one in your language) that are sentence-aligned with one sentence per line, or a tarball with parallel documents. Include a section in your writeup that gives the sources of your data, and describes how much data you were able to collect.
- Collect ~100 interlinear glosses from one or more grammar books about the language. (2 points). For this assignment it may be easier to get the grammar book from Google Play (some are linked from WALS). Produce a Unicode CSV file with the following columns:
- The foreign sentence
- The interlinear gloss (with the same number of words as the foreign sentence)
- A translation into English
- A label for the linguistic phenomenon that this sentence illustrates (this could come from the chapter heading in the grammar book, or it can be one of the labels from WALS.info).
- If the language is written in a non-Roman script, build a transliteration system for it. (2 points). Transliteration is the process of writing out how a word sounds in another script. In MT it is especially useful for out-of-vocabulary words, which are often names that never occurred in the bilingual training data.
- The process of transliteration can be handled with the machine translation mechanisms (including using the word aligner to align characters across scripts, and the decoder to find the best transliteration). It is an easier problem since it doesn’t require reordering.
- Read this paper on transliterating from all languages by my former PhD student to see how she did it.
- Use your word aligner and decoder to do transliteration, or use an existing MT system like Moses or Joshua to do transliteration. You can download the transliterated name pairs described in the paper and use them as training data.
- Add a section to your writeup saying how your transliteration system works, and do an informal evaluation of its quality.
- Note: you can potentially extend your transliteration system to be your term project.
- Build a language identification system that is able to predict whether a sample of text is written in the language or not. (2 points).
- Add a section in your writeup that describes what training data you used, how your model works, and how you evaluated its accuracy.
- You can build a model using language-model style features, or you could design a discriminative model that takes into account additional features like Unicode ranges.
- Note: you can potentially extend your language ID system to be your term project.
- Does the language have a presence on Twitter? (2 points). Use the --location-query option of this Twitter stream scraper to harvest tweets from the regions where the language is spoken. You can find latitude and longitude coordinates for a geographic region using this bounding box tool.
- How many tweets were you able to capture from that region?
- Manually label the language used in ~100 tweets. This can be a binary label that indicates whether or not a tweet is written in your language of interest.
- What fraction of the tweets in that region were written in the language?
- If you were not able to locate any tweets in that geographic region, try querying Twitter for some high frequency words in the language.
- Add a section to your write-up describing your effort to collect tweets. Save a file with the tweets that you collected (including their metadata), and a file with your labeled tweets.
- Build a named entity tagger for the language. (3 points). To train the system, you can use an existing piece of software like Stanford’s CoreNLP or OpenNLP. The key to this project will be to create annotated training data in the language that labels names of people, and possibly of locations and organizations.
- Train a named entity recognizer for your language
- Perform an informal evaluation. Does your NER system label any words that did not occur in your training data? Do those appear to be names? Alternatively, hold out some of the names from your training data. Did your system spot them?
- Add a section to your writeup describing what you did.
- Note: you can potentially extend your NER system to be your term project.
- Collect or find a bilingual dictionary for the language. (1 point). You can do this by finding a native language informant to translate words for you, or you can find an online dictionary and harvest entries from it. Your dictionary should be in a CSV file with the following fields:
- The foreign word
- Its English translation
- the source of the translation
- (optional) an example sentence that uses the word
- (optional) its part of speech
- (optional) a definition
- Find a language informant at Penn who knows both English and the language that you are researching. Have the language informant assemble ~50 bilingual sentence pairs and create manual word alignments for them. (2 points).
- Here is a word alignment interface that you could use. It is designed to be used with Amazon Mechanical Turk.
- Describe what you did in your writeup. Save the data somewhere.
- Try to locate language informants on a crowdsourcing platform like Mechanical Turk or CrowdFlower. (2 points). How can you ensure that they actually speak the language? Design a test to ensure that the crowd workers know the language:
- Try implementing a language proficiency test for the language or show them images paired with captions in the language (which you could draw from Wikipedia) and ask them to select the correct caption for the image.
- Gather some information about the people who participate, such as what their native language is, how many years they have spoken English, how many years they have spoken your language, what country they were born in, and what country they live in now.
- In your writeup give a description of what language tests you implemented, how many crowd workers attempted it, and how many passed.
- Have your language informant label part of speech (POS) information for the language. (2 points).
- Try to use this universal POS tag inventory developed by Slav Petrov.
- Read about the project here.
- Write up your experience trying to annotate parts of speech for the language. Did the Universal POS tags work for your language? Were any tags unused? Do you think the tag set was missing anything?
- Save your data in a file, and give a description of the file format.
- Create tables of inflectional paradigms for some of the words in your language. (1 point).
- What types of words inflect? Verbs? Nouns? What are the different things that they inflect for?
- Add a description to your writeup that shows how words inflect in the language. Give your tables, and carefully label the columns to show which features they reflect (things like singular versus plural, masculine versus feminine versus neuter, etc).
- Check to see if Wiktionary can help you. For some languages, it provides example tables.
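For the language identification project listed above, the "language-model style features" approach could look roughly like the following. This is a minimal sketch under stated assumptions, not a reference solution: it scores text with add-one-smoothed character trigram frequencies, decides by comparing a target-language model against a background model, and uses a tiny toy training set. The names `CharNgramModel` and `is_target` and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one-smoothed frequencies of character n-grams seen in training text."""

    def __init__(self, texts, n=3):
        self.n = n
        self.counts = Counter()
        for text in texts:
            self.counts.update(char_ngrams(text, n))
        self.total = sum(self.counts.values())
        # One extra slot so that unseen n-grams receive non-zero probability.
        self.vocab = len(self.counts) + 1

    def avg_log_prob(self, text):
        """Average smoothed log-probability per n-gram of the given text."""
        grams = char_ngrams(text, self.n)
        denom = self.total + self.vocab
        logp = sum(math.log((self.counts[g] + 1) / denom) for g in grams)
        return logp / len(grams)


def is_target(text, target_model, background_model):
    """Predict True when the target-language model scores the text higher."""
    return target_model.avg_log_prob(text) > background_model.avg_log_prob(text)


# Toy training data (hypothetical); a real system needs far more text per class.
target_model = CharNgramModel([
    "habari ya asubuhi",
    "ninapenda kusoma vitabu",
    "karibu nyumbani kwetu",
])
background_model = CharNgramModel([
    "good morning to you",
    "i like reading books",
    "welcome to our home",
])
```

Calling `is_target("kusoma vitabu", target_model, background_model)` then compares the two scores for a new sample. To evaluate accuracy as the project asks, hold out labeled text that neither model was trained on and count how often the prediction matches the label.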
Submission Details
Your final submission should consist of the following items in a compressed archive:
core.md
: Your write-up of the core requirements.
project-1/, project-2/, …
: Directories containing the code, data, and/or other resources associated with your selected tasks.
projects.md
: A write-up discussing your selected tasks and the results you obtained. You should include for each task a brief description of the contents of the associated directory.
Your archive should be submitted using turnin before the deadline.
Notes:
- You may submit either markdown or PDF write-ups, though markdown files are preferred.
- You should number the project directories according to their indices in the list above. For example, if you completed tasks 1, 3, and 7, you would submit the directories project-1/, project-3/, and project-7/.
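The layout described above can be checked and packaged with a short script. The following is only a sketch: the archive format is not specified beyond "compressed archive", so the gzip-compressed tarball and the file name `submission.tar.gz` are assumptions, as are the helper names `expected_entries`, `missing_entries`, and `build_archive`. It also assumes Markdown write-ups named as in the list above.

```python
import tarfile
from pathlib import Path


def expected_entries(task_indices):
    """Names the submission should contain for the given task indices."""
    return ["core.md", "projects.md"] + [
        f"project-{i}/" for i in sorted(task_indices)
    ]


def missing_entries(root, task_indices):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [name for name in expected_entries(task_indices)
            if not (root / name).exists()]


def build_archive(root, task_indices, out_path="submission.tar.gz"):
    """Write a gzip-compressed tarball holding only the expected entries."""
    missing = missing_entries(root, task_indices)
    if missing:
        raise FileNotFoundError(f"missing from {root}: {', '.join(missing)}")
    with tarfile.open(out_path, "w:gz") as tar:
        for name in expected_entries(task_indices):
            # Directories are added recursively by tarfile.
            tar.add(Path(root) / name, arcname=name.rstrip("/"))
    return out_path
```

For the example in the note, `expected_entries([1, 3, 7])` lists the two write-ups followed by the three project directories.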
Resources
Here are some useful resources for researching languages: