Machine Translation : 600.468 : Spring 2012

Instructors: Chris Callison-Burch, Adam Lopez, and Matt Post.

TAs: Jonny Weese and Juri Ganitkevitch.

Time and Place: Tuesdays and Thursdays from 3:00-4:15, Shaffer 202

Office Hours: Tuesdays and Thursdays from 4:15 (immediately after class) or by appointment.

We wrote a TACL paper about the outcome of our competitive homework assignments. Please cite this paper if you reuse the assignments in academic work.
Class is over! If you'd like to reuse any material on this site, you can get it on github. It's licensed under a Creative Commons Attribution 3.0 Unported License, which means you're free to reuse it any way you like, as long as you acknowledge that you got it from us (Adam Lopez, Chris Callison-Burch, and Matt Post).

Level: Senior undergraduate or first-year graduate.

Course Catalog Description: Google Translate can instantly translate between any pair of over fifty human languages (for instance, from French to English). How does it do that? Why does it make the errors that it does? And how can you build something better? Modern translation systems like Google Translate and Bing Translator learn how to translate by reading millions of words of already translated text, and this course will show you how they work. The course covers a diverse set of fundamental building blocks from linguistics, machine learning, algorithms, data structures, and formal language theory, along with their application to a real and difficult problem in artificial intelligence.

Textbook: Some readings will be drawn from Statistical Machine Translation (errata) by Philipp Koehn. You can read it online through the JHU library or purchase it from Amazon. A more compact (but therefore less thorough) survey by Adam Lopez is available for free. Note that the readings (and therefore the textbook) aren't strictly required; what's required is that you understand the concepts and are able to apply them.

Goals: By the end of the course, you should have a good grasp of what goes into building a large-scale natural language processing system, and experience selecting and applying diverse techniques from computer science to solve real-world problems.

Requirements: You'll need strong programming skills for homeworks and a final project. Natural language processing (465) is recommended, but not required.

Course Structure and Grading

Homework (4 assignments, 10 points apiece)

Class meetings will consist mainly of lectures, but the only way to fully understand all of the problems that you need to solve in building a language processing system is to go out and build one. To that end, the course will be evaluated on four competitive homework assignments and a final project.

The goal of each homework assignment will be to build a system that solves a well-defined subproblem of machine translation. Students will earn a passing grade (7 points) by correctly implementing a standard algorithm that we specify, and additional credit for building the best system according to an objective metric: 6 points for the best system, 5 points for the second best, and so on. To receive an A in the class, you must compete! For each task we will provide a simple baseline, datasets, and metrics. The tasks are:

Reranking: find the most accurate translation of a sentence, given an input and a list of ranked alternative translations.
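
As an illustration of what the reranking task involves, here is a minimal sketch (not the course's baseline). It assumes each candidate translation comes with a dictionary of feature values and that a candidate's score is a weighted sum of those features; the feature names (`lm`, `tm`, `length`), the weights, and the example sentences are all hypothetical.

```python
from typing import Dict, List, Tuple

# A candidate is a (translation, features) pair. The feature names used
# below are illustrative assumptions, not part of any assignment spec.
Candidate = Tuple[str, Dict[str, float]]


def score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(candidates: List[Candidate], weights: Dict[str, float]) -> str:
    """Return the translation whose feature vector scores highest."""
    best_translation, _ = max(candidates, key=lambda c: score(c[1], weights))
    return best_translation


# Hypothetical 3-best list for a single input sentence.
nbest: List[Candidate] = [
    ("the house small", {"lm": -7.2, "tm": -1.0, "length": 3.0}),
    ("the small house", {"lm": -4.1, "tm": -1.5, "length": 3.0}),
    ("small the house", {"lm": -8.0, "tm": -0.9, "length": 3.0}),
]
weights = {"lm": 1.0, "tm": 1.0, "length": 0.0}

print(rerank(nbest, weights))
```

With these made-up numbers, the language-model feature pulls the fluent word order to the top even though it is not first in the original list. In a real system, most of the work lies in choosing good features and tuning the weights against an evaluation metric.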

In-Class Presentation: Language in Ten Minutes (10 points)

How are you going to build a machine translation system unless you know at least a little bit about language? You will be required to give a short presentation (~10 minutes) on a particular language that you do not speak natively, e.g., Arabic, Chinese, Czech, Hindi, Italian, or Maltese.

You should prepare three to six slides for your presentation, covering language facts (demographics, location, etc.), important linguistic characteristics (orthography, morphology, syntax), and computational efforts such as resources, tools, and papers. For instance, how many entries are there about the language in the MT Archive, and what are they generally about? Be creative and have fun. Asking for help from native speakers or language experts is great, but you are ultimately responsible for the presentation.

This assignment was inspired by Nizar Habash. You might want to browse the examples from his class, and his list of recommended resources: Ethnologue, Omniglot, About World Languages, and the Machine Translation Archive.

Presentations will be graded on thoroughness and clarity. What did you learn from your research that was really interesting? Tell us!

Final Projects (40 points)

The final project will be designed by the student or groups of students, with guidance from the instructors. As with the homework assignments, it should be on a well-defined problem with clearly identified input, output, and evaluation, and executed with creativity and depth.

Towards the middle of the term you will be required to turn in a brief project proposal (10 points), laying out the problem, your proposed solution, and a plan for implementation and evaluation. Your final project report (20 points) should explain your implementation, evaluation, and analysis, focusing on a single question: What did you learn? The projects will be presented during an interactive poster session during the final exam period (10 points).

The project proposal should be 1-2 pages (there is no hard limit, but it will take us longer to give you feedback if your proposal is long or unclear) and must clearly identify:

A single question or problem related to machine translation. This should be stated in the first paragraph. We strongly advise including some simple examples to illustrate the question or problem.
An outline of the work to be done: how will your project answer the question or attempt to solve the problem? What models and algorithms will you implement? What software will you use?
A description of planned experiments: how will you know if the question was answered or the problem was solved? You should clearly identify input, output, and evaluation strategy.

The proposal is a contract. If we give you full credit for it, that means we expect you to implement it and do a good analysis of the results, and we will give you full credit for the entire project if you do. If you turn in a weak proposal, we will give you the opportunity to submit a revised one before moving forward, but the longer you take to define your project, the less time you will have to implement it, so it's in your best interest to take advantage of this early checkpoint.

Before the proposal is due, you should make an appointment with one of the instructors to discuss project ideas; this will enable you to submit a proposal with full confidence that it will be well-received. Before meeting with us, you might want to browse the topics that we'll be covering later in the term, since these might suggest ideas to you. We will, however, give you fairly wide latitude to choose a topic as long as it's related in some way to translation and is technically interesting, so you should not feel restricted to these topics. We can suggest topics to you in individual meetings if you're stumped, but it will help us to know what your interests and strengths are, so be prepared to tell us what you're curious about.

Group projects of any size are permitted, but we will require an amount of work that is linear in group size, so you should take into account the overhead of group coordination when forming groups. Each group should turn in a single proposal identifying all members. All group members will receive the same grade, and you are stuck with your group members once your proposal is finalized: we refuse to adjudicate stories about who did or did not contribute. Choose your partners carefully.

Quizzes (10 points)

These are mainly designed to help us understand how well you're following along.

Tentative Schedule

Subject to change as the term progresses.

Date	Topics	Lecturer	Readings (*=graduate level)
Jan 31	Introduction [pdf] [keynote]	All	Koehn, chapter 1 * Knight, Automating Knowledge Acquisition for Machine Translation * Weaver, Translation * Kay, Translation
Feb 02	Probability and Language Models [pdf] [keynote]	Lopez	Koehn, chapters 2 and 7
Feb 07	Learning Translation Models: Word Alignment [pdf] [keynote]	Lopez	Koehn, chapter 3 Knight, A Tutorial MT Workbook Collins, Statistical Machine Translation: IBM Models 1 and 2 * Brown et al., A Statistical Approach to Machine Translation
Feb 09	Learning Better Translation Models [pdf] [keynote] Language in 10: Afrikaans	Lopez	Koehn, chapter 4 * Brown et al., The Mathematics of Statistical Machine Translation: Parameter Estimation * Vogel and Ney, HMM-based word alignment in statistical translation * Liang et al., Alignment by Agreement
Feb 14	Decoding: Predicting Translations [pdf] [keynote] [code]	Post	Koehn, chapter 6 * Germann et al., Fast Decoding and Optimal Decoding for Machine Translation
Feb 16	Decoding continued [pdf] [keynote] [live demo]	Post	* Knight, Decoding Complexity in Word-Replacement Translation Models
Feb 21	Phrase-based Models [pdf] [keynote]	Lopez	Koehn, sections 5.1-5.2 * Koehn et al., Statistical Phrase-Based Translation * Marcu and Wong, A Phrase-Based, Joint Probability Model for Statistical Machine Translation * DeNero et al., Sampling Alignment Structure under a Bayesian Translation Model
Feb 23	Evaluating Translation Systems [pdf] [keynote] Language in 10: German	Callison-Burch	Koehn, chapter 8 * Papineni et al., Bleu: a Method for Automatic Evaluation of Machine Translation * Callison-Burch et al., Re-Evaluating the Role of BLEU in Machine Translation Research
Feb 28	Feature-Based Models [pdf] [keynote]	Lopez	Koehn, chapter 9
Mar 01	Loss-Sensitive Training of Feature-Based Models	Lopez	* Och, Minimum Error Rate Training in Statistical Machine Translation * Hopkins and May, Tuning as Ranking
Mar 06	Weighted Automata [pdf] [keynote]	Lopez	* Mohri, Finite-State Transducers in Language and Speech Processing
Mar 08	Modeling Translation with Weighted Automata [pdf] [keynote]	Lopez	Knight and Al-Onaizan, Translation with Finite-State Devices * Kumar et al., A weighted finite state transducer translation template model for statistical machine translation
Mar 13	Syntax-based Translation Part 1: Reordering for Phrase-Based Translation [pdf] [keynote]	Callison-Burch	Collins et al., Clause Restructuring for Statistical Machine Translation Chiang, An Introduction to Synchronous Grammars * Wu, Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora
Mar 15	Syntax-Based Translation Part 2: Synchronous Grammars [pdf] [keynote]	Callison-Burch	Yamada and Knight, Syntax-based Statistical Machine Translation Heidi Fox, Phrasal Cohesion and Statistical Machine Translation * Galley et al., What’s in a Translation Rule?
Mar 27	Syntax-Based Decoding [pdf] [keynote]	Post	Chiang, Hierarchical Phrase-based Translation
Mar 29	Syntax-Based Decoding with Weighted Automata [pdf] [keynote]	Lopez	* Iglesias et al., Hierarchical Phrase-Based Translation with Weighted Finite State Transducers * Iglesias et al., Hierarchical Phrase-Based Translation Representations
Apr 03	Creative Data Collection: Crowdsourcing Translation [pdf] [keynote]	Callison-Burch	Zaidan and Callison-Burch, Crowdsourcing Translation: Professional Quality from Non-Professionals * Oard et al., Desperately Seeking Cebuano
Apr 05	More Data Collection: Harvesting Translations from the Web [pdf] [keynote]	Callison-Burch	Uszkoreit et al., Large Scale Parallel Document Mining for Machine Translation * Smith and Resnik, The Web As A Parallel Corpus * Venugopal et al., Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation
Apr 10	Guest Lecture: Translation with Comparable Corpora	Smith	* Munteanu and Marcu, Improving Machine Translation Performance by Exploiting Non-Parallel Corpora * Smith et al., Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment
Apr 12	Syntax-Based Language Models [pdf] [keynote]	Post	Chelba and Jelinek, Exploiting Syntactic Structure for Language Modeling Charniak et al., Syntax-based language models for statistical machine translation
Apr 17	Representing Huge Translation Models [pdf] [keynote]	Lopez	Callison-Burch et al., Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases Lopez, Hierarchical Phrase-Based Translation with Suffix Arrays
Apr 19	Applications of Word Alignment	Weese
Apr 24	Morphology and Translation [pdf] [keynote]	Post	Koehn and Knight, Empirical Methods for Compound Splitting Koehn and Hoang, Factored Translation Models
Apr 26	Representing Huge Language Models	Lopez	Brants et al., Large Language Models in Machine Translation Talbot and Brants, Randomized Language Models via Perfect Hash Functions
May 01	Paraphrasing [pdf] [keynote]	Callison-Burch	Callison-Burch et al., Improved Statistical Machine Translation Using Paraphrases
May 03	Course Wrap-up	All

Software

State-of-the-art translation algorithms are implemented in a number of open-source projects. The most popular of these are listed below. They are all actively maintained and have significant userbases. You are free to use and extend these tools (or others) in devising your final project.

Joshua: a translation toolkit for syntax-based translation, developed at Johns Hopkins (Java).
Moses: a widely-used toolkit implementing most major translation algorithms (C++).
cdec: a fast decoder for a variety of translation models (C++).
KenLM: a fast language-modeling toolkit, can be used with the above systems (C++).
SRI-LM: a widely-used language modeling toolkit with many features, used with the above systems (C++).
Giza++: a widely-used word alignment toolkit, originally developed at a Johns Hopkins summer workshop (C++).
Berkeley Aligner: a robust implementation of several innovative alignment algorithms (Java).

Data

Modern machine translation systems work by learning from large amounts of data. Many datasets are freely available. You should use whatever data is appropriate to the problem that you decide to work on for your project.

Machine Translation workshop 2011 shared task data, used in research evaluations (French-English, Spanish-English, Czech-English, Haitian Creole-English).
JRC-Acquis, legislative text of the European Union (22 European languages).
Europarl, proceedings of the European Parliament (22 European languages).
Canadian Hansards, proceedings of the Canadian Parliament (French and English).
OPUS is a collection of parallel corpora in a variety of languages and domains. Includes some interesting domains such as film subtitles.

Other Resources and Classes

The ACL Anthology archives papers published by the Association for Computational Linguistics, which covers a wide variety of topics in natural language processing. It includes many of the classic papers on machine translation.
The MT Archive holds historical and modern research papers on machine translation. There is some overlap with the ACL Anthology, but it is focused specifically on machine translation and also includes many papers from other venues.
Philipp Koehn maintains statmt.org, with pointers to various resources.
Jason Eisner's natural language processing class at JHU.
Mark Dredze's machine learning class at JHU.
Chris Callison-Burch taught a one-week machine translation course at ESSLLI 2005 with Philipp Koehn: 1, 2, 3, 4, 5.
Adam Lopez taught a one-week machine translation course at ESSLLI 2010.
Machine translation course at University of Southern California.
Machine translation course at University of Edinburgh.
Machine translation course at University of Washington.
Machine translation course at Carnegie Mellon University.
Machine translation course at Columbia University.
Machine translation course at Simon Fraser University.
A course in advanced topics in machine translation at Carnegie Mellon University runs concurrently with this class.