Machine Translation : 600.468 : Spring 2012

Instructors: Chris Callison-Burch, Adam Lopez, and Matt Post.

TAs: Jonny Weese and Juri Ganitkevitch.

Time and Place: Tuesdays and Thursdays from 3:00-4:15, Shaffer 202

Office Hours: Tuesdays and Thursdays from 4:15 (immediately after class) or by appointment.

  • We wrote a TACL paper about the outcome of our competitive homework assignments. Please cite this paper if you reuse the assignments in academic work.
  • Class is over! If you'd like to reuse any material on this site, you can get it on GitHub. It's licensed under a Creative Commons Attribution 3.0 Unported License, which means you're free to reuse it any way you like, as long as you acknowledge that you got it from us (Adam Lopez, Chris Callison-Burch, and Matt Post).

Level: Senior undergraduate or first-year graduate.

Course Catalog Description: Google Translate can instantly translate between any pair of over fifty human languages (for instance, from French to English). How does it do that? Why does it make the errors that it does? And how can you build something better? Modern translation systems like Google Translate and Bing Translator learn how to translate by reading millions of words of already translated text, and this course will show you how they work. The course covers a diverse set of fundamental building blocks from linguistics, machine learning, algorithms, data structures, and formal language theory, along with their application to a real and difficult problem in artificial intelligence.

Textbook: Some readings will be drawn from Statistical Machine Translation (errata) by Philipp Koehn. You can read it online through the JHU library or purchase it from Amazon. A more compact (but therefore less thorough) survey by Adam Lopez is available for free. Note that the readings (and therefore the textbook) aren't strictly required; what's required is that you understand the concepts and are able to apply them.

Goals: By the end of the course, you should have a good grasp of what goes into building a large-scale natural language processing system, and experience selecting and applying diverse techniques from computer science to solve real-world problems.

Requirements: You'll need strong programming skills for homeworks and a final project. Natural language processing (465) is recommended, but not required.

Course Structure and Grading

Homework (4 assignments, 10 points apiece)

Class meetings will consist mainly of lectures, but the only way to fully understand all of the problems that you need to solve in building a language processing system is to go out and build one. To that end, the course will be evaluated on four competitive homework assignments and a final project.

The goal of each homework assignment will be to build a system that solves a well-defined subproblem of machine translation. Students will earn a passing grade (7 points) by correctly implementing a standard algorithm that we specify, and additional credit for building the best system according to an objective metric: 6 points for the best system, 5 points for the second best, and so on. To receive an A in the class, you must compete! For each task we will provide a simple baseline, datasets, and metrics. The tasks are:

In-Class Presentation: Language in Ten Minutes (10 points)

How are you going to build a machine translation system unless you know at least a little bit about language? You will be required to give a short presentation (~10 minutes) on a particular language that you do not speak natively, e.g., Arabic, Chinese, Czech, Hindi, Italian, or Maltese.

You should prepare three to six slides for your presentation, covering language facts (demographics, location, etc.), important linguistic characteristics (orthography, morphology, syntax), and computational efforts such as resources, tools, and papers. For instance, how many entries about the language are there in the MT Archive, and what are they generally about? Be creative and have fun. Asking for help from native speakers or language experts is great, but you are ultimately responsible for the presentation.

This assignment was inspired by Nizar Habash. You might want to browse the examples from his class, and his list of recommended resources: Ethnologue, Omniglot, About World Languages, and the Machine Translation Archive.

Presentations will be graded on thoroughness and clarity. What did you learn from your research that was really interesting? Tell us!

Final Projects (40 points)

The final project will be designed by the student or groups of students, with guidance from the instructors. As with the homework assignments, it should be on a well-defined problem with clearly identified input, output, and evaluation, and executed with creativity and depth.

Towards the middle of the term you will be required to turn in a brief project proposal (10 points), laying out the problem, your proposed solution, and a plan for implementation and evaluation. Your final project report (20 points) should explain your implementation, evaluation, and analysis, focusing on a single question: What did you learn? The projects will be presented at an interactive poster session during the final exam period (10 points).

The project proposal should be 1-2 pages (there is no hard limit, but it will take us longer to give you feedback if your proposal is long or unclear) and must clearly identify:

  • A single question or problem related to machine translation. This should be stated in the first paragraph. We strongly advise including some simple examples to illustrate the question or problem.
  • An outline of the work to be done: how will your project answer the question or attempt to solve the problem? What models and algorithms will you implement? What software will you use?
  • A description of planned experiments: how will you know if the question was answered or the problem was solved? You should clearly identify input, output, and evaluation strategy.

The proposal is a contract. If we give you full credit for it, that means we expect you to implement it and do a good analysis of the results, and we will give you full credit for the entire project if you do. If you turn in a weak proposal, we will give you the opportunity to submit a revised one before moving forward, but the longer you take to define your project, the less time you will have to implement it, so it's in your best interest to take advantage of this early checkpoint.

Before the proposal is due, you should make an appointment with one of the instructors to discuss project ideas; this will enable you to submit a proposal with full confidence that it will be well-received. Before meeting with us, you might want to browse the topics that we'll be covering later in the term, since these might suggest ideas to you. We will, however, give you fairly wide latitude to choose a topic as long as it's related in some way to translation and is technically interesting, so you should not feel restricted to these topics. We can suggest topics to you in individual meetings if you're stumped, but it will help us to know what your interests and strengths are, so be prepared to tell us what you're curious about.

Group projects of any size are permitted, but we will require an amount of work that is linear in group size, so you should take into account the overhead of group coordination when forming groups. Each group should turn in a single proposal identifying all members. All group members will receive the same grade, and you are stuck with your group members once your proposal is finalized: we refuse to adjudicate stories about who did or did not contribute. Choose your partners carefully.

Quizzes (10 points)

These are mainly designed to help us understand how well you're following along.

Tentative Schedule

Subject to change as the term progresses.

Date   | Topics | Lecturer | Readings (* = graduate level)
Jan 31 |  | All |
Feb 02 |  | Lopez | Koehn, chapters 2 and 7
Feb 07 |  | Lopez |
Feb 09 |  | Lopez |
Feb 14 |  | Post |
Feb 16 |  | Post |
Feb 21 |  | Lopez |
Feb 23 |  | Callison-Burch |
Feb 28 |  | Lopez | Koehn, chapter 9
Mar 01 | Loss-Sensitive Training of Feature-Based Models | Lopez |
Mar 06 |  | Lopez |
Mar 08 |  | Lopez |
Mar 13 | Syntax-Based Translation Part 1: Reordering for Phrase-Based Translation [pdf] [keynote] | Callison-Burch |
Mar 15 | Syntax-Based Translation Part 2: Synchronous Grammars [pdf] [keynote] | Callison-Burch |
Mar 27 |  | Post |
Mar 29 |  | Lopez |
Apr 03 |  | Callison-Burch |
Apr 05 | More Data Collection: Harvesting Translations from the Web [pdf] [keynote] | Callison-Burch |
Apr 10 | Guest Lecture: Translation with Comparable Corpora | Smith |
Apr 12 |  | Post |
Apr 17 |  | Lopez |
Apr 19 | Applications of Word Alignment | Weese |
Apr 24 |  | Post |
Apr 26 | Representing Huge Language Models | Lopez |
May 01 |  | Callison-Burch |
May 03 | Course Wrap-up | All |

Software

State-of-the-art translation algorithms are implemented in a number of open-source projects. The most popular of these are listed below. They are all actively maintained and have significant user bases. You are free to use and extend these tools (or others) in devising your final project.
  • Joshua: a translation toolkit for syntax-based translation, developed at Johns Hopkins (Java).
  • Moses: a widely-used toolkit implementing most major translation algorithms (C++).
  • cdec: a fast decoder for a variety of translation models (C++).
  • KenLM: a fast language-modeling toolkit that can be used with the above systems (C++).
  • SRI-LM: a widely-used language modeling toolkit with many features, used with the above systems (C++).
  • Giza++: a widely-used word alignment toolkit, originally developed at a Johns Hopkins summer workshop (C++).
  • Berkeley Aligner: a robust implementation of several innovative alignment algorithms (Java).

Data

Modern machine translation systems work by learning from large amounts of data. Many datasets are freely available. You should use whatever data is appropriate to the problem that you decide to work on for your project.
  • Machine Translation workshop 2011 shared task data, used in research evaluations (French-English, Spanish-English, Czech-English, Haitian Creole-English).
  • JRC-Acquis, legislative text of the European Union (22 European languages).
  • Europarl, proceedings of the European Parliament (22 European languages).
  • Canadian Hansards, proceedings of the Canadian Parliament (French and English).
  • OPUS, a collection of parallel corpora in a variety of languages and domains, including some interesting ones such as film subtitles.

Other Resources and Classes