Machine Translation : 600.468 : Spring 2012

Instructors: Chris Callison-Burch, Adam Lopez, and Matt Post.

TAs: Jonny Weese and Juri Ganitkevitch.

Time and Place: Tuesdays and Thursdays from 3:00-4:15, Hackerman 320

Office Hours: Tuesdays and Thursdays from 4:15 (immediately after class) or by appointment.

Level: Senior undergraduate or first-year graduate.

Course Catalog Description: Google Translate can instantly translate between any pair of over fifty human languages (for instance, from French to English). How does it do that? Why does it make the errors that it does? And how can you build something better? Modern translation systems like Google Translate and Bing Translator learn how to translate by reading millions of words of already translated text, and this course will show you how they work. The course covers a diverse set of fundamental building blocks from linguistics, machine learning, algorithms, data structures, and formal language theory, along with their application to a real and difficult problem in artificial intelligence.

Textbook: Some readings will be drawn from Statistical Machine Translation (errata) by Philipp Koehn. You can read it online through the JHU library or purchase it from Amazon. A more compact (but therefore less thorough) survey by Adam Lopez is available for free. Note that the readings (and therefore the textbook) aren't strictly required; what's required is that you understand the concepts and are able to apply them.

Goals: By the end of the course, you should have a good grasp of what goes into building a large-scale natural language processing system, and experience selecting and applying diverse techniques from computer science to solve real-world problems.

Requirements: You'll need strong programming skills for homeworks and a final project. Natural language processing (465) is recommended, but not required.

Course Structure and Grading

Homework (4 assignments, 10 points apiece)

Class meetings will consist mainly of lectures, but the only way to fully understand all of the problems that you need to solve in building a language processing system is to go out and build one. To that end, you will be evaluated on four competitive homework assignments and a final project.

The goal of each homework assignment will be to build a system that solves a well-defined subproblem of machine translation. Students will earn a passing grade (7 points) by correctly implementing a standard algorithm that we specify, and additional credit for building the best system according to an objective metric: 6 points for the best system, 5 points for the second best, and so on. To receive an A in the class, you must compete! For each task we will provide a simple baseline, datasets, and metrics. The tasks are:

In-Class Presentation: Language in Ten Minutes (10 points)

How are you going to build a machine translation system unless you know at least a little bit about language? You will be required to give a short presentation (~10 minutes) on a particular language that you do not speak natively, e.g., Arabic, Chinese, Czech, Hindi, Italian, or Maltese.

You should prepare three to six slides for your presentation, covering language facts (demographics, location, etc.), important linguistic characteristics (orthography, morphology, syntax), and computational efforts such as resources, tools, and papers. For instance, how many entries are there about the language in the MT Archive, and what are they generally about? Be creative and have fun. Asking for help from native speakers or language experts is great, but you are ultimately responsible for the presentation.

This assignment was inspired by Nizar Habash. You might want to browse the examples from his class, and his list of recommended resources: Ethnologue, Omniglot, About World Languages, and the Machine Translation Archive.

Presentations will be graded on thoroughness and clarity. What did you learn from your research that was really interesting? Tell us!

Final Projects (40 points)

The final project will be designed by the student or groups of students, with guidance from the instructors. As with the homework assignments, it should be on a well-defined problem with clearly identified input, output, and evaluation, and executed with creativity and depth.

Towards the middle of the term you will be required to turn in a brief project proposal (10 points), laying out the problem, your proposed solution, and a plan for implementation and evaluation. Your final project report (20 points) should explain your implementation, evaluation, and analysis, focusing on a single question: What did you learn? The projects will be presented at an interactive poster session during the final exam period (10 points).

Quizzes (10 points)

These are mainly designed to help us understand how well you're following along.

Tentative Schedule

Subject to change as the term progresses.

Date Topics Lecturer Readings (starred readings are strongly recommended to graduate students)

Software

State-of-the-art translation algorithms are implemented in a number of open-source projects. The most popular of these are listed below. They are all actively maintained and have significant userbases. You are free to use and extend these tools (or others) in devising your final project.
  • Joshua: a translation toolkit for syntax-based translation, developed at Johns Hopkins (Java).
  • Moses: a widely-used toolkit implementing most major translation algorithms (C++).
  • cdec: a fast decoder for a variety of translation models (C++).
  • KenLM: a fast language-modeling toolkit, can be used with the above systems (C++).
  • SRI-LM: a widely-used language modeling toolkit with many features, used with the above systems (C++).
  • Giza++: a widely-used word alignment toolkit, originally developed at a Johns Hopkins summer workshop (C++).
  • Berkeley Aligner: a robust implementation of several innovative alignment algorithms (Java).

Data

Modern machine translation systems work by learning from large amounts of data. Many datasets are freely available. You should use whatever data is appropriate to the problem that you decide to work on for your project.
  • Machine Translation workshop 2011 shared task data, used in research evaluations (French-English, Spanish-English, Czech-English, Haitian Creole-English).
  • JRC-Acquis, legislative text of the European Union (22 European languages).
  • Europarl, proceedings of the European Parliament (22 European languages).
  • Canadian Hansards, proceedings of the Canadian Parliament (French and English).
  • OPUS is a collection of parallel corpora in a variety of languages and domains. Includes some interesting domains such as film subtitles.

Other Resources and Classes