The traditional formulations of the main problems in machine translation, from alignment to model extraction to decoding and through evaluation, ignore the linguistic phenomenon of morphology, instead treating all words as distinct atoms. This misses out on a number of generalizations; for example, in alignment, it could be useful to accumulate evidence across the various inflections of a verb such as walk, since walk, walked, walks, and walking all likely have related and overlapping translations.
There are two types of morphology: inflectional morphology studies how words change to reflect grammatical properties, roles, and other information, while derivational morphology describes how words change as they are adapted to different parts of speech. Of these, inflectional morphology is the more important modeling omission in natural language generation tasks like machine translation, because choosing the right form of a word is necessary to produce grammatical output.
The inflectional morphology of English is simple. It is mostly limited to verbs and pronouns, which reflect only a subset of person, number, and one of two cases, and the forms overlap among the possible combinations. We can translate into English fairly well without bothering with morphology (an auspicious fact for the development of the field).
However, this is not the case for many of the world’s languages. Languages such as Russian, Turkish, and Finnish have complex case systems that can produce hundreds of surface variations of a single lemma. The vast number of potential word forms creates data sparsity, an issue that is exacerbated by the fact that morphologically complex languages are often the ones without much in the way of parallel data.
In this assignment, you will gain an appreciation for the difficulties posed by morphology. The setting is simple: you are presented with a sequence of Czech lemmas, and your task is to choose the correct inflected form for each of them. You can imagine this as a translation task itself, except with no reordering and with a bijection between the source and target words. To support you in this task, you are provided with a parallel training corpus containing sentence pairs in both reduced and inflected forms, and a default solution that chooses the most probable form for each lemma.
Start by cloning the assignment repo:
git clone https://github.com/mjpost/inflect
Change to the inflect directory, and type the following to create symlinks to the training and development data (and please observe the warning that began this section):
cd inflect
bash scripts/link_data.sh
You will then find three sets of parallel files under data: training data (for building models), development test data (for testing your model), and held-out test data (for submitting to the leaderboard). Sentences are parallel at the line level, and the words on each line also correspond exactly across files. The parallel files have the prefixes train, dtest, and etest, and the following suffixes:
*.lemma contains the lemmatized version of the data. Each lemma can be inflected to one or more fully inflected forms (that may or may not share the same surface form).
*.tag contains a two-character sequence denoting each word’s part of speech.
*.tree contains dependency trees, which organize the words into a tree with words generating their arguments. The tree format is described below.
*.form contains the fully inflected form. Note that we provide dev.form to you (the grading script needs it), but you should not look at it or build models over it. test.form is kept hidden.
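Because the files are parallel at both the line and the word level, they can be read in lockstep. The following is a minimal sketch of that idea, not part of the provided scripts: the function name read_parallel is hypothetical, and it works on any iterables of lines. With the real data it could be called on open file handles, for example open("data/train.lemma", encoding="utf-8") and open("data/train.form", encoding="utf-8"), where the UTF-8 encoding is an assumption.

```python
def read_parallel(*line_sources):
    """Yield one tuple of token lists per sentence, one list per source.

    Each source is any iterable of lines (an open file, a list of strings).
    This helper is a hypothetical illustration, not a provided script.
    """
    for line_number, lines in enumerate(zip(*line_sources), start=1):
        columns = tuple(line.split() for line in lines)
        # Words correspond exactly across files, so token counts must agree.
        if len({len(column) for column in columns}) > 1:
            raise ValueError(f"line {line_number}: token counts differ")
        yield columns


# In-memory stand-ins for a .lemma file and a .form file.
lemma_lines = ["třikrát`3 rychlý než-2 slovo"]
form_lines = ["Třikrát rychlejší než slovo"]

for lemmas, forms in read_parallel(lemma_lines, form_lines):
    pairs = list(zip(lemmas, forms))  # [(lemma, form), ...] for one sentence
```

Checking the token counts up front turns any misalignment into an immediate error rather than a silently shifted lemma-to-form pairing.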
You should use the development data (dtest) to test your approaches (make sure you don’t use the answers except in the grader). When you have something that works, you should run it on the test data (etest) and submit that output. The scripts/ subdirectory contains a number of scripts, including a grader and a default implementation that simply chooses the most likely inflection for each word:
# Baseline: no inflection
cat data/dtest.lemma | ./scripts/grade
# Choose the most likely inflection
cat data/dtest.lemma | ./scripts/inflect | ./scripts/grade
The evaluation method is accuracy: what percentage of the correct inflections did you choose?
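As a rough illustration of the metric, the sketch below computes token-level accuracy over parallel lists of sentences. It is an assumption about how such a measure could be computed, not a copy of scripts/grade, which may differ in details such as case handling.

```python
def accuracy(predicted_sentences, reference_sentences):
    """Fraction of tokens whose predicted form equals the reference form."""
    correct = 0
    total = 0
    for predicted, reference in zip(predicted_sentences, reference_sentences):
        for predicted_form, reference_form in zip(predicted, reference):
            if predicted_form == reference_form:
                correct += 1
            total += 1
    # Guard against empty input instead of dividing by zero.
    return correct / total if total else 0.0


# One sentence in which the second word was left uninflected.
predicted = [["Třikrát", "rychlý", "než", "slovo"]]
reference = [["Třikrát", "rychlejší", "než", "slovo"]]
score = accuracy(predicted, reference)  # 3 of 4 tokens match
```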
Your challenge is to improve the accuracy of the inflector as much as possible. The provided implementation simply chooses the most frequent inflection computed from the lemma alone (with statistics gathered from the training data).
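The idea behind this baseline can be sketched as follows. This is not the code in scripts/inflect; the function names are hypothetical, and falling back to the lemma itself for unseen lemmas is an assumption made for the illustration.

```python
from collections import Counter, defaultdict


def train_unigram(lemma_sentences, form_sentences):
    """Count how often each lemma is realized as each inflected form."""
    counts = defaultdict(Counter)
    for lemmas, forms in zip(lemma_sentences, form_sentences):
        for lemma, form in zip(lemmas, forms):
            counts[lemma][form] += 1
    return counts


def inflect_unigram(counts, lemmas):
    """Pick the most frequent form per lemma, ignoring all context."""
    output = []
    for lemma in lemmas:
        if lemma in counts:
            output.append(counts[lemma].most_common(1)[0][0])
        else:
            output.append(lemma)  # assumption: copy unseen lemmas unchanged
    return output


# Toy training data: "slovo" is seen once as "slovo" and twice as "slova".
counts = train_unigram([["slovo", "slovo", "slovo"]],
                       [["slovo", "slova", "slova"]])
chosen = inflect_unigram(counts, ["slovo", "neznámý"])
```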
For a passing grade, it is sufficient to implement a bigram language model of some form (conditioned on the previous word or lemma). However, as described above, we have provided plenty more information to you that should permit much subtler approaches. Here are some suggestions:
Obviously, you should feel free to pursue other ideas. Morphology for machine translation is an understudied problem, so it’s possible you could come up with an idea that people have not tried before!
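One simple reading of the bigram requirement mentioned above is to condition each choice on the previous lemma and back off to the lemma-only statistics when the pair was never seen. The sketch below takes that reading; the names, the "<s>" start marker, and the back-off order are all assumptions. A model conditioned on the previous inflected word instead would need a search over form sequences, which this sketch does not attempt.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for the start of a sentence


def train_bigram(lemma_sentences, form_sentences):
    """Collect form counts keyed by lemma and by (previous lemma, lemma)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for lemmas, forms in zip(lemma_sentences, form_sentences):
        previous = START
        for lemma, form in zip(lemmas, forms):
            unigram[lemma][form] += 1
            bigram[(previous, lemma)][form] += 1
            previous = lemma
    return unigram, bigram


def inflect_bigram(unigram, bigram, lemmas):
    """Choose each form from the most specific context seen in training."""
    output = []
    previous = START
    for lemma in lemmas:
        if (previous, lemma) in bigram:
            form = bigram[(previous, lemma)].most_common(1)[0][0]
        elif lemma in unigram:
            form = unigram[lemma].most_common(1)[0][0]  # back off to lemma only
        else:
            form = lemma  # assumption: copy unseen lemmas unchanged
        output.append(form)
        previous = lemma
    return output


# Toy data: the preposition determines the case of the following noun.
unigram, bigram = train_bigram(
    [["v", "dům"], ["z", "dům"], ["z", "dům"]],
    [["v", "domě"], ["z", "domu"], ["z", "domu"]],
)
after_v = inflect_bigram(unigram, bigram, ["v", "dům"])
alone = inflect_bigram(unigram, bigram, ["dům"])
```

In the toy data the lemma-only model would always choose domu, while conditioning on the preceding preposition recovers domě after v.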
The .pos and .tree files contain parts of speech and dependency trees for each sentence. Information about the part-of-speech tags can be found here.
Dependency trees are represented as follows. The tokens on each line correspond to the words they share an index with, and contain two pieces of information, depicted as PARENT/LABEL. PARENT is the index of the word’s parent word, and LABEL is the label of the edge implicit between those indices. Parent index 0 represents the root of the tree. Each child selects its parent, but the edge direction is from parent to child.
For example, consider the following lines, from the lemma, POS, tree, and word files (plus an English gloss), respectively:
třikrát`3 rychlý než-2 slovo
Cv AA J, NN
2/Adv 0/ExD 2/AuxC 3/ExD
Třikrát rychlejší než slovo
Three-times faster than-the-word
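Line 3 here corresponds to the following dependency tree: rychlý (word 2) is the root, attached with label ExD; třikrát (word 1) and než (word 3) are its children, with labels Adv and AuxC respectively; and slovo (word 4) is the child of než, with label ExD.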
To avoid duplicated work, a class is provided to you that will read the dependency structure for you, providing direct access to each word’s head and children (if any), along with the labels of these edges. Example usage can be found in scripts/inflect-tree. For a list of analytical functions (the edge labels), see this document.
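To make the PARENT/LABEL format concrete, the following sketch parses a tree line by hand. It is only an illustration with hypothetical function names; the provided class demonstrated in scripts/inflect-tree is the intended way to read the trees, and its interface is not reproduced here. Word indices are taken to start at 1, since parent index 0 denotes the root.

```python
def parse_tree_line(line):
    """Return one (parent_index, label) pair per word, in word order."""
    edges = []
    for token in line.split():
        parent, label = token.split("/", 1)
        edges.append((int(parent), label))
    return edges


def children_by_parent(edges):
    """Map each parent index to its (child_index, label) pairs."""
    children = {}
    for child_index, (parent, label) in enumerate(edges, start=1):
        children.setdefault(parent, []).append((child_index, label))
    return children


# The tree line from the example in this section.
edges = parse_tree_line("2/Adv 0/ExD 2/AuxC 3/ExD")
children = children_by_parent(edges)
root_children = children[0]  # words attached directly to the root
```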
Submit the output of your inflector on both the dev and test sets (data/dtest.lemma and data/etest.lemma, concatenated together) by uploading it to the leaderboard submission site according to the Assignment 0 instructions. You can upload new output as often as you like, up until the assignment deadline.
Your output file should have 8,714 lines.
Credits: This assignment was designed for this course by Matt Post. The data used in the assignment comes from the Prague Dependency Treebank v2.0.