Due: November 17th, 2017
Imagine it’s 2022 and you are training a German-English neural machine translation (NMT) system. Suddenly, a nuclear bomb is detonated in Los Angeles and causes a massive blackout. As a result, every neural machine translation codebase in the world is destroyed. Fortunately, you placed your data and the most recent model dump in this Google Drive folder, which survived the blackout.
Your challenge is to build a neural machine translation system with the best possible translation quality using these resources. To save computation, you can start training by loading the preserved model dump. Because this model has not been trained to convergence, you can improve over it by running a few more training iterations (normally called “continued training”) and beat the baseline. (To give you an idea of how large the data is, full training takes about 10 hours on one Tesla K80 GPU on Google Cloud Platform. The model dump you have has been trained for about 6 hours.)
However, to be able to run continued training, you need to restore the NMT codebase by implementing your own neural machine translation model. The model we are looking for you to implement is an attention-based neural machine translation model, as described by (Bahdanau et al. 2015) and (Luong et al. 2015) as well as (Koehn 2017). You will then run continued training starting from your model dump, test your final model by translating the test set, and submit the translation to the leaderboard submission site (don’t forget to run post-processing before you submit; see the Setup section). To make your life easier, you don’t have to implement beam search to beat the baseline model. After that, you may earn full credit by implementing at least one more improvement over the baseline attention-based neural machine translation model. Here are several ideas:
No part of this homework, including the baseline and the improvements, is subject to limits on the usage of deep learning frameworks (of course, you cannot use a module that implements the seq2seq model for you).
Neural machine translation has been a very active research field in recent years and new ideas pop up every day, so be creative and don’t be constrained by the list above! If your idea does not fit into the two-week homework span, you can consider making it your final project. Note that some of these ideas may require you to train part or all of the model from scratch, so they may take longer than the original continued training.
Pull or check out the starter code. Download the necessary data files from here.
As you saw in homework 4, you need to preprocess the training data into lists of word indexes. The only difference for this homework is that you need to do this separately for German and English.
python preprocess.py --train_file trn.de --dev_file dev.de --test_file devtest.de --vocab_file model.src.vocab --data_file hw5.de
python preprocess.py --train_file trn.en --dev_file dev.en --test_file devtest.fake.en --vocab_file model.trg.vocab --data_file hw5.en --charniak
The format of the data is the same as in the last homework. The data dump on each side of the parallel corpus contains a tuple (train_data, dev_data, test_data, vocab), where each data split is a list of torch tensors of size (sent_len,) and the vocabulary is an instance of torchtext.vocab.Vocab.
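To make the data layout concrete, here is a minimal sketch that builds a toy list of sentence tensors in the shape described above and pads them into one minibatch. The toy sentences, the helper name make_batch, and the padding index 0 are assumptions for illustration only; the real padding index should be looked up in the vocabulary, and the starter code may batch differently.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy stand-in for train_data on one side of the corpus: each sentence is a
# 1-D LongTensor of word indexes with shape (sent_len,).
train_data = [
    torch.tensor([4, 9, 2]),
    torch.tensor([7, 3]),
    torch.tensor([5, 8, 6, 2]),
]

PAD_IDX = 0  # assumption: replace with the padding index from the real vocab


def make_batch(sentences, pad_idx=PAD_IDX):
    """Pad variable-length sentences into a (max_len, batch_size) tensor."""
    lengths = torch.tensor([len(s) for s in sentences])
    batch = pad_sequence(sentences, padding_value=pad_idx)
    return batch, lengths


batch, lengths = make_batch(train_data)
```

Keeping the true lengths next to the padded batch is useful later, for example to mask padded source positions in the attention layer.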
Starter code (train.py) has been provided for you; it is pretty similar to what you saw in the last homework.
Because Google App Engine does not run bash scripts, you’ll also have to run the post-processing script yourself before you submit to the leaderboard. Here is how you do it:
./postprocess.sh < [system output] > [post-processed output]
The file model.param is a Python dictionary that contains the model dump that’s been trained for 7 epochs on the data provided to you. Although the keys should be self-explanatory, here is a brief description of what they are:
encoder.embeddings.emb_luts.0.weight torch.Size([src_vocab_size, src_word_emb_size] = [36616, 300]): the source word embedding
decoder.embeddings.emb_luts.0.weight torch.Size([trg_vocab_size, trg_word_emb_size] = [23262, 300]): the target word embedding
encoder.rnn.weight_ih_l0 torch.Size([4 * encoder_hidden_size, src_word_emb_size] = [2048, 300]): the input connection to the gates of the LSTM, see here for how the weights are arranged
encoder.rnn.weight_hh_l0 torch.Size([4 * encoder_hidden_size, encoder_hidden_size] = [2048, 512]): the hidden connection to the gates of the LSTM, see here for how the weights are arranged
encoder.rnn.bias_ih_l0 torch.Size([4 * encoder_hidden_size] = [2048]): bias term for the input connections, same arrangement as above
encoder.rnn.bias_hh_l0 torch.Size([4 * encoder_hidden_size] = [2048]): bias term for the hidden connections, same arrangement as above
encoder.rnn.weight_ih_l0_reverse torch.Size([2048, 300])
encoder.rnn.weight_hh_l0_reverse torch.Size([2048, 512])
encoder.rnn.bias_ih_l0_reverse torch.Size([2048])
encoder.rnn.bias_hh_l0_reverse torch.Size([2048])
The decoder is yet another LSTM, like the encoder. However, a slight difference is that the input to the decoder LSTM is the concatenation of the context vector and the target side word embedding of the previous output word. The context vector is the output of the attention layer, which you’ll learn about in the subsequent sections.
decoder.rnn.layers.0.weight_ih torch.Size([4 * decoder_hidden_size, trg_word_emb_size + context_vector_size] = [4096, 1324])
decoder.rnn.layers.0.weight_hh torch.Size([4 * decoder_hidden_size, decoder_hidden_size] = [4096, 1024])
decoder.rnn.layers.0.bias_ih torch.Size([4 * decoder_hidden_size] = [4096])
decoder.rnn.layers.0.bias_hh torch.Size([4 * decoder_hidden_size] = [4096])
The generator is the mapping from the decoder hidden state to the vocabulary distribution. For the baseline model and training, it’s just an affine transformation (use nn.Linear). But if you want to do beam search, expect it to be more complicated.
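One decoder step and the generator can be sketched as follows. This is an illustration under stated assumptions, not the original codebase: the concatenation order (embedding first, then context vector) is only suggested by the way the input size is written (trg_word_emb_size + context_vector_size), the vocabulary is shrunk to a toy size, the name decoder_step is hypothetical, and the log-softmax is added here to turn the affine scores into a distribution.

```python
import torch
import torch.nn as nn

TRG_EMB, CTX, DEC_HID = 300, 1024, 1024
TRG_VOCAB = 50  # toy size for illustration; the dump uses 23262

embedding = nn.Embedding(TRG_VOCAB, TRG_EMB)
# LSTMCell stores weight_ih with shape [4 * hidden, input] = [4096, 1324].
cell = nn.LSTMCell(TRG_EMB + CTX, DEC_HID)
# Wrapping the affine layer in nn.Sequential yields the state-dict keys
# "0.weight" and "0.bias", the same key names the dump uses for the generator.
generator = nn.Sequential(nn.Linear(DEC_HID, TRG_VOCAB))


def decoder_step(prev_word, context, state):
    """prev_word: (batch,), context: (batch, CTX), state: (h, c) tensors."""
    emb = embedding(prev_word)                    # (batch, TRG_EMB)
    # Assumption: embedding first, then context vector.
    lstm_in = torch.cat([emb, context], dim=1)    # (batch, TRG_EMB + CTX)
    h, c = cell(lstm_in, state)
    log_probs = torch.log_softmax(generator(h), dim=-1)
    return log_probs, (h, c)


batch_size = 2
state = (torch.zeros(batch_size, DEC_HID), torch.zeros(batch_size, DEC_HID))
log_probs, state = decoder_step(
    torch.tensor([1, 2]), torch.zeros(batch_size, CTX), state
)
```

If the loaded baseline produces nonsense with this concatenation order, swapping the two halves of the input is the first thing to check.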
0.weight torch.Size([trg_vocab_size, decoder_hidden_size] = [23262, 1024])
0.bias torch.Size([trg_vocab_size] = [23262])
The global general attention described in (Luong et al. 2015) was used in the model, which is somewhat more complicated than the attention described in (Bahdanau et al. 2015). Here is the basic idea.
You should know from the lecture that the Bahdanau attention constructs a summary of the source side information by carrying out a weighted sum over the source side encodings (not word embeddings!), with the weights determined by the decoder hidden state \(h_{t-1}\) at time step \(t-1\) and the source side encoding \(h_s\) of word \(s\). The Luong attention, while still maintaining this weighted sum mechanism, carries out an extra transformation after it. Below we will call the weighted sum of the source side encodings \(\tilde{s_t}\) and the final summary of the source side information \(c_t\), or context vector. In both, \(t\) refers to time step \(t\).
Here is how \(\tilde{s_t}\) is computed.
\[\tilde{s_t} = \sum_{s=0}^{\mid S\mid} a(h_s, h_{t-1}) * h_s\]
\[a(h_s, h_{t-1}) = \dfrac{\exp(score(h_s, h_{t-1}))}{\sum_{s'=0}^{\mid S\mid} \exp(score(h_{s'}, h_{t-1}))}\]
Note that each \(h_s\) is the concatenation of the forward and backward encodings, so its dimension is 2 * encoder_hidden_size = 1024.
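As a small worked example of these two formulas, here is a plain-Python sketch with toy 2-dimensional encodings (the real ones have 1024 dimensions). The function names attention_weights and weighted_sum are illustrative and not part of the starter code, and the scores are simply given as numbers because the score function has not been defined yet at this point.

```python
import math


def attention_weights(scores):
    """Softmax over source positions: one weight a(h_s, h_{t-1}) per word s."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, encodings):
    """Weighted sum of the source encodings, one dimension at a time."""
    dim = len(encodings[0])
    return [
        sum(w * h[d] for w, h in zip(weights, encodings)) for d in range(dim)
    ]


# Three source words with toy 2-dimensional encodings h_s.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform weights
weights = attention_weights(scores)
summary = weighted_sum(weights, encodings)
```

With equal scores every word gets weight 1/3, so the summary is simply the average of the three encodings; raising one score shifts the summary toward that word's encoding.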
This weighted sum \(\tilde{s_t}\) is then combined with the previous decoder hidden state \(h_{t-1}\) again to construct the context vector \(c_t\) at time step \(t\).
\[c_t = tanh(W_o [\tilde{s_t}; h_{t-1}])\]
where the semicolon denotes concatenation. We are using a decoder hidden state size of 1024 and a context vector size of 1024, so the input of this linear transformation has dimension 2 * encoder_hidden_size + decoder_hidden_size = 2048, while the output has dimension context_vector_size = 1024. The question now is: how do we compute \(score(h_s, h_{t-1})\)? The global attention calculates it in the following way:
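For reference, the “general” score in Luong et al. (2015) is a bilinear form between the decoder state and a source encoding. Written with this document's symbols, and assuming the matrix \(W_i\) is applied to the decoder state, it reads \(score(h_s, h_{t-1}) = h_s^\top W_i h_{t-1}\). A minimal PyTorch sketch of the whole attention step under that assumption follows. The class name, argument names, and the optional padding mask are illustrative; the absence of bias terms is inferred from the dump listing only weights for the attention layer; and whether \(W_i\) or its transpose is the right orientation should be verified by checking that the loaded baseline produces sensible translations.

```python
import torch
import torch.nn as nn


class GlobalGeneralAttention(nn.Module):
    """Luong-style global attention with the 'general' (bilinear) score."""

    def __init__(self, enc_dim=1024, dec_dim=1024, ctx_dim=1024):
        super().__init__()
        # W_i, weight shape [enc_dim, dec_dim] = [1024, 1024]
        self.linear_in = nn.Linear(dec_dim, enc_dim, bias=False)
        # W_o, weight shape [ctx_dim, enc_dim + dec_dim] = [1024, 2048]
        self.linear_out = nn.Linear(enc_dim + dec_dim, ctx_dim, bias=False)

    def forward(self, h_prev, enc_states, src_mask=None):
        # h_prev:     (batch, dec_dim)           previous decoder hidden state
        # enc_states: (batch, src_len, enc_dim)  source encodings h_s
        # src_mask:   (batch, src_len) bool, True at real (non-padding) words
        query = self.linear_in(h_prev)                        # W_i h_{t-1}
        scores = torch.bmm(enc_states, query.unsqueeze(2)).squeeze(2)
        if src_mask is not None:
            # Padded positions get zero weight after the softmax.
            scores = scores.masked_fill(~src_mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)                # (batch, src_len)
        summary = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        # c_t = tanh(W_o [summary; h_{t-1}])
        context = torch.tanh(
            self.linear_out(torch.cat([summary, h_prev], dim=1))
        )
        return context, weights


attn = GlobalGeneralAttention()
h_prev = torch.randn(2, 1024)
enc_states = torch.randn(2, 5, 1024)
src_mask = torch.tensor([[True] * 5, [True] * 3 + [False] * 2])
context, weights = attn(h_prev, enc_states, src_mask)
```

The two nn.Linear layers are named linear_in and linear_out so that their weight tensors line up by shape with the attention weights stored in the dump.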
Hence, you need the following two weights to implement global attention.
decoder.attn.linear_in.weight torch.Size([1024, 1024]): \(W_i\)
decoder.attn.linear_out.weight torch.Size([1024, 2048]): \(W_o\)
To avoid further nuclear attacks, your supervisor from LAPD has instructed you to use Google Cloud to run GPU training. Please read through this note so you know how to use the Google Cloud GPU host.
Please keep in mind that you should use your GPU hours wisely since the $50 credit will only give you ~60 hours of GPU usage. Always debug on CPU before running your program on GPU!
We understand people may have different preferences for deep learning frameworks, but as new frameworks come out every day, it is not possible for us as instructors and TAs to cover all of them. As of 2017, we use PyTorch for all the starter code and are only able to help if you are using PyTorch. PyTorch is quickly gaining popularity in the NLP/MT/ML research community and is, from our perspective, relatively easy to pick up as a beginner.
If you prefer using another framework, we welcome contributions of your starter code (and you’ll get credit as a contributor to a homework that is likely to be used in the coming years), but again, we are not able to help if you run into problems with the framework.
Your choice of framework will not affect the grade of your homework.
Your translation of the whole test set, uploaded to the leaderboard submission site. You can upload new output as often as you like, up until the assignment deadline.
*Credits: This assignment was developed by Shuoyang Ding. The idea of blackout was borrowed from the anime short Blade Runner: Blackout 2022.