Machine Translation is an important task in Natural Language Processing. It has a lot of potential: it helps solve communication problems in business, tourism, education, health, and other fields.
For NLP enthusiasts like me, Machine Translation is a very interesting topic, so I took on this task to find out what happens inside it. In this project, I use Neural Machine Translation (NMT).
I use the IWSLT'15 English-Vietnamese dataset. This is a small dataset consisting of 133K training sentence pairs, more than 1,500 sentence pairs in the 2012 test set, and about 1,300 sentence pairs in the 2013 test set. In my opinion, this dataset is not high quality.
I wrote this source code based on the paper "Stanford Neural Machine Translation Systems for Spoken Language Domains".
The model I use is a Seq2Seq model combined with one Attention layer. For the Attention mechanism, I chose Luong's Attention. Model details:
- Encoder: 2-layer LSTMs of 512 units, bidirectional (i.e., 1 bidirectional layer for the encoder), with a dropout keep_prob of 0.8 (a dropout rate of 0.2). The outputs of the two layers are combined with the 'sum' method.
- Decoder: 2-layer stacked LSTMs of 512 units (i.e., the output of the previous layer is the input of the next layer), with a dropout keep_prob of 0.8.
- Attention layer: There are many Attention mechanisms, but I use Luong's Attention because it is easy to implement, easy to understand and adjust, and adapts to different types of Encoder.
- Pre-processing: these steps are implemented in vocabulary.py, language.py, lang_utils, and dataloader.py.
- Post-processing: these steps are implemented in inference.py.
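
To make the Encoder and Attention bullets above more concrete, here is a minimal, framework-free sketch. It is not taken from this repository. It assumes that the 'sum' method means an element-wise sum of the forward and backward encoder states, and that the attention score is Luong's "dot" variant (the text does not say which variant is used). The function names are hypothetical.

```python
import math


def sum_directions(forward, backward):
    """Element-wise sum of forward and backward states at each time step.

    Assumption: this is one reading of the 'sum' method in the Encoder bullet.
    """
    return [[f + b for f, b in zip(fs, bs)] for fs, bs in zip(forward, backward)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(decoder_state, encoder_states):
    """Global attention with the 'dot' score: score(h_t, h_s) = h_t . h_s.

    Returns the attention weights over source positions and the context
    vector (the weighted sum of encoder states).
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return weights, context


# Toy example: two source positions, hidden size 2.
encoder_states = sum_directions(
    [[1.0, 0.0], [0.0, 1.0]],  # forward states
    [[0.0, 0.0], [0.0, 0.0]],  # backward states
)
weights, context = luong_dot_attention([1.0, 0.0], encoder_states)
```

In Luong's formulation, the context vector is then concatenated with the decoder state and passed through a tanh layer to produce the attentional hidden state; that last step is omitted here for brevity.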
I train the model for 20 epochs with an initial learning rate of 0.0008. After the first 10 epochs, the learning rate is halved after each epoch. The optimizer is Adam, and gradients are clipped to a max norm of 1.0. The training code is in loss.py, custom_lr.py, train_utils.py, and train.py.
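
The schedule and clipping described above can be sketched in plain Python as follows. This is only an illustration, not the code in custom_lr.py or train.py. It assumes epochs are counted from 1, that epochs 1 to 10 use the initial rate, and that the first halving takes effect in epoch 11. The function names are hypothetical.

```python
import math

INITIAL_LR = 0.0008   # initial learning rate from the text
HOLD_EPOCHS = 10      # epochs trained at the initial rate (assumed reading)
MAX_GRAD_NORM = 1.0   # max norm used for gradient clipping


def learning_rate(epoch):
    """Learning rate for a 1-indexed epoch: constant, then halved every epoch."""
    if epoch < 1:
        raise ValueError("epoch is 1-indexed")
    halvings = max(0, epoch - HOLD_EPOCHS)
    return INITIAL_LR * 0.5 ** halvings


def clip_by_norm(grads, max_norm=MAX_GRAD_NORM):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Learning rates for all 20 epochs, and one clipped gradient vector.
schedule = [learning_rate(e) for e in range(1, 21)]
clipped = clip_by_norm([3.0, 4.0])  # norm 5.0 is rescaled to norm 1.0
```

Under this reading, the learning rate in the final epoch is 0.0008 / 2^10, which is very small; a different starting epoch for the decay would shift the whole schedule by one step.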
To evaluate an NMT model, I use the BLEU score. Here is the quality evaluation scale of NMT based on the BLEU score:
| BLEU Score | Interpretation |
|---|---|
| < 10 | Almost useless |
| 10-19 | Hard to get the gist |
| 20-29 | The gist is clear, but has significant grammatical errors |
| 30-40 | Understandable to good translations |
| 40-50 | High quality translations |
| 50-60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
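
For reference, here is a minimal sketch of how a corpus-level BLEU score on the 0-100 scale of the table above can be computed. It is not the evaluation code of this repository. It assumes one reference per sentence, pre-tokenized input, n-grams up to length 4, and no smoothing, so its numbers can differ from those of standard evaluation tools. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


reference = "the cat sat on the mat".split()
perfect = corpus_bleu([reference], [reference])
too_short = corpus_bleu(["the cat sat on".split()], [reference])
```

In the toy example, the shortened hypothesis matches every n-gram it contains but is still penalized by the brevity penalty, which is why length matters as much as n-gram overlap.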
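My model's BLEU scores on the 2012 and 2013 test sets are 22.21 and 24.15, respectively. Both fall in the 20-29 band of the scale above. Several factors likely explain why the scores are not higher: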
- Data quality is not high: as mentioned, the IWSLT'15 dataset is not high quality for several reasons, including the small number of sentences and uneven translation quality.
- Language complexity: Vietnamese is a complex language with many distinctive features in vocabulary usage, grammar, and sentence structure.
- Limited compute resources: GPU RAM is low. During preprocessing, sentences longer than the configured max_len must therefore be truncated, which loses information and coherence within the sentence.
- Model architecture: the Seq2Seq model often struggles with long-distance relationships between words in a sentence, handles very long sentences poorly, and has difficulty aligning input and output.
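
The truncation mentioned in the point about limited resources can be illustrated with a short sketch. It is a hypothetical example, not the code in dataloader.py: the function names and the sample max_len value are assumptions.

```python
def truncate(tokens, max_len):
    """Keep only the first max_len tokens of a sentence."""
    if max_len < 0:
        raise ValueError("max_len must be non-negative")
    return tokens[:max_len]


def truncated_fraction(sentences, max_len):
    """Fraction of sentences that lose at least one token to truncation."""
    if not sentences:
        return 0.0
    too_long = sum(1 for tokens in sentences if len(tokens) > max_len)
    return too_long / len(sentences)


corpus = [
    "I like tea".split(),
    "this sentence is longer than the chosen limit".split(),
]
shortened = [truncate(tokens, 5) for tokens in corpus]
fraction = truncated_fraction(corpus, 5)
```

Measuring the truncated fraction for a given max_len is one way to estimate how much of the dataset is affected before trading sentence length against GPU memory.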
Currently, many models give very good results on Machine Translation tasks, such as the Transformer, GPT-2, GPT-3, and T5. But for an NLP enthusiast like me, understanding and practicing with older models like Seq2Seq greatly improves my knowledge of this task. That is why this is the first model I chose for Machine Translation.
Thank you for finding this project! 😊