Source code for the EMNLP 2022 main conference long paper "Entropy-Based Vocabulary Substitution for Incremental Learning in Multilingual Neural Machine Translation"
In this work, we propose an entropy-based vocabulary substitution (EVS) method that only needs to process the new language pairs for incremental learning in large-scale multilingual data updates, while keeping the vocabulary size unchanged.
After obtaining the original vocabulary and the incremental vocabulary, you can run the scripts for vocabulary substitution in one of three modes (see the sketch after this list):
EVS (Ours)
frequency (choose the top-K words with the highest frequency)
combine (expansion: merge the original and incremental vocabularies)
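All three modes reduce to how candidate tokens from the incremental vocabulary are ranked and whether the total size is kept fixed. Below is a minimal Python sketch of that logic; the function name, data layout, and `budget` parameter are illustrative assumptions, not the repository's actual scripts:

```python
def substitute_vocab(orig_vocab, inc_scores, mode, vocab_size, budget):
    """orig_vocab: tokens ordered from most to least important.
    inc_scores: {token: score} over the incremental corpus -- raw frequency
    for the "frequency" mode, an entropy feature for "evs".
    budget: maximum number of new tokens to bring in."""
    if mode == "combine":
        # Expansion: union of both vocabularies; the size is NOT preserved.
        return list(dict.fromkeys(list(orig_vocab) + list(inc_scores)))
    if mode not in ("frequency", "evs"):
        raise ValueError(f"unknown mode: {mode}")
    known = set(orig_vocab)
    ranked = sorted(inc_scores, key=inc_scores.get, reverse=True)
    new_tokens = [t for t in ranked if t not in known][:budget]
    # Substitution: drop the least important original tokens to make room,
    # so the total vocabulary size stays fixed at vocab_size.
    return list(orig_vocab)[: vocab_size - len(new_tokens)] + new_tokens
```

In this sketch, the frequency and EVS modes share the same substitution step and differ only in the score used for ranking, which is the key design point of the method.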
(Optional) Model Training.
This system has been tested in the following environment:
Python version == 3.7
Pytorch version == 1.8.0
Fairseq version == 0.12.0 (pip install fairseq)
Note that this environment only affects the training of the original and incremental models; you can use any deep learning library of your choice for model training.
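If you do use Fairseq, a quick sanity check of the tested versions (this snippet only reads version strings and assumes the pip installation above):

```python
import sys
import torch
import fairseq

# Tested configuration: Python 3.7, PyTorch 1.8.0, Fairseq 0.12.0.
print("python :", sys.version.split()[0])
print("torch  :", torch.__version__)
print("fairseq:", fairseq.__version__)
```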
Incremental Learning
We build the incremental learning procedure for Multilingual Neural Machine Translation as follows:
Get the original multilingual translation models (or train a multilingual translation model by yourself). We will provide two MNMT models and training scripts for reproducibility.
Data URL: pending permission review
Model URL: pending permission review
Preprocessing incremental data
Data Clean (optional, if needed)
Get Vocabulary (follow standard BPE procedure)
Get Vocabulary Features (generate the incremental vocabulary with features; only for EVS). We will provide a vocabulary with features for the next stage, and you can also compute the features on your own dataset.
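As a reference for the feature-generation step, here is a minimal sketch, assuming the per-token feature is its contribution to the unigram entropy of the BPE-segmented incremental corpus, -p(t) log p(t). The exact feature definition used by EVS is given in the paper; the function name and tab-separated output format here are illustrative assumptions:

```python
import math
from collections import Counter

def vocab_with_features(corpus_path, out_path):
    """Write `token<TAB>count<TAB>feature` lines for the incremental corpus."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())  # corpus is already BPE-segmented
    total = sum(counts.values())
    with open(out_path, "w", encoding="utf-8") as out:
        for tok, c in counts.most_common():
            p = c / total
            feature = -p * math.log(p)  # unigram-entropy contribution
            out.write(f"{tok}\t{c}\t{feature:.6f}\n")
```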
For inference and evaluation, please refer to run_sh/inference.sh and run_sh/evaluate.sh.
Citation
@inproceedings{huang-etal-2022-entropy,
    title = "Entropy-Based Vocabulary Substitution for Incremental Learning in Multilingual Neural Machine Translation",
    author = "Huang, Kaiyu and
      Li, Peng and
      Ma, Jin and
      Liu, Yang",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    year = "2022",
    publisher = "Association for Computational Linguistics",
}