ElanMT
ElanMT is a Japanese-to-English translation model developed by the ELAN MITSUA Project / Abstract Engine. Despite being trained with relatively limited resources, it achieves comparable performance to existing open translation models, thanks to back-translation and a newly built CC0 corpus.
Quick Start
Installation
First, install the necessary Python packages:

```bash
pip install transformers accelerate sentencepiece
```

- This model is verified on `transformers==4.40.2`.
Running the Model
```python
from transformers import pipeline

translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
# input: "Hello. I am an AI." -- returns a list like [{'translation_text': ...}]
translator('こんにちは。私はAIです。')
```
Handling Longer Texts
For longer inputs consisting of multiple sentences, using pySBD for sentence segmentation is recommended. Install it first:

```bash
pip install transformers accelerate sentencepiece pysbd
```
Then run the following code:
```python
import pysbd

# split the input into sentences before passing it to the translator
seg = pysbd.Segmenter(language="ja", clean=False)
txt = 'こんにちは。私はAIです。お元気ですか?'  # "Hello. I am an AI. How are you?"
print(translator(seg.segment(txt)))
```
This idea is from the FuguMT repo.
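Since the pipeline returns one result per segment, the per-sentence outputs can be stitched back together. A minimal sketch, reusing the `translator`, `seg`, and `txt` objects defined above:

```python
# translate each sentence, then join the results into one English string
segments = seg.segment(txt)
results = translator(segments)
english = ' '.join(r['translation_text'] for r in results)
print(english)
```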
Features
- Open-Source Training: Trained exclusively on openly licensed corpora, avoiding web-crawled or other machine-translated data.
- Back-Translation: Uses back-translation and a newly built CC0 corpus to achieve comparable performance despite low-resource training.
Documentation
Model Details
This is a translation model based on the Marian MT 6-layer encoder-decoder Transformer architecture with a SentencePiece tokenizer.
| Property | Details |
|----------|---------|
| Developed by | ELAN MITSUA Project / Abstract Engine |
| Model Type | Translation |
| Source Language | Japanese |
| Target Language | English |
| License | CC BY-SA 4.0 |
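The model can also be driven without the pipeline helper. A minimal sketch using the standard transformers Marian classes, assuming the checkpoint is compatible with them (it is based on Marian MT):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Mitsua/elan-mt-bt-ja-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# beam search with num_beams=4 matches the setting noted under Evaluation
inputs = tokenizer('こんにちは。私はAIです。', return_tensors='pt')
outputs = model.generate(**inputs, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```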
Training Data
The dataset collection heavily referenced the FuguMT author's blog post.
Training Procedure
The training process and hyperparameter tuning followed "Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model".
- Train a SentencePiece tokenizer with a 32k vocabulary on a 4M-line openly licensed corpus (see the sketches after this list).
- Train an `en-ja` back-translation model on the 4M-line openly licensed corpus for 6 epochs (ElanMT-base-en-ja).
- Train a `ja-en` base translation model on the 4M-line openly licensed corpus for 6 epochs (ElanMT-base-ja-en).
- Translate 20M lines of `en` Wikipedia to `ja` using the back-translation model (sketched below).
- Train 4 `ja-en` models, fine-tuned from the ElanMT-base-ja-en checkpoint, for 6 epochs on 24M lines of training data augmented with the back-translated data.
- Merge the 4 trained models that produce the best validation score on the FLORES+ dev split (one possible merge, uniform weight averaging, is sketched below).
- Fine-tune the merged model on a 1M-line high-quality corpus subset for 5 epochs.
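The training scripts themselves are not published; the sketches below are minimal illustrations of three of the steps above, and all file names and checkpoint names in them are assumptions. First, training the SentencePiece tokenizer (step 1) with the standard `sentencepiece` trainer:

```python
import sentencepiece as spm

# 'corpus_4m.txt' is a hypothetical stand-in for the 4M-line corpus
spm.SentencePieceTrainer.train(
    input='corpus_4m.txt',
    model_prefix='elanmt_sp',
    vocab_size=32000,
)
```

Generating the back-translated data (step 4) amounts to running the `en-ja` model over English text and pairing each output with its source line:

```python
from transformers import pipeline

# the hub name is assumed; the card only calls this checkpoint "ElanMT-base-en-ja"
bt = pipeline('translation', model='Mitsua/elan-mt-base-en-ja')

en_lines = ['The cat sat on the mat.']  # stand-in for the 20M Wikipedia lines
ja_lines = [r['translation_text'] for r in bt(en_lines)]
pairs = list(zip(ja_lines, en_lines))  # synthetic ja-en training pairs
```

The card does not specify how the 4 models are merged; uniform weight averaging is one common approach and is what this sketch assumes:

```python
import torch
from transformers import MarianMTModel

paths = ['ckpt-1', 'ckpt-2', 'ckpt-3', 'ckpt-4']  # hypothetical local checkpoints
states = [MarianMTModel.from_pretrained(p).state_dict() for p in paths]

# average every parameter tensor across the 4 checkpoints
merged_state = {k: torch.stack([s[k].float() for s in states]).mean(dim=0)
                for k in states[0]}

merged = MarianMTModel.from_pretrained(paths[0])
merged.load_state_dict(merged_state)
merged.save_pretrained('elan-mt-merged')
```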
Evaluation

- *1 Tested on `transformers==4.29.2` and `num_beams=4`.
- *2 BLEU score is calculated by `sacreBLEU`.
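For reference, a minimal sketch of computing a BLEU score with sacreBLEU; the sentences here are hypothetical placeholders, not the actual evaluation data:

```python
import sacrebleu

hypotheses = ['Hello. I am an AI.']    # model outputs (placeholder)
references = [['Hello, I am an AI.']]  # one reference stream (placeholder)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f'BLEU = {bleu.score:.1f}')
```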
License
This model is licensed under CC BY-SA 4.0.
Important Note
The translated result may be very incorrect, harmful or biased. The model was developed to investigate achievable performance with only a relatively small, licensed corpus, and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.