ElanMT
ElanMT is a Japanese-to-English translation model developed by the ELAN MITSUA Project / Abstract Engine. Despite being trained with relatively limited resources, it achieves comparable performance to existing open translation models, thanks to back-translation and a newly built CC0 corpus.
Quick Start
Installation
First, install the necessary Python packages:

```bash
pip install transformers accelerate sentencepiece
```

- This model is verified on `transformers==4.40.2`.
Running the Model
```python
from transformers import pipeline

translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
# input: "Hello. I am an AI." -- returns a list like [{'translation_text': ...}]
translator('こんにちは。私はAIです。')
```
Handling Longer Texts
For longer inputs consisting of multiple sentences, using pySBD for sentence segmentation is recommended. Install it first:

```bash
pip install transformers accelerate sentencepiece pysbd
```
Then run the following code:
```python
import pysbd

# split the input into sentences before passing it to the translator
seg = pysbd.Segmenter(language="ja", clean=False)
txt = 'こんにちは。私はAIです。お元気ですか?'  # "Hello. I am an AI. How are you?"
print(translator(seg.segment(txt)))
```
This idea is from the FuguMT repo.
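Since the pipeline returns one result per segment, the per-sentence outputs can be stitched back together. A minimal sketch, reusing the `translator`, `seg`, and `txt` objects defined above:

```python
# translate each sentence, then join the results into one English string
segments = seg.segment(txt)
results = translator(segments)
english = ' '.join(r['translation_text'] for r in results)
print(english)
```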
Features
- Open-Source Training: Trained exclusively on openly licensed corpora, avoiding web-crawled or other machine-translated data.
- Back-Translation: Uses back-translation and a newly built CC0 corpus to achieve comparable performance despite low-resource training.
Documentation
Model Details
This is a translation model based on the Marian MT 6-layer encoder-decoder Transformer architecture with a SentencePiece tokenizer.
| Property | Details |
|----------|---------|
| Developed by | ELAN MITSUA Project / Abstract Engine |
| Model Type | Translation |
| Source Language | Japanese |
| Target Language | English |
| License | CC BY-SA 4.0 |
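The model can also be driven without the pipeline helper. A minimal sketch using the standard transformers Marian classes, assuming the checkpoint is compatible with them (it is based on Marian MT):

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Mitsua/elan-mt-bt-ja-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# beam search with num_beams=4 matches the setting noted under Evaluation
inputs = tokenizer('こんにちは。私はAIです。', return_tensors='pt')
outputs = model.generate(**inputs, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```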
Training Data
The dataset collection heavily referenced the FuguMT author's blog post.
Training Procedure
The training process and hyperparameter tuning followed "Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model".
- Train a SentencePiece tokenizer with a 32k vocabulary on a 4M-line openly licensed corpus (see the sketches after this list).
- Train an `en-ja` back-translation model on the 4M-line openly licensed corpus for 6 epochs (ElanMT-base-en-ja).
- Train a `ja-en` base translation model on the 4M-line openly licensed corpus for 6 epochs (ElanMT-base-ja-en).
- Translate 20M lines of `en` Wikipedia to `ja` using the back-translation model (sketched below).
- Train 4 `ja-en` models, fine-tuned from the ElanMT-base-ja-en checkpoint, for 6 epochs on 24M lines of training data augmented with the back-translated data.
- Merge the 4 trained models that produce the best validation score on the FLORES+ dev split (one possible merge, uniform weight averaging, is sketched below).
- Fine-tune the merged model on a 1M-line high-quality corpus subset for 5 epochs.
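The training scripts themselves are not published; the sketches below are minimal illustrations of three of the steps above, and all file names and checkpoint names in them are assumptions. First, training the SentencePiece tokenizer (step 1) with the standard `sentencepiece` trainer:

```python
import sentencepiece as spm

# 'corpus_4m.txt' is a hypothetical stand-in for the 4M-line corpus
spm.SentencePieceTrainer.train(
    input='corpus_4m.txt',
    model_prefix='elanmt_sp',
    vocab_size=32000,
)
```

Generating the back-translated data (step 4) amounts to running the `en-ja` model over English text and pairing each output with its source line:

```python
from transformers import pipeline

# the hub name is assumed; the card only calls this checkpoint "ElanMT-base-en-ja"
bt = pipeline('translation', model='Mitsua/elan-mt-base-en-ja')

en_lines = ['The cat sat on the mat.']  # stand-in for the 20M Wikipedia lines
ja_lines = [r['translation_text'] for r in bt(en_lines)]
pairs = list(zip(ja_lines, en_lines))  # synthetic ja-en training pairs
```

The card does not specify how the 4 models are merged; uniform weight averaging is one common approach and is what this sketch assumes:

```python
import torch
from transformers import MarianMTModel

paths = ['ckpt-1', 'ckpt-2', 'ckpt-3', 'ckpt-4']  # hypothetical local checkpoints
states = [MarianMTModel.from_pretrained(p).state_dict() for p in paths]

# average every parameter tensor across the 4 checkpoints
merged_state = {k: torch.stack([s[k].float() for s in states]).mean(dim=0)
                for k in states[0]}

merged = MarianMTModel.from_pretrained(paths[0])
merged.load_state_dict(merged_state)
merged.save_pretrained('elan-mt-merged')
```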
Evaluation

- *1 Tested on `transformers==4.29.2` and `num_beams=4`.
- *2 BLEU score is calculated by `sacreBLEU`.
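For reference, a minimal sketch of computing a BLEU score with sacreBLEU; the sentences here are hypothetical placeholders, not the actual evaluation data:

```python
import sacrebleu

hypotheses = ['Hello. I am an AI.']    # model outputs (placeholder)
references = [['Hello, I am an AI.']]  # one reference stream (placeholder)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f'BLEU = {bleu.score:.1f}')
```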
License
This model is licensed under CC BY-SA 4.0.
Important Note
The translated result may be very incorrect, harmful or biased. The model was developed to investigate achievable performance with only a relatively small, licensed corpus, and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.