mbart-ja-en
This model is based on facebook/mbart-large-cc25 and fine-tuned on the JESC dataset to provide high-quality Japanese-to-English translation.
🚀 Quick Start
This section shows how to get started with the mbart-ja-en model for Japanese-to-English translation.
from transformers import (
    MBartForConditionalGeneration, MBartTokenizer
)

# Load the fine-tuned model and its JESC-trained tokenizer
tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")
model = MBartForConditionalGeneration.from_pretrained("ken11/mbart-ja-en")

# Tokenize a Japanese input sentence ("Hello")
inputs = tokenizer("こんにちは", return_tensors="pt")

# Generate, starting decoding from the en_XX language-code token
translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    early_stopping=True,
    max_length=48,
)
pred = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(pred)
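mBART models select the output language through dedicated language-code tokens, so setting decoder_start_token_id to the ID of en_XX forces the model to decode English.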
✨ Features
- Based on a large-scale model: Built on [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25), leveraging the pre-trained knowledge of that large multilingual model.
- Fine-tuned on task-specific data: Fine-tuned on the JESC dataset, which adapts it to the Japanese-to-English translation task.
📦 Installation
This model requires the transformers library, which can be installed with:

pip install transformers
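The MBartTokenizer is backed by sentencepiece, so the sentencepiece package must also be installed:

pip install sentencepiece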
💻 Usage Examples
Basic Usage
The basic usage is identical to the Quick Start example above: load the tokenizer and model from ken11/mbart-ja-en, tokenize the Japanese input, and call model.generate() with decoder_start_token_id set to the en_XX language code.
Advanced Usage
from transformers import (
    MBartForConditionalGeneration, MBartTokenizer
)

tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")
model = MBartForConditionalGeneration.from_pretrained("ken11/mbart-ja-en")

# A somewhat longer Japanese input ("This sentence is a somewhat long sentence.")
text = "この文章はやや長い文章です。"
inputs = tokenizer(text, return_tensors="pt")

# Raise max_length so longer inputs are not cut off mid-translation
translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    early_stopping=True,
    max_length=128,
)
pred = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(pred)
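Generation can also be tuned through additional generate() arguments. The following is a minimal sketch (not from the original card; the example sentences are arbitrary) showing batched translation with beam search, reusing the model and tokenizer loaded above:

# Translate several sentences at once; padding aligns the batch
texts = ["おはようございます。", "ありがとうございました。"]
inputs = tokenizer(texts, return_tensors="pt", padding=True)

translated_tokens = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    num_beams=5,          # beam search typically improves translation quality
    early_stopping=True,  # stop each beam at the end-of-sequence token
    max_length=48,
)
preds = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
for src, pred in zip(texts, preds):
    print(src, "->", pred)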
📚 Documentation
Training Data
I used the JESC dataset for training.
Thank you for publishing such a large dataset.
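For reference, this is a minimal sketch of how a Japanese-English sentence pair can be turned into an mBART training example with the standard transformers API; the sentence pair is illustrative, and this is not the author's actual training script (note also that this model's tokenizer was retrained on JESC, as described below):

from transformers import MBartTokenizer

# src_lang/tgt_lang control which language-code tokens are attached
tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="ja_XX", tgt_lang="en_XX"
)
ja = "こんにちは"  # illustrative source sentence
en = "hello"       # illustrative target sentence
batch = tokenizer(ja, text_target=en, return_tensors="pt")
# batch["input_ids"] holds the Japanese source; batch["labels"] holds the English target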
Tokenizer
The tokenizer uses a sentencepiece model trained on the JESC dataset.
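Assuming the released tokenizer files, the resulting subword segmentation can be inspected directly, for example:

from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("ken11/mbart-ja-en")
# Show how a Japanese sentence is split into sentencepiece subwords
print(tokenizer.tokenize("こんにちは"))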
Note
The model was evaluated with sacrebleu on the [JEC Basic Sentence Data of Kyoto University](https://nlp.ist.i.kyoto-u.ac.jp/EN/?JEC+Basic+Sentence+Data#i0163896); the resulting score was 18.18.
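For context, a corpus-level score of this kind can be computed with the sacrebleu library along the following lines; the hypothesis and reference lists below are placeholders, not the actual evaluation data:

import sacrebleu

hypotheses = ["hello"]    # model translations (placeholder)
references = [["hello"]]  # one list per reference set (placeholder)
score = sacrebleu.corpus_bleu(hypotheses, references)
print(score.score)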
📄 License
This project is licensed under the MIT License.