# MITRE 913M
MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation. It supports direct translation across 552 directions covering 24 languages from 5 language families. This repository lets you use the pre-trained model for inference.
## Quick Start
Before loading the tokenizer, install SentencePiece with `pip install sentencepiece`. Then you can load the tokenizer and the model.
## Usage Examples
### Basic Usage
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
```
### Advanced Usage
To use this model locally and inspect the code, you can clone this hub.
```python
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
```
After getting the model and tokenizer objects, you can perform translation.
```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"  # expected translation

model.half()
model.cuda()  # move the model to the GPU before generating on GPU tensors
model.eval()

src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text], target_language="zh")
generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
```
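The steps above can be wrapped in a small helper. This is only a sketch, not part of the released API: it assumes nothing beyond the `encode_source_tokens_to_input_ids`, `generate`, and `batch_decode` methods shown above, and the `device` handling is our own addition.

```python
def translate(model, tokenizer, texts, target_language, device="cuda"):
    """Translate a batch of texts into target_language.

    Sketch built on the tokenizer/model methods shown above; the
    `device` argument is an assumption, not part of the released API.
    """
    src_tokens = tokenizer.encode_source_tokens_to_input_ids(
        texts, target_language=target_language
    )
    generated = model.generate(src_tokens.to(device))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

For example, `translate(model, tokenizer, ["I have a red apple."], "zh")` reproduces the single-sentence call above, while passing several sentences translates them as one batch.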
## Documentation
The registering technique is introduced in our paper. If you want to reproduce the data mining and training, please refer to this repository. An alternative version of MITRE with 466M parameters is also available in this repository.
## Technical Details
We generally follow the style of M2M, but make some necessary improvements to reduce generation cost; see the `generate()` implementation in `modeling_mitre.py` for details. Additionally, we plan to implement FlashAttention-2 to further speed up our models and will update the repository as soon as possible.
## License
This project is licensed under the MIT license.
## Languages covered
- Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
- Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
- Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
- Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
- Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
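As a sanity check on the figures in the introduction, the 24 language codes above yield 24 × 23 = 552 many-to-many translation directions:

```python
# Language codes listed above, grouped by family.
FAMILIES = {
    "Germanic": ["en", "de", "nl", "sv", "da", "af"],
    "Romance": ["fr", "es", "it", "pt", "ro"],
    "Slavic": ["ru", "cs", "pl", "bg", "uk"],
    "Malayo-Polynesian": ["id", "ms", "jv", "tl"],
    "Asian": ["zh", "ja", "ko", "vi"],
}

languages = [code for codes in FAMILIES.values() for code in codes]
print(len(languages))                         # 24 languages
print(len(languages) * (len(languages) - 1))  # 552 translation directions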
## BibTeX entry and citation info
```bibtex
@misc{qu2025registeringsourcetokenstarget,
  title={Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation},
  author={Zhi Qu and Yiran Wang and Jiannan Mao and Chenchen Ding and Hideki Tanaka and Masao Utiyama and Taro Watanabe},
  year={2025},
  eprint={2501.02979},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.02979},
}
```