mbart-large-51-myv-mul-v1 Open-source Translation Model - Supports Translating Erzya into 11 Languages

Mbart Large 51 Myv Mul V1

Developed by slone

This is a neural machine translation model that translates from Erzya into 11 languages. It is fine-tuned from mbart-large-50 with an extended vocabulary.

Machine Translation

Transformers

Supports Multiple Languages #Erzya to multilingual translation #Low-resource language support #Fine-tuned from mBART

Release date: 9/15/2022

Model Overview

This model translates from Erzya (myv, Cyrillic script) into Russian, Finnish, German, Spanish, English, Hindi, Chinese, Turkish, Ukrainian, French, and Arabic. It is part of the first neural machine translation system for the Erzya language.

Model Features

Multilingual support

Supports translation from Erzya into 11 languages

Specialized optimization

Adds one extra language token (myv_XX) and 19K new BPE tokens for Erzya

Two-stage training

First fine-tuned for Erzya-to-Russian translation, then extended to all 11 target languages

Model Capabilities

Text translation

One-to-many translation from Erzya into 11 languages

Use Cases

Language services

Access to Erzya content

Helps non-Erzya speakers read and understand Erzya texts

Translates Erzya text into 11 languages

Cultural preservation

Promotes digital preservation and usage of the Erzya language

Provides modern machine translation support for minority languages

🚀 Erzya Language Translation Model

This model translates texts from the Erzya language (myv, Cyrillic script) into 11 other languages: ru, fi, de, es, en, hi, zh, tr, uk, fr, ar. Check out its demo!

It is described in the paper "The first neural machine translation system for the Erzya language".

✨ Features

- Multilingual translation: translates from Erzya into 11 different languages.
- Based on mbart-large-50: built upon facebook/mbart-large-50, with an updated vocabulary and checkpoint:
  - An extra language token myv_XX and 19K new BPE tokens are added for the Erzya language.
  - Fine-tuned for translation from Erzya: first to Russian, then to all 11 languages.
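
To make the "extra language token" concrete, the toy sketch below reproduces the ID arithmetic that the fix_tokenizer helper in the usage example relies on: `<mask>` receives the ID after the last language code. The placement of language codes directly after the SentencePiece pieces is an assumption about the MBart-50 tokenizer layout, and the vocabulary size and the short list of codes are hypothetical values chosen for readability, not the model's real ones.

```python
FAIRSEQ_OFFSET = 1  # plays the role of tokenizer.fairseq_offset in the usage code


def toy_layout(sp_size, lang_codes):
    """Return (lang_code_to_id, mask_id) for a toy vocabulary.

    Assumption: language codes are numbered consecutively right after the
    SentencePiece pieces, and '<mask>' takes the ID after the last language code.
    """
    lang_code_to_id = {
        code: sp_size + FAIRSEQ_OFFSET + i for i, code in enumerate(lang_codes)
    }
    mask_id = sp_size + len(lang_code_to_id) + FAIRSEQ_OFFSET
    return lang_code_to_id, mask_id


TOY_SP_SIZE = 100  # hypothetical; the real vocabulary is far larger
BASE_CODES = ['en_XX', 'ru_RU', 'fi_FI']  # hypothetical shortened list of language codes

_, mask_before = toy_layout(TOY_SP_SIZE, BASE_CODES)
codes_after, mask_after = toy_layout(TOY_SP_SIZE, BASE_CODES + ['myv_XX'])

# The new code takes the ID that '<mask>' held before, and '<mask>' moves up by one.
print(mask_before, codes_after['myv_XX'], mask_after)  # prints: 104 104 105
```

Under these assumptions, appending one language code shifts only `<mask>`; every other token keeps its ID, which is why the helper has to rebuild the lookup tables but leaves the rest of the vocabulary alone.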

📦 Installation

The original model card provides no specific installation steps. The usage example requires the transformers library together with PyTorch and sentencepiece.

💻 Usage Examples

Basic Usage

from transformers import MBartForConditionalGeneration, MBart50Tokenizer


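def fix_tokenizer(tokenizer):
    """ Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) """
    # Note: this helper touches tokenizer internals, so it may depend on the installed transformers version.
    # Vocabulary size, not counting 'myv_XX' if it is already registered as an added token
    old_len = len(tokenizer) - int('myv_XX' in tokenizer.added_tokens_encoder)
    # Assign 'myv_XX' the last ID of that vocabulary
    tokenizer.lang_code_to_id['myv_XX'] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = 'myv_XX'
    # Recompute the ID of '<mask>' so that it follows all language codes
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    # Rebuild the token <-> ID lookup tables so that they include the new language code
    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if 'myv_XX' not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append('myv_XX')
    # Clear the added-token table so that 'myv_XX' is looked up as a language code
    tokenizer.added_tokens_encoder = {}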


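def translate(text, model, tokenizer, src='ru_RU', trg='myv_XX', max_length='auto', num_beams=3, repetition_penalty=5.0, train_mode=False, n_out=None, **kwargs):
    """ Translate a string or a list of strings from the src language to the trg language (mBART-style codes) """
    tokenizer.src_lang = src
    # padding=True lets a list of strings of different lengths be batched into one tensor
    encoded = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=1024)
    if max_length == 'auto':
        # Cap the output length relative to the input length in tokens
        max_length = int(32 + 1.5 * encoded.input_ids.shape[1])
    if train_mode:
        model.train()
    else:
        model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        # Force the target language code as the first generated token
        forced_bos_token_id=tokenizer.lang_code_to_id[trg],
        max_length=max_length,
        num_beams=num_beams,
        repetition_penalty=repetition_penalty,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    # Return a single string for a single input, unless several outputs were requested
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out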
    

mname = 'slone/mbart-large-51-myv-mul-v1'
model = MBartForConditionalGeneration.from_pretrained(mname)
tokenizer = MBart50Tokenizer.from_pretrained(mname)
fix_tokenizer(tokenizer)


print(translate('Шумбрат, киска!', model, tokenizer, src='myv_XX', trg='ru_RU'))
# Привет, собака!   # "Hello, dog!" (indeed, this is how "киска" translates from Erzya)
print(translate('Шумбрат, киска!', model, tokenizer, src='myv_XX', trg='en_XX'))
# Hi, dog!
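
The `max_length='auto'` default in translate caps the output at 32 tokens plus 1.5 times the input length in tokens, where the input length includes the special tokens the tokenizer adds. The standalone sketch below isolates that rule so its effect is easy to check; the function name auto_max_length is hypothetical and not part of the original code, and the token counts are arbitrary examples.

```python
def auto_max_length(n_input_tokens: int) -> int:
    """Output-length cap used when max_length='auto': 32 + 1.5 * input length, truncated to an int."""
    return int(32 + 1.5 * n_input_tokens)


# Token counts below are arbitrary examples, not measurements from the model.
for n in (4, 20, 1024):
    print(n, auto_max_length(n))
# prints:
# 4 38
# 20 62
# 1024 1568
```

The constant term leaves room for short inputs whose translations are longer than the source, while the 1.5 factor keeps long inputs from generating unboundedly.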

📚 Documentation

Model Information

Property	Details
Model Type	Model for translating texts from Erzya to 11 other languages
Training Data	`slone/myv_ru_2022`, `yhavinga/ccmatrix`

License

This model is licensed under cc-by-sa-4.0.

Supported Languages

The model covers 12 languages: Erzya (myv) as the source language, and ru, fi, de, es, en, hi, zh, tr, uk, fr, ar as target languages.
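
The usage example passes mBART-style language codes such as ru_RU and en_XX, while the list above uses two-letter codes. A small lookup table can bridge the two. The sketch below is a convenience under stated assumptions, not part of the original model card: apart from myv_XX, ru_RU, and en_XX, the codes are assumed to match the standard mBART-50 code list and should be checked against tokenizer.lang_code_to_id. The names MBART_CODES, to_mbart_code, and translate_to_all are hypothetical.

```python
# Two-letter codes from the model card mapped to mBART-style language codes.
# 'myv_XX' is the extra token this model adds; the others are assumed to match
# the standard mBART-50 code list (verify against tokenizer.lang_code_to_id).
MBART_CODES = {
    'myv': 'myv_XX',
    'ru': 'ru_RU',
    'fi': 'fi_FI',
    'de': 'de_DE',
    'es': 'es_XX',
    'en': 'en_XX',
    'hi': 'hi_IN',
    'zh': 'zh_CN',
    'tr': 'tr_TR',
    'uk': 'uk_UA',
    'fr': 'fr_XX',
    'ar': 'ar_AR',
}


def to_mbart_code(lang):
    """Map a short code such as 'ru' to its mBART-style code; pass full codes through."""
    if lang in MBART_CODES.values():
        return lang
    try:
        return MBART_CODES[lang]
    except KeyError:
        raise ValueError(f'Unsupported language: {lang!r}') from None


def translate_to_all(text, translate_fn, targets=None):
    """Translate Erzya text into several languages and return {language: translation}.

    translate_fn is any callable taking (text, src=..., trg=...), for example
    a functools.partial of the translate helper with model and tokenizer bound.
    """
    if targets is None:
        targets = [code for code in MBART_CODES if code != 'myv']
    return {
        lang: translate_fn(text, src='myv_XX', trg=to_mbart_code(lang))
        for lang in targets
    }
```

With the model loaded as in the usage example, `translate_to_all('Шумбрат, киска!', functools.partial(translate, model=model, tokenizer=tokenizer))` would return a dictionary keyed by the two-letter codes. Passing the translation function as an argument keeps the helper independent of how the model is loaded.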
