Open-source Wav2vec2-xls-r-2b-22-to-16 Model - A Speech Translation Tool Supporting 22 Spoken Input Languages and 16 Written Output Languages

Wav2vec2 Xls R 2b 22 To 16

Developed by facebook

Facebook's Wav2Vec2 XLS-R model fine-tuned for multilingual speech translation, translating speech in 22 input languages into text in 16 output languages.

Speech Recognition

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multilingual Speech Translation #Large Model Speech Processing #Real-time Speech Transcription

Downloads 38

Release Time: 3/2/2022

Model Overview

This is a speech translation model based on the SpeechEncoderDecoder architecture, capable of translating multiple spoken languages into written languages. The encoder is based on wav2vec2-xls-r-2b, and the decoder is based on mbart-large-50, fine-tuned on the Covost2 dataset.

Model Features

Multilingual Support

Translates speech in 22 input languages into text in 16 output languages.

Large-scale Pretraining

Based on the 2-billion-parameter Wav2Vec2-XLS-R model, with powerful speech feature extraction capabilities.

End-to-end Translation

Direct translation from speech to target language text, without intermediate transcription steps.

Model Capabilities

Speech Recognition

Multilingual Translation

Speech-to-Text Conversion

Use Cases

International Communication

Real-time Speech Translation

Translates speech in meetings or conversations into other languages in real-time.

Supports multiple input and output language combinations.

Media Processing

Video Subtitle Generation

Automatically generates translated subtitles for foreign-language videos.

Supports subtitle generation for multiple language pairs.

🚀 Wav2Vec2-XLS-R-2B-22-16 (XLS-R-Any-to-Any)

Facebook's Wav2Vec2 XLS-R fine-tuned for Speech Translation, enabling translation across multiple languages.

This is a SpeechEncoderDecoderModel. The encoder was warm-started from the facebook/wav2vec2-xls-r-2b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. The encoder-decoder model was then fine-tuned on {input_lang} -> {output_lang} translation pairs of the Covost2 dataset.

The model can translate from the following spoken languages {input_lang} to the following written languages {output_lang}:

{input_lang} -> {output_lang}

with {input_lang} one of:

{en, fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy}

and {output_lang}:

{en, de, tr, fa, sv-SE, mn, zh-CN, cy, ca, sl, et, id, ar, ta, lv, ja}
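
Because the two sets differ, some languages are accepted only as spoken input (for example fr or ru). A minimal sketch of a membership check, with both sets copied from the lists above; the function name `is_supported_pair` is hypothetical, and the check says nothing about translation quality for a given pair.

```python
# Language codes copied from the input and output lists above.
INPUT_LANGS = {
    "en", "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et",
    "mn", "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
}
OUTPUT_LANGS = {
    "en", "de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy",
    "ca", "sl", "et", "id", "ar", "ta", "lv", "ja",
}


def is_supported_pair(input_lang: str, output_lang: str) -> bool:
    """Return True if the spoken language and the written language are both listed."""
    return input_lang in INPUT_LANGS and output_lang in OUTPUT_LANGS
```

For instance, `is_supported_pair("fr", "de")` holds, while `is_supported_pair("fr", "fr")` does not, since fr is not among the output languages.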

🚀 Quick Start

✨ Features

Supports speech translation across multiple languages.
Based on well-known checkpoints for encoder and decoder.
Can be easily tested on a dedicated space.

💻 Usage Examples

Basic Usage

The model can be tested on a dedicated space. You can select the target language, record some audio in any of the above-mentioned input languages, and then sit back and see how well the checkpoint can translate the input.

Advanced Usage

As this is a standard sequence-to-sequence transformer model, you can use the generate method to generate the transcripts by passing the speech features to the model.

You can use the model directly via the ASR pipeline. By default, the checkpoint will translate spoken English to written German. To change the written target language, you need to pass the correct forced_bos_token_id to generate(...) to condition the decoder on the correct target language.

To select the correct forced_bos_token_id for your chosen language id, please make use of the following mapping:

MAPPING = {
    "en": 250004,
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
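
Note that the mapping keys are bare codes such as "sv" and "zh", while the output language list above writes "sv-SE" and "zh-CN". The sketch below accepts either form by dropping the region suffix before the lookup. The name `target_token_id` is hypothetical, and the three-entry excerpt only repeats values from the mapping above so that the snippet runs on its own.

```python
def target_token_id(lang: str, mapping: dict) -> int:
    """Return the forced_bos_token_id for `lang`, accepting 'sv' as well as 'sv-SE'."""
    # Keep only the part before the region suffix, e.g. "zh-CN" -> "zh".
    key = lang.split("-")[0].lower()
    if key not in mapping:
        raise ValueError(f"unsupported target language: {lang!r}")
    return mapping[key]


# Excerpt of the mapping above, repeated so the example is self-contained.
MAPPING_EXCERPT = {"de": 250003, "sv": 250042, "zh": 250025}

print(target_token_id("sv-SE", MAPPING_EXCERPT))  # 250042
print(target_token_id("zh-CN", MAPPING_EXCERPT))  # 250025
```

In practice the full MAPPING dictionary would be passed instead of the excerpt.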

As an example, if you would like to translate to Swedish, you can do the following:

from datasets import load_dataset
from transformers import pipeline

# select correct `forced_bos_token_id`
forced_bos_token_id = MAPPING["sv"]

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)

or step-by-step as follows:

import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# select correct `forced_bos_token_id`
forced_bos_token_id = MAPPING["sv"]

# the sampling rate is stored next to the array, not inside it
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
# the Wav2Vec2 feature extractor returns `input_values`; pass them as the first argument of `generate`
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"], forced_bos_token_id=forced_bos_token_id)
transcription = processor.batch_decode(generated_ids)
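
The step-by-step procedure can also be wrapped in a small helper. This is a sketch under stated assumptions rather than part of the model card: `translate_speech` is a hypothetical name, the model and processor are passed in so that nothing is downloaded when the function is defined, and the processor is assumed to return `input_values` and `attention_mask`, as Wav2Vec2 feature extractors do.

```python
def translate_speech(model, processor, audio_array, sampling_rate, forced_bos_token_id):
    """Translate one utterance and return the decoded target-language text."""
    # Turn the raw waveform into model inputs (PyTorch tensors).
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    # Condition the decoder on the target language via forced_bos_token_id.
    generated_ids = model.generate(
        inputs["input_values"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=forced_bos_token_id,
    )
    # Skip special tokens (language tag, end of sequence) in the decoded string.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

With the objects loaded as in the step-by-step snippet, a call could look like `translate_speech(model, processor, ds[0]["audio"]["array"], ds[0]["audio"]["sampling_rate"], MAPPING["sv"])`.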

📚 Documentation

The model translates speech in the input languages listed above into text in the listed output languages. It is based on well-known checkpoints and fine-tuned on the Covost2 dataset. You can test it on a dedicated space and use it via the ASR pipeline with the appropriate forced_bos_token_id setting.

🔧 Technical Details

The encoder of the model was warm-started from the facebook/wav2vec2-xls-r-2b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. The model was then fine-tuned on translation pairs from the Covost2 dataset.

📄 License

This model is licensed under the Apache-2.0 license.

Additional Information

Supported Languages and Datasets

Property	Details
Supported Languages	Input: `en`, `fr`, `de`, `es`, `ca`, `it`, `ru`, `zh-CN`, `pt`, `fa`, `et`, `mn`, `nl`, `tr`, `ar`, `sv-SE`, `lv`, `sl`, `ta`, `ja`, `id`, `cy`; Output: `en`, `de`, `tr`, `fa`, `sv-SE`, `mn`, `zh-CN`, `cy`, `ca`, `sl`, `et`, `id`, `ar`, `ta`, `lv`, `ja`
Datasets	`common_voice`, `multilingual_librispeech`, `covost2`

More XLS-R models for `{lang}` -> `en` Speech Translation
