Wav2Vec2-XLS-R-2B-22-16 (XLS-R-Any-to-Any)
Facebook's Wav2Vec2 XLS-R fine-tuned for Speech Translation, covering 22 spoken input languages and 16 written output languages.

This is a SpeechEncoderDecoderModel. The encoder was warm-started from the facebook/wav2vec2-xls-r-2b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. The encoder-decoder model was then fine-tuned on {input_lang} -> {output_lang} translation pairs of the Covost2 dataset.
The model can translate from the spoken languages {input_lang} to the written languages {output_lang}, that is, {input_lang} -> {output_lang}, with {input_lang} one of:
{en, fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy}
and {output_lang} one of:
{en, de, tr, fa, sv-SE, mn, zh-CN, cy, ca, sl, et, id, ar, ta, lv, ja}
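Before sending audio to the model, it can help to check that a requested language pair is covered. The following is a minimal sketch, not part of the model's API: the names `INPUT_LANGS`, `OUTPUT_LANGS`, and `is_supported_pair` are hypothetical, and the codes are copied from the two lists above.

```python
# Language codes copied from the lists above (hypothetical constant names).
INPUT_LANGS = frozenset({
    "en", "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et",
    "mn", "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
})
OUTPUT_LANGS = frozenset({
    "en", "de", "tr", "fa", "sv-SE", "mn", "zh-CN", "cy",
    "ca", "sl", "et", "id", "ar", "ta", "lv", "ja",
})


def is_supported_pair(input_lang: str, output_lang: str) -> bool:
    """Return True if the spoken/written language pair appears in the lists."""
    return input_lang in INPUT_LANGS and output_lang in OUTPUT_LANGS


print(len(INPUT_LANGS), len(OUTPUT_LANGS))  # 22 16
print(is_supported_pair("fr", "ja"))        # True
print(is_supported_pair("fr", "es"))        # False: "es" is input-only
```

Note that every output language also appears in the input list, while six input languages (fr, es, it, ru, pt, nl) are input-only.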
Quick Start
Features
- Supports speech translation across multiple languages.
- Based on well-known checkpoints for encoder and decoder.
- Can be easily tested on a dedicated space.
Usage Examples
Basic Usage
The model can be tested on this space. You can select the target language, record some audio in any of the above-mentioned input languages, and then sit back and see how well the checkpoint can translate the input.
Advanced Usage
As this is a standard sequence-to-sequence transformer model, you can use the generate method to generate the translated transcripts by passing the speech features to the model.
You can use the model directly via the ASR pipeline. By default, the checkpoint will translate spoken English to written German. To change the written target language, you need to pass the correct forced_bos_token_id to generate(...) to condition the decoder on the correct target language.
To select the correct forced_bos_token_id for your chosen language ID, please make use of the following mapping:
MAPPING = {
"en": 250004,
"de": 250003,
"tr": 250023,
"fa": 250029,
"sv": 250042,
"mn": 250037,
"zh": 250025,
"cy": 250007,
"ca": 250005,
"sl": 250052,
"et": 250006,
"id": 250032,
"ar": 250001,
"ta": 250044,
"lv": 250017,
"ja": 250012,
}
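The language lists earlier in this card use region-tagged codes such as sv-SE and zh-CN, while the mapping is keyed by the bare codes sv and zh. A small lookup helper can bridge the two. This is only a sketch: `forced_bos_token_id_for` is a hypothetical name, and dropping the region suffix is an assumption inferred from the two lists, not documented behaviour.

```python
# Same table as above, repeated so that the sketch is self-contained.
MAPPING = {
    "en": 250004, "de": 250003, "tr": 250023, "fa": 250029,
    "sv": 250042, "mn": 250037, "zh": 250025, "cy": 250007,
    "ca": 250005, "sl": 250052, "et": 250006, "id": 250032,
    "ar": 250001, "ta": 250044, "lv": 250017, "ja": 250012,
}


def forced_bos_token_id_for(lang_code: str) -> int:
    """Look up the token ID for a target language such as "sv" or "sv-SE"."""
    # Assumption: "sv-SE" and "zh-CN" correspond to the keys "sv" and "zh".
    key = lang_code.split("-")[0].lower()
    if key not in MAPPING:
        raise ValueError(f"Unsupported target language: {lang_code!r}")
    return MAPPING[key]


print(forced_bos_token_id_for("sv-SE"))  # 250042
print(forced_bos_token_id_for("zh-CN"))  # 250025
```

Raising `ValueError` for unknown codes (for example "fr", which is only available as an input language) surfaces a wrong target early, before any audio is processed.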
As an example, if you would like to translate to Swedish, you can do the following:
from datasets import load_dataset
from transformers import pipeline

# Select the target language: Swedish (uses MAPPING defined above)
forced_bos_token_id = MAPPING["sv"]

# Load a small English sample set and take the path of its first audio file
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")
# Condition the decoder on Swedish via forced_bos_token_id
translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)
or step-by-step as follows:
import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Select the target language: Swedish (uses MAPPING defined above)
forced_bos_token_id = MAPPING["sv"]

# The sampling rate is stored next to the array, under ds[0]["audio"]
audio = ds[0]["audio"]
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
# Note: if your processor returns "input_values" rather than "input_features"
# (as Wav2Vec2-style feature extractors do), pass that key instead.
generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"], forced_bos_token_id=forced_bos_token_id)
transcription = processor.batch_decode(generated_ids)
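Because the target language is selected purely through forced_bos_token_id, the same audio can be translated into several languages by repeating the pipeline call with different IDs. The sketch below illustrates that loop under stated assumptions: `translate_to_many`, `TARGET_IDS`, and `fake_asr` are hypothetical names, and `fake_asr` is only a stand-in that mimics the call form `asr(audio, forced_bos_token_id=...)` used in the pipeline example above, so the snippet runs without downloading the checkpoint.

```python
from typing import Any, Callable, Dict, Iterable, Mapping

# Subset of the MAPPING table above, copied so the sketch is self-contained.
TARGET_IDS = {"de": 250003, "sv": 250042, "ja": 250012}


def translate_to_many(
    asr: Callable[..., Any],
    audio: Any,
    langs: Iterable[str],
    mapping: Mapping[str, int] = TARGET_IDS,
) -> Dict[str, Any]:
    """Call `asr` once per target language and collect the results by language."""
    results = {}
    for lang in langs:
        # Same call form as the pipeline example: the token ID picks the language.
        results[lang] = asr(audio, forced_bos_token_id=mapping[lang])
    return results


def fake_asr(audio, forced_bos_token_id):
    """Stand-in for the real pipeline; it only echoes its arguments."""
    return {"text": f"<{forced_bos_token_id}> {audio}"}


print(translate_to_many(fake_asr, "sample.flac", ["de", "sv"]))
```

With the real pipeline object passed in place of `fake_asr`, each call runs a full generation pass, so the cost grows linearly with the number of target languages.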
Documentation
The model can translate between multiple spoken and written languages as described above. It is based on well-known checkpoints and fine-tuned on the Covost2 dataset. You can test it on a dedicated space and use it via the ASR pipeline with the appropriate forced_bos_token_id for the target language.
Technical Details
The encoder of the model was warm-started from the facebook/wav2vec2-xls-r-2b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. The model was then fine-tuned on translation pairs from the Covost2 dataset.
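For readers unfamiliar with warm-starting, the sketch below shows one way such an encoder-decoder pair might be assembled with the `from_encoder_decoder_pretrained` class method of `SpeechEncoderDecoderModel` in transformers. This card does not state the exact recipe or configuration the authors used, so treat it as an illustration only; `build_warm_started_model` and the two constant names are hypothetical.

```python
# Checkpoint names taken from the description above.
ENCODER_CHECKPOINT = "facebook/wav2vec2-xls-r-2b"
DECODER_CHECKPOINT = "facebook/mbart-large-50"


def build_warm_started_model():
    """Combine a pretrained speech encoder with a pretrained text decoder.

    Assumption: this mirrors the warm start only in spirit. Any additional
    configuration the authors applied before fine-tuning is not covered here.
    """
    # Imported lazily so that defining this function does not require
    # transformers to be installed or any weights to be downloaded.
    from transformers import SpeechEncoderDecoderModel

    return SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_CHECKPOINT, DECODER_CHECKPOINT
    )
```

Calling `build_warm_started_model()` downloads both checkpoints, which amounts to several gigabytes, and the resulting model still needs fine-tuning on translation pairs before it produces useful output.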
License
This model is licensed under the apache-2.0 license.
Additional Information
Supported Languages and Datasets
| Property | Details |
| --- | --- |
| Supported Languages | Input: {en, fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy}; Output: {en, de, tr, fa, sv-SE, mn, zh-CN, cy, ca, sl, et, id, ar, ta, lv, ja} |
| Datasets | common_voice, multilingual_librispeech, covost2 |
More XLS-R models for {lang} -> en Speech Translation