Wav2Vec2-XLS-R-2b-21-EN
This is a fine-tuned speech translation model that translates from 21 spoken languages into English.
Quick Start
This model is a SpeechEncoderDecoderModel. The encoder starts from the facebook/wav2vec2-xls-r-2b checkpoint, and the decoder from the facebook/mbart-large-50 checkpoint. It has been fine-tuned on 21 {lang} -> en translation pairs of the CoVoST 2 dataset.
Supported Languages
The model can translate from the following spoken languages {lang} -> en (English):
{fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy} -> en
For more information, please refer to Section 5.1.2 of the official XLS-R paper.
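The language list above can be captured in code, for example to validate user input before audio is sent to the model. This is a minimal sketch: the constant and the helper `is_supported_source` are hypothetical names and are not part of the model or of any library. The usage examples in this card pass no language code to the model, so a check like this would live in application code.

```python
# Source-language codes exactly as listed in this model card.
SUPPORTED_SOURCE_LANGUAGES = frozenset({
    "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et", "mn",
    "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
})


def is_supported_source(lang: str) -> bool:
    """Hypothetical helper: True if `lang` is one of the listed codes.

    The comparison is exact and case-sensitive ("zh-CN", not "zh-cn").
    """
    return lang in SUPPORTED_SOURCE_LANGUAGES


# Sanity check: the card advertises 21 source languages.
assert len(SUPPORTED_SOURCE_LANGUAGES) == 21
```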
Features
- Multilingual Support: Capable of translating from multiple spoken languages to English.
- Fine-Tuned Model: Built on the pre-trained facebook/wav2vec2-xls-r-2b and facebook/mbart-large-50 checkpoints and fine-tuned on CoVoST 2 translation pairs.
Usage Examples
Demo
The model can be tested directly on the speech recognition widget on this model card! Simply record some audio in one of the possible spoken languages or pick an example audio file to see how well the checkpoint can translate the input.
Basic Usage
As this is a standard sequence-to-sequence transformer model, you can use the generate method to produce the translated transcripts by passing the speech features to the model.
You can use the model directly via the ASR pipeline:
from datasets import load_dataset
from transformers import pipeline

# Load a small dummy dataset and take the path of its first audio file.
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

# The same checkpoint supplies both the model and the feature extractor.
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-2b-21-to-en")

# The pipeline returns a dict; its "text" entry holds the English output.
translation = asr(audio_file)
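When several recordings need translating, the pipeline call can be wrapped in a small loop. The sketch below assumes only that the pipeline object is callable on a file path and returns a dict with a "text" entry, as the ASR pipeline does for a single input. `translate_files` and `fake_asr` are hypothetical names; the stub stands in for the real pipeline so the example runs without downloading the checkpoint.

```python
from typing import Callable, Dict, Iterable, List


def translate_files(asr: Callable[[str], Dict[str, str]], paths: Iterable[str]) -> List[str]:
    """Hypothetical helper: run `asr` on each path and collect the "text" field."""
    return [asr(path)["text"] for path in paths]


def fake_asr(path: str) -> Dict[str, str]:
    """Stand-in for the real pipeline; returns the same result shape."""
    return {"text": f"translation of {path}"}


# With the real pipeline, pass the `asr` object built above instead of the stub.
print(translate_files(fake_asr, ["clip_fr.wav", "clip_de.wav"]))
```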
Advanced Usage
Alternatively, run inference step by step:
import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# The sampling rate sits next to "array" in the audio dict, not inside it.
audio = ds[0]["audio"]
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")

# Assumption: the processor wraps a Wav2Vec2 feature extractor, which returns "input_values".
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)
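The step-by-step example passes a `sampling_rate` to the processor. A guard like the one below can catch a mismatch earlier, for instance right after loading a file. It is a sketch under one assumption: that the checkpoint expects 16 kHz audio, as wav2vec 2.0 and XLS-R models are pretrained on 16 kHz speech. Both `EXPECTED_SAMPLING_RATE` and `ensure_sampling_rate` are hypothetical names, not library APIs.

```python
# Assumption: 16 kHz input, matching the wav2vec 2.0 / XLS-R pretraining setup.
EXPECTED_SAMPLING_RATE = 16_000


def ensure_sampling_rate(sampling_rate: int, expected: int = EXPECTED_SAMPLING_RATE) -> int:
    """Hypothetical guard: return the rate unchanged, or raise if it differs."""
    if sampling_rate != expected:
        raise ValueError(
            f"Audio is sampled at {sampling_rate} Hz but {expected} Hz is expected; "
            "resample the waveform before calling the processor."
        )
    return sampling_rate
```

In the example above, this would be called as `ensure_sampling_rate(ds[0]["audio"]["sampling_rate"])` before building `inputs`.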
Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | SpeechEncoderDecoderModel |
| Training Data | common_voice, multilingual_librispeech, covost2 |
| Tags | speech, xls_r, automatic-speech-recognition, xls_r_translation |
| Pipeline Tag | automatic-speech-recognition |
Widget Examples
| Example Title | Audio Source |
|---|---|
| Swedish | [https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3](https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3) |
| Arabic | [https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3) |
| Russian | [https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3) |
| German | [https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3) |
| French | [https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3) |
| Indonesian | [https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3) |
| Italian | [https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3) |
| Japanese | [https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3) |
| Mongolian | [https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3) |
| Dutch | [https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3) |
| Turkish | [https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3) |
| Catalan | [https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3) |
| English | [https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3](https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3) |
License
This model is licensed under the apache-2.0 license.