Wav2Vec2 XLS-R Open-Source Speech Translation Model - Freely Translate Multilingual Speech into English

wav2vec2-xls-r-2b-21-to-en

Developed by Facebook

Facebook's Wav2Vec2 XLS-R model for multilingual speech-to-English translation tasks.

Supports Multiple Languages
Open Source
License: Apache-2.0
#Multilingual speech translation #21 languages to English #Large-scale speech model

Downloads 38

Release Time : 3/2/2022

Model Overview

This model is a speech translation model based on the Wav2Vec2 XLS-R architecture, capable of translating speech inputs from 21 languages into English text.

Model Features

Multilingual support

Supports translation of speech inputs from 21 different languages into English

Large-scale model

Uses the 2-billion-parameter Wav2Vec2 XLS-R checkpoint as its speech encoder

End-to-end translation

Direct end-to-end translation from speech input to English text without intermediate transcription steps

Model Capabilities

Speech translation

Multilingual processing

Automatic speech recognition

Use Cases

Speech translation services

Real-time speech translation

Real-time translation of foreign language speech in meetings or conversations into English

Speech content localization

Translation of foreign language podcasts, videos, etc., into English text

Assistive technology

Accessibility applications

Helping hearing-impaired individuals understand foreign language speech content

🚀 Wav2Vec2-XLS-R-2b-21-EN

This is a fine-tuned speech translation model that translates from multiple spoken languages into English.

🚀 Quick Start

This model is a SpeechEncoderDecoderModel. The encoder was initialized from the facebook/wav2vec2-xls-r-2b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. The combined model was then fine-tuned on 21 {lang} -> en translation pairs of the CoVoST 2 dataset.

Supported Languages

The model can translate from the following spoken languages {lang} -> en (English): {fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy} -> en

For more information, please refer to Section 5.1.2 of the official XLS-R paper.
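
The language list above can be captured as a small lookup for application-side validation. This is only a minimal sketch: the names `SUPPORTED_SOURCE_LANGS` and `is_supported` are illustrative and not part of the model or of any library, and the codes are copied from the list above.

```python
# Source-language codes copied from the supported-languages list above.
# SUPPORTED_SOURCE_LANGS and is_supported are hypothetical helper names.
SUPPORTED_SOURCE_LANGS = frozenset({
    "fr", "de", "es", "ca", "it", "ru", "zh-CN", "pt", "fa", "et", "mn",
    "nl", "tr", "ar", "sv-SE", "lv", "sl", "ta", "ja", "id", "cy",
})


def is_supported(lang_code: str) -> bool:
    """Return True if lang_code is one of the 21 listed source languages.

    Matching is case-insensitive, so "zh-cn" and "zh-CN" both pass.
    """
    normalized = lang_code.strip().lower()
    return normalized in {code.lower() for code in SUPPORTED_SOURCE_LANGS}


print(len(SUPPORTED_SOURCE_LANGS), is_supported("sv-se"), is_supported("en"))
```

The usage examples in this document pass no source-language argument to the model, so a check like this is useful only for rejecting unsupported inputs before inference.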

✨ Features

Multilingual Support: Translates from 21 spoken languages into English.
Fine-Tuned Model: Built from pre-trained encoder and decoder checkpoints and fine-tuned on the CoVoST 2 translation pairs.

💻 Usage Examples

Demo

The model can be tested directly on the speech recognition widget on this model card! Simply record some audio in one of the possible spoken languages or pick an example audio file to see how well the checkpoint can translate the input.

Basic Usage

Because this is a standard sequence-to-sequence transformer model, you can call the generate method on the speech features to produce the English translations.

You can use the model directly via the ASR pipeline:

from datasets import load_dataset
from transformers import pipeline

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

# the translation checkpoint is served through the "automatic-speech-recognition" pipeline task
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-2b-21-to-en")

translation = asr(audio_file)
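
The Transformers ASR pipeline returns a dict with a "text" key for a single input and a list of such dicts for a list of inputs. The sketch below normalizes both shapes into a list of strings. `extract_text` is a hypothetical helper name, and the sample dicts are made-up stand-ins for real pipeline results.

```python
from typing import List, Union


def extract_text(result: Union[dict, List[dict]]) -> List[str]:
    """Normalize pipeline output to a list of stripped strings.

    Accepts either one result dict or a list of result dicts,
    each expected to carry the generated text under the "text" key.
    """
    if isinstance(result, dict):
        result = [result]
    return [item["text"].strip() for item in result]


# Made-up stand-ins for pipeline results, used only to show the shapes.
single = {"text": " a sample english sentence "}
batch = [{"text": "first"}, {"text": "second "}]
print(extract_text(single))
print(extract_text(batch))
```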

Advanced Usage

Or step by step, as follows:

import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# raw waveform and its sampling rate for the first sample
audio = ds[0]["audio"]
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
# generate English token ids from the waveform features
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
# decode token ids to text
translation = processor.batch_decode(generated_ids, skip_special_tokens=True)
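
The processor call above is given the audio's sampling rate, and Wav2Vec2-style feature extractors expect 16 kHz input. Audio recorded at another rate therefore has to be resampled first. The sketch below shows one crude, dependency-free way to do that with linear interpolation. `resample_linear` and `TARGET_RATE` are hypothetical names, and no anti-aliasing filter is applied, so a dedicated audio library is likely the better choice in practice.

```python
from typing import List, Sequence

TARGET_RATE = 16_000  # assumption: the rate expected by the feature extractor


def resample_linear(samples: Sequence[float], orig_rate: int,
                    target_rate: int = TARGET_RATE) -> List[float]:
    """Resample a mono waveform by linear interpolation.

    Crude by design: no low-pass filtering is done before downsampling.
    """
    if orig_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_rate == target_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * target_rate / orig_rate))
    step = orig_rate / target_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * step, last)   # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of silence at 48 kHz becomes one second at 16 kHz.
print(len(resample_linear([0.0] * 48_000, 48_000)))
```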

📚 Documentation

Model Information

Property	Details
Model Type	SpeechEncoderDecoderModel
Training Data	common_voice, multilingual_librispeech, covost2
Tags	speech, xls_r, automatic-speech-recognition, xls_r_translation
Pipeline Tag	automatic-speech-recognition

Widget Examples

Example Title	Audio Source
Swedish	https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3
Arabic	https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3
Russian	https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3
German	https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3
French	https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3
Indonesian	https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3
Italian	https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3
Japanese	https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3
Mongolian	https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3
Dutch	https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3
Turkish	https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3
Catalan	https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3
English	https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3

📄 License

This model is licensed under the Apache-2.0 license.