Wav2vec2 Xls R 1b 21 To En

Developed by facebook

Facebook's Wav2Vec2 XLS-R model for multilingual speech-to-English translation tasks

Supports Multiple LanguagesOpen Source License:Apache-2.0 #Multilingual speech translation #21 languages to English #Large-scale pretraining

Downloads 511

Release Time : 3/2/2022

Model Overview

This is a model based on the SpeechEncoderDecoder architecture, capable of translating speech from 21 languages into English. The encoder is based on facebook/wav2vec2-xls-r-1b, and the decoder is based on facebook/mbart-large-50, fine-tuned on the Covost2 dataset.

Model Features

Multilingual support

Supports speech translation from 21 languages to English

Large-scale pretraining

Based on the 2-billion-parameter XLS-R model with powerful speech feature extraction capabilities

End-to-end translation

Direct end-to-end translation from speech to target language text

Model Capabilities

Speech recognition

Multilingual translation

Speech-to-text conversion

Use Cases

Speech translation

Real-time speech translation

Translates real-time speech in meetings, lectures, etc., into English

Performs excellently on the Covost2 dataset

Multilingual voice assistant

Provides multilingual input support for voice assistants

language:

multilingual
fr
de
es
ca
it
ru
zh
pt
fa
et
mn
nl
tr
ar
sv
lv
sl
ta
ja
id
cy
en datasets:
common_voice
multilingual_librispeech
covost2 tags:
speech
xls_r
automatic-speech-recognition
xls_r_translation pipeline_tag: automatic-speech-recognition license: apache-2.0 widget:
example_title: Swedish src: https://cdn-media.huggingface.co/speech_samples/cv_swedish_1.mp3
example_title: Arabic src: https://cdn-media.huggingface.co/speech_samples/common_voice_ar_19058308.mp3
example_title: Russian src: https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3
example_title: German src: https://cdn-media.huggingface.co/speech_samples/common_voice_de_17284683.mp3
example_title: French src: https://cdn-media.huggingface.co/speech_samples/common_voice_fr_17299386.mp3
example_title: Indonesian src: https://cdn-media.huggingface.co/speech_samples/common_voice_id_19051309.mp3
example_title: Italian src: https://cdn-media.huggingface.co/speech_samples/common_voice_it_17415776.mp3
example_title: Japanese src: https://cdn-media.huggingface.co/speech_samples/common_voice_ja_19482488.mp3
example_title: Mongolian src: https://cdn-media.huggingface.co/speech_samples/common_voice_mn_18565396.mp3
example_title: Dutch src: https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3
example_title: Russian src: https://cdn-media.huggingface.co/speech_samples/common_voice_ru_18849022.mp3
example_title: Turkish src: https://cdn-media.huggingface.co/speech_samples/common_voice_tr_17341280.mp3
example_title: Catalan src: https://cdn-media.huggingface.co/speech_samples/common_voice_ca_17367522.mp3
example_title: English src: https://cdn-media.huggingface.co/speech_samples/common_voice_en_18301577.mp3
example_title: Dutch src: https://cdn-media.huggingface.co/speech_samples/common_voice_nl_17691471.mp3

Wav2Vec2-XLS-R-2b-21-EN

Facebook's Wav2Vec2 XLS-R fine-tuned for Speech Translation.

model image

This is a SpeechEncoderDecoderModel model. The encoder was warm-started from the facebook/wav2vec2-xls-r-1b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. Consequently, the encoder-decoder model was fine-tuned on 21 {lang} -> en translation pairs of the Covost2 dataset.

The model can translate from the following spoken languages {lang} -> en (English):

{fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy} -> en

For more information, please refer to Section 5.1.2 of the official XLS-R paper.

Usage

Demo

The model can be tested directly on the speech recognition widget on this model card! Simple record some audio in one of the possible spoken languages or pick an example audio file to see how well the checkpoint can translate the input.

Example

As this a standard sequence to sequence transformer model, you can use the generate method to generate the transcripts by passing the speech features to the model.

You can use the model directly via the ASR pipeline

from datasets import load_dataset
from transformers import pipeline

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-1b-21-to-en")

translation = asr(audio_file)

or step-by-step as follows:

import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["array"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)

Results `{lang}` -> `en`

See the row of XLS-R (1B) for the performance on Covost2 for this model.

results image

More XLS-R models for `{lang}` -> `en` Speech Translation

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご