# S2T2-Wav2Vec2-CoVoST2-EN-AR-ST

`s2t-wav2vec2-large-en-ar` is a Speech to Text Transformer model trained for end-to-end Speech Translation (ST). It translates English speech directly into Arabic text.
## 🚀 Quick Start

This model can be used for end-to-end English speech to Arabic text translation. You can use it directly via the ASR pipeline or step by step, as shown in the usage examples below.
## ✨ Features

- End-to-End Translation: translates English speech directly into Arabic text.
- Transformer-Based: uses a transformer-based seq2seq (speech encoder-decoder) architecture.
- Pretrained Encoder: employs a pretrained Wav2Vec2 model as the encoder.
## 📦 Installation

The original document lists no specific installation steps. The usage examples below assume the `transformers`, `datasets`, `torch`, and `soundfile` packages are installed.
## 💻 Usage Examples

### Basic Usage

```python
from datasets import load_dataset
from transformers import pipeline

# Load a small LibriSpeech validation split for demonstration
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# The ASR pipeline handles feature extraction, generation, and decoding
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/s2t-wav2vec2-large-en-ar",
    feature_extractor="facebook/s2t-wav2vec2-large-en-ar",
)
translation = asr(librispeech_en[0]["file"])
```
### Advanced Usage

```python
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-ar")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-ar")

def map_to_array(batch):
    # Read the raw waveform from disk
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# The processor expects audio sampled at 16 kHz
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_features"],
    attention_mask=inputs["attention_mask"],
)
transcription = processor.batch_decode(generated_ids)
```
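The processor above expects audio sampled at 16 kHz (the LibriSpeech dummy set already is). Audio recorded at a different rate must be resampled first. Below is a minimal linear-interpolation sketch in NumPy; in practice a dedicated library such as librosa or torchaudio is preferable:

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16_000):
    """Resample a 1-D waveform via linear interpolation (illustrative only)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    # Evaluate the original signal at evenly spaced target-rate time positions
    orig_times = np.arange(len(audio)) / orig_sr
    target_times = np.arange(n_target) / target_sr
    return np.interp(target_times, orig_times, audio)

# 0.5 s of a 440 Hz tone recorded at 44.1 kHz, resampled to 16 kHz
sr = 44_100
t = np.arange(int(0.5 * sr)) / sr
tone = np.sin(2 * np.pi * 440 * t)
tone_16k = resample_linear(tone, sr)
```

Linear interpolation is enough for a sketch but performs no anti-aliasing, which matters when downsampling real speech.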
## 📚 Documentation

### Model description

S2T2 is a transformer-based seq2seq (speech encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a pretrained Wav2Vec2 as the encoder and a transformer-based decoder. The model is trained with standard autoregressive cross-entropy loss and generates the translations autoregressively.
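The autoregressive cross-entropy objective mentioned above can be illustrated with a toy example (made-up logits and a tiny vocabulary, not the actual training code): at each decoding step, the decoder's predicted distribution over the vocabulary is scored against the next target token, and the per-step negative log-likelihoods are averaged.

```python
import numpy as np

def autoregressive_cross_entropy(logits, targets):
    """Mean cross-entropy of per-step logits against next-token targets.

    logits: (steps, vocab) unnormalized scores, one row per decoding step
    targets: (steps,) index of the correct token at each step
    """
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token, averaged over steps
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy decoder outputs: 3 steps over a 5-token vocabulary
logits = np.array([
    [2.0, 0.1, 0.1, 0.1, 0.1],   # confident in token 0 (correct)
    [0.1, 0.1, 3.0, 0.1, 0.1],   # confident in token 2 (correct)
    [0.5, 0.5, 0.5, 0.5, 0.5],   # uniform: contributes log(5) to the loss
])
targets = np.array([0, 2, 4])
loss = autoregressive_cross_entropy(logits, targets)
```

During training the target tokens come from the reference translation (teacher forcing); at inference time the model instead feeds back its own predictions step by step.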
### Intended uses & limitations

This model can be used for end-to-end English speech to Arabic text translation. See the model hub to look for other S2T2 checkpoints.
### How to use

As this is a standard sequence-to-sequence transformer model, you can use the `generate` method to produce transcripts by passing the speech features to the model.
## 🔧 Technical Details

The S2T2 model was proposed in *Large-Scale Self- and Semi-Supervised Learning for Speech Translation* (https://arxiv.org/abs/2104.06678) and officially released in Fairseq.
## 📄 License

This model is licensed under the MIT license.
## Additional Information

### Evaluation results

CoVoST2 test results for en-ar (BLEU score): 20.2

For more information, please have a look at the official paper, especially row 10 of Table 2.
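BLEU, the metric reported above, scores a candidate translation by its clipped n-gram overlap with reference translations. The sketch below illustrates modified n-gram precision for a single sentence pair; it is an illustration only, as reported scores are corpus-level BLEU computed with a standard toolkit:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a tokenized candidate against one reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram is credited at most as often as it appears in the reference
    clipped = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
p1 = ngram_precision(candidate, reference, 1)  # unigram precision: 5/6
p2 = ngram_precision(candidate, reference, 2)  # bigram precision: 3/5
```

Full BLEU combines the precisions for n = 1..4 as a geometric mean and applies a brevity penalty for candidates shorter than the reference.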
### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2104-06678,
  author    = {Changhan Wang and
               Anne Wu and
               Juan Miguel Pino and
               Alexei Baevski and
               Michael Auli and
               Alexis Conneau},
  title     = {Large-Scale Self- and Semi-Supervised Learning for Speech Translation},
  journal   = {CoRR},
  volume    = {abs/2104.06678},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.06678},
  archivePrefix = {arXiv},
  eprint    = {2104.06678},
  timestamp = {Thu, 12 Aug 2021 15:37:06 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2104-06678.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
### Information Table

| Property | Details |
|----------|---------|
| Model Type | Speech to Text Transformer for end-to-end Speech Translation |
| Training Data | covost2, librispeech_asr |
| Tags | audio, speech-translation, automatic-speech-recognition, speech2text2 |
| Pipeline Tag | automatic-speech-recognition |