🚀 S2T-MEDIUM-MUSTC-MULTILINGUAL-ST
`s2t-medium-mustc-multilingual-st` is a Speech to Text Transformer (S2T) model trained for end-to-end Multilingual Speech Translation (ST). It converts English speech into text in multiple target languages, leveraging the Transformer encoder-decoder architecture.
🚀 Quick Start
This model can be used for end-to-end translation of English speech into text in any of its supported target languages; the examples below translate into French and German. You can also explore other S2T checkpoints on the model hub.
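For a minimal start, load the checkpoint and its matching processor from the Hugging Face Hub (full translation examples follow under Usage Examples):

```python
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

# Download the multilingual ST checkpoint and its processor from the Hub
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
```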
✨ Features
- Multilingual Support: Translates English speech into eight target languages: German, Dutch, Spanish, French, Italian, Portuguese, Romanian, and Russian.
- End-to-End ST: Designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST).
- Transformer-based: Built on the transformer-based seq2seq (encoder-decoder) architecture.
📦 Installation
You can either install transformers with the extra speech dependencies via `pip install "transformers[speech,sentencepiece]"` or install the packages separately with `pip install torchaudio sentencepiece`.
⚠️ Important Note
The `Speech2TextProcessor` object uses torchaudio to extract the filter bank features. Make sure to install the `torchaudio` package before running the usage example.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
import soundfile as sf

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")

def map_to_array(batch):
    # Read the audio file into a float array and attach it to the example
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# Extract filter bank features (the audio is sampled at 16 kHz)
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")

# Translate into French by forcing the French language ID as the first generated token
generated_ids = model.generate(
    inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
)
translation_fr = processor.batch_decode(generated_ids, skip_special_tokens=True)

# Translate the same utterance into German
generated_ids = model.generate(
    inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    forced_bos_token_id=processor.tokenizer.lang_code_to_id["de"],
)
translation_de = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
Advanced Usage
As this is a standard sequence-to-sequence transformer model, you can use the `generate()` method to generate the transcripts by passing the speech features to the model. For multilingual speech translation models, `eos_token_id` is used as the `decoder_start_token_id` and the target language ID is forced as the first generated token; to do so, pass the `forced_bos_token_id` parameter to the `generate()` method.
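For example, the following sketch (reusing the `model`, `processor`, and `inputs` objects from Basic Usage) iterates over every target-language code the tokenizer exposes and forces each one as the first generated token:

```python
# `model`, `processor` and `inputs` come from the Basic Usage example above
for lang, lang_id in processor.tokenizer.lang_code_to_id.items():
    generated_ids = model.generate(
        inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        forced_bos_token_id=lang_id,  # force this target language as the first token
    )
    print(lang, processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```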
📚 Documentation
Model description
S2T is a transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a convolutional downsampler to reduce the length of speech inputs by 3/4 (i.e., to a quarter of their original length) before they are fed into the encoder. The model is trained with standard autoregressive cross-entropy loss and generates the transcripts/translations autoregressively.
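To illustrate the downsampling step (a sketch, not the checkpoint's actual implementation; the 1024-channel width is an assumed value), two stride-2 1D convolutions shrink the time axis by a factor of 4:

```python
import torch
import torch.nn as nn

# Illustrative only: each stride-2 convolution halves the time axis,
# so T input frames come out as roughly T/4 encoder inputs.
downsampler = nn.Sequential(
    nn.Conv1d(80, 1024, kernel_size=5, stride=2, padding=2),  # 80 mel bins in
    nn.GELU(),
    nn.Conv1d(1024, 1024, kernel_size=5, stride=2, padding=2),
    nn.GELU(),
)

features = torch.randn(1, 80, 400)  # (batch, mel bins, frames)
print(downsampler(features).shape)  # torch.Size([1, 1024, 100])
```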
Training data
s2t-medium-mustc-multilingual-st is trained on MuST-C, a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, automatically aligned at the sentence level with their manual transcriptions and translations.
Training procedure
Preprocessing
The speech data is pre-processed by automatically extracting Kaldi-compliant 80-channel log mel-filter bank features from WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization) is then applied to each example. The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.
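A sketch of the feature-extraction side of this pipeline using torchaudio (the file path is a placeholder, and the exact preprocessing used for training may differ in details):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load a 16 kHz mono recording (placeholder path)
waveform, sample_rate = torchaudio.load("example.wav")

# Kaldi-compliant 80-channel log mel-filter bank features
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)

# Utterance-level CMVN: zero mean and unit variance per feature dimension
fbank = (fbank - fbank.mean(dim=0)) / fbank.std(dim=0)
```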
Training
The model is trained with standard autoregressive cross-entropy loss and uses SpecAugment. The encoder receives speech features, and the decoder generates the transcripts autoregressively. To accelerate training and improve performance, the encoder is pre-trained for multilingual ASR. For multilingual models, the target language ID token is used as the target BOS.
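SpecAugment masks random blocks of the feature matrix during training. A minimal sketch with torchaudio's built-in transforms (the mask sizes here are illustrative, not the values used to train this model):

```python
import torch
import torchaudio.transforms as T

spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=27),  # zero out up to 27 consecutive mel bins
    T.TimeMasking(time_mask_param=100),      # zero out up to 100 consecutive frames
)

features = torch.randn(1, 80, 500)  # (batch, mel bins, frames)
augmented = spec_augment(features)
```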
Evaluation results
MuST-C test results (BLEU score):
| En-De | En-Nl | En-Es | En-Fr | En-It | En-Pt | En-Ro | En-Ru |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| 24.5  | 28.6  | 28.2  | 34.9  | 24.6  | 31.1  | 23.8  | 16.0  |
🔧 Technical Details
The model is based on the S2T architecture proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq (Wang et al., 2020; see the BibTeX entry below) and released in the fairseq repository. It uses a convolutional downsampler to reduce the length of speech inputs and is trained with autoregressive cross-entropy loss.
📄 License
This project is licensed under the MIT license.
BibTeX entry and citation info
```bibtex
@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}
```