s2t-small-mustc-en-nl-st Open-Source Speech Translation Model - FREE English to Dutch Speech Translation

S2t Small Mustc En Nl St

Developed by facebook

An end-to-end speech translation model based on S2T architecture, specifically designed for English-to-Dutch speech translation tasks

Speech Recognition

Transformers

Supports Multiple Languages

Open Source License: MIT

#End-to-end speech translation #English-Dutch conversion #TED talk translation

Downloads 20

Release Time : 3/2/2022

Model Overview

This model uses the Transformer architecture to convert English speech directly into Dutch text, making it suitable for real-time speech translation scenarios

Model Features

End-to-end speech translation

Directly generates translated text from speech input without intermediate transcription steps

Efficient speech processing

Reduces speech input length by three quarters with a convolutional downsampler before encoding, improving processing efficiency

Multilingual support

This checkpoint covers English-to-Dutch translation only; other S2T checkpoints on the model hub cover other language pairs

Data augmentation

Uses SpecAugment during training to augment the speech data

Model Capabilities

Speech recognition

Speech translation

English-to-Dutch translation

Real-time speech processing

Use Cases

Real-time translation

Conference real-time translation

Real-time translation of English speeches into Dutch subtitles

Provides a smooth cross-language communication experience

Multimedia content translation

Translates English video/audio content into Dutch subtitles

Helps Dutch users understand English content

Assistive tools

Language learning assistance

Helps Dutch-speaking learners understand spoken English content

Improves language learning efficiency

🚀 S2T-SMALL-MUSTC-EN-NL-ST

s2t-small-mustc-en-nl-st is a Speech to Text Transformer (S2T) model trained for end-to-end Speech Translation (ST). It translates English speech into Dutch text.

🚀 Quick Start

Because this is a standard sequence-to-sequence Transformer model, you can pass the speech features to the generate method to produce the translation.

Note: The Speech2TextProcessor object uses torchaudio to extract the filter bank features. Make sure to install the torchaudio package before running this example.

You can install these either as extra speech dependencies with pip install "transformers[speech,sentencepiece]" or separately with pip install torchaudio sentencepiece.

import soundfile as sf
from datasets import load_dataset
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

# Load the pretrained English-to-Dutch checkpoint and its processor
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-mustc-en-nl-st")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-mustc-en-nl-st")

def map_to_array(batch):
    # Read the raw waveform from the audio file referenced by each example
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# Small English sample set used only to demonstrate the API
ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)
ds = ds.map(map_to_array)

# Extract filter bank features from the first utterance (16 kHz audio)
inputs = processor(
    ds["speech"][0],
    sampling_rate=16_000,
    return_tensors="pt"
)

# The speech features are the model's main input, so pass them positionally
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])

# Decode the generated token IDs into Dutch text
translation = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(translation)

✨ Features

Trained for end-to-end Speech Translation (ST) from English to Dutch.
Based on the S2T Transformer architecture, suitable for Automatic Speech Recognition (ASR) and Speech Translation (ST).
Uses a convolutional downsampler to reduce speech input length before encoding.
Trained with standard autoregressive cross-entropy loss.

📚 Documentation

Model description

S2T is a Transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a convolutional downsampler to reduce the length of speech inputs by three quarters before they are fed into the encoder. The model is trained with standard autoregressive cross-entropy loss and generates the transcripts/translations autoregressively.
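
The three-quarters reduction can be checked with a little arithmetic. The sketch below assumes the downsampler is a stack of two convolutions, each with stride 2 and padding chosen so that a layer maps L frames to floor((L - 1) / 2) + 1 frames; the text above only states the overall ratio, so the layer count and the helper name downsampled_length are illustrative assumptions.

```python
def downsampled_length(num_frames: int, num_conv_layers: int = 2) -> int:
    """Number of frames left after a stack of stride-2 convolutions.

    Assumption: each layer halves the sequence (rounding up), which gives
    roughly a 4x reduction for two layers.
    """
    for _ in range(num_conv_layers):
        num_frames = (num_frames - 1) // 2 + 1
    return num_frames


# Compare input and encoder sequence lengths for a few utterance sizes
for frames in (100, 584, 1000):
    out = downsampled_length(frames)
    print(f"{frames} feature frames -> {out} encoder positions")
```

Because self-attention cost grows with the square of the sequence length, a 4x shorter input makes the encoder considerably cheaper to run on long utterances.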

Intended uses & limitations

This model can be used for end-to-end English speech to Dutch text translation. See the model hub to look for other S2T checkpoints.

Training data

s2t-small-mustc-en-nl-st is trained on the English-Dutch subset of MuST-C. MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.

Training procedure

Preprocessing

The speech data is pre-processed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization) is then applied to each example.
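
As a rough illustration of the utterance-level CMVN step, the sketch below normalizes each feature channel to zero mean and unit variance over a single utterance. It uses only the Python standard library and a toy 3-channel input; the function name utterance_cmvn and the guard for constant channels are assumptions made for this example, and a real pipeline would operate on 80-channel arrays produced by the feature extractor.

```python
from statistics import fmean, pstdev


def utterance_cmvn(features):
    """Normalize each channel to zero mean and unit variance over one utterance.

    `features` is a list of frames; each frame is a list of per-channel values.
    """
    channels = list(zip(*features))  # transpose: one sequence per channel
    means = [fmean(ch) for ch in channels]
    # Fall back to 1.0 for constant channels to avoid dividing by zero
    stds = [pstdev(ch) or 1.0 for ch in channels]
    return [
        [(value - m) / s for value, m, s in zip(frame, means, stds)]
        for frame in features
    ]


# Toy utterance: 4 frames x 3 channels
toy = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0], [4.0, 40.0, 5.0]]
normalized = utterance_cmvn(toy)
```

Statistics are computed per utterance rather than over the whole corpus, so each example is normalized independently of the others.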

The texts are lowercased and tokenized with SentencePiece, using a vocabulary size of 8,000.

Training

The model is trained with standard autoregressive cross-entropy loss and using SpecAugment. The encoder receives speech features, and the decoder generates the transcripts autoregressively. To accelerate model training and for better performance, the encoder is pre-trained for English ASR.
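
For readers unfamiliar with SpecAugment, the sketch below shows its two masking operations (frequency masking and time masking) on a toy feature matrix. It omits the time-warping step of the full technique, and the mask widths, the zero fill value, and the function name spec_augment are illustrative assumptions, not the settings used to train this model.

```python
import random


def spec_augment(features, max_freq_mask=4, max_time_mask=6, rng=None):
    """Return a copy with one random frequency band and one time span zeroed.

    `features` is a list of frames; each frame is a list of channel values.
    Mask widths are hypothetical defaults chosen for illustration.
    """
    rng = rng or random.Random()
    num_frames, num_channels = len(features), len(features[0])
    out = [list(frame) for frame in features]  # leave the input untouched

    # Frequency mask: zero a contiguous block of channels in every frame
    f_width = rng.randint(0, min(max_freq_mask, num_channels))
    f_start = rng.randint(0, num_channels - f_width)
    for frame in out:
        frame[f_start:f_start + f_width] = [0.0] * f_width

    # Time mask: zero a contiguous block of frames across all channels
    t_width = rng.randint(0, min(max_time_mask, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * num_channels
    return out


# Toy input: 20 frames x 8 channels, with a seeded generator for repeatability
toy = [[1.0] * 8 for _ in range(20)]
augmented = spec_augment(toy, rng=random.Random(0))
```

Masking is applied only during training; at inference time the features are passed to the encoder unchanged.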

Evaluation results

MuST-C test results for en-nl (BLEU score): 27.3

🔧 Technical Details

The S2T model was proposed in the paper "fairseq S2T: Fast Speech-to-Text Modeling with fairseq" (see the BibTeX entry below) and released in the fairseq repository.
The model uses a convolutional downsampler to reduce speech input length.
It is trained with autoregressive cross-entropy loss and SpecAugment.

📄 License

This project is licensed under the MIT license.

BibTeX entry and citation info

@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

Additional Information

Property	Details
Language	English, Dutch
Datasets	MuST-C
Tags	audio, speech-translation, automatic-speech-recognition
Pipeline Tag	automatic-speech-recognition
Widget Examples	Librispeech sample 1, Librispeech sample 2
