s2t-small-covost2-en-fa-st Open-source Model - Supports English to Persian Speech Translation Tasks

Developed by facebook

A Transformer-based end-to-end speech translation model for English-to-Persian speech translation tasks

Open Source License: MIT

Tags: #English-Persian speech translation #End-to-end speech processing #Transformer-based

Downloads 49

Release Time : 3/2/2022

Model Overview

This model is a sequence-to-sequence speech-to-text (S2T) converter specifically designed for English speech to Persian text translation tasks. It uses a convolutional downsampler to process speech input and employs a Transformer architecture for translation.

Model Features

End-to-end speech translation

Directly generates Persian text output from English speech input without intermediate transcription steps

Convolutional downsampler

Uses convolutional layers to reduce the length of speech input before feeding it to the encoder, improving processing efficiency

Transformer-based architecture

Adopts standard Transformer encoder-decoder structure with excellent sequence modeling capabilities

Single language pair

Supports one translation direction: English speech to Persian (Farsi) text

Model Capabilities

Speech translation

English speech understanding (the encoder is pre-trained for English ASR)

Persian text generation

Use Cases

Speech translation applications

Real-time speech translation

Translates English speech into Persian text in real time

Achieves a BLEU score of 11.43 on the CoVoST2 en-fa test set

Meeting transcript translation

Automatically translates English meeting recordings into Persian text transcripts

🚀 S2T-SMALL-COVOST2-EN-FA-ST

This is a Speech to Text Transformer (S2T) model trained for end-to-end Speech Translation (ST), translating English speech into Farsi text.

🚀 Quick Start

The s2t-small-covost2-en-fa-st model can be used for end-to-end English speech to Farsi text translation. Because it is a standard sequence-to-sequence transformer model, you can call the generate method with the extracted speech features to produce translations.

⚠️ Important Note

The Speech2TextProcessor object uses torchaudio to extract the filter bank features, so make sure the torchaudio package is installed before running this example. You can either install the dependencies as extras with pip install "transformers[speech,sentencepiece]" or install the packages separately with pip install torchaudio sentencepiece. The example also imports the datasets and soundfile packages.

from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
import soundfile as sf

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-covost2-en-fa-st")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-covost2-en-fa-st")

def map_to_array(batch):
    # Decode the audio file into a float waveform
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)
ds = ds.map(map_to_array)

# LibriSpeech audio is sampled at 16 kHz, the rate the feature extractor expects
inputs = processor(
    ds["speech"][0],
    sampling_rate=16_000,
    return_tensors="pt"
)
# Pass the filter bank features as the main model input
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])

# Decode the generated token ids into Farsi text
translation = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(translation)

✨ Features

End-to-end ST: Capable of directly translating English speech to Farsi text.
Transformer-based: Utilizes a transformer-based seq2seq (encoder-decoder) architecture.
Pre-trained encoder: The encoder is pre-trained for English ASR to accelerate training and improve performance.

📦 Installation

You can install the necessary packages as extra speech dependencies with:

pip install "transformers[speech,sentencepiece]"

Or install the packages separately:

pip install torchaudio sentencepiece

📚 Documentation

Model description

S2T is a transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a convolutional downsampler to reduce the length of speech inputs by three quarters (to one quarter of the original length) before they are fed into the encoder. The model is trained with standard autoregressive cross-entropy loss and generates the transcripts/translations autoregressively.
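The length reduction performed by the convolutional downsampler can be illustrated with a little arithmetic. This is a minimal sketch, not the model's actual code: it assumes two 1-D convolution layers, each with kernel size 5, stride 2 and padding 2, and the function name subsampled_length is hypothetical.

```python
def subsampled_length(num_frames: int, num_conv_layers: int = 2) -> int:
    """Number of encoder positions left after the convolutional downsampler.

    Assumption: each layer is a 1-D convolution with kernel size 5, stride 2
    and padding 2, so one layer maps n frames to (n - 1) // 2 + 1 frames.
    """
    length = num_frames
    for _ in range(num_conv_layers):
        length = (length - 1) // 2 + 1
    return length


# About 10 s of audio at a 10 ms frame shift gives roughly 1000 feature frames.
print(subsampled_length(1000))  # 250
```

Under these assumptions the encoder attends over about a quarter as many positions as there are input frames, which is where the efficiency gain comes from.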

Intended uses & limitations

This model can be used for end-to-end English speech to Farsi text translation. See the model hub to look for other S2T checkpoints.

Training data

s2t-small-covost2-en-fa-st is trained on the English-Farsi subset of CoVoST2. CoVoST is a large-scale multilingual ST corpus based on Common Voice, created to foster ST research with the largest ever open dataset.

Training procedure

Preprocessing

The speech data is pre-processed by automatically extracting Kaldi-compliant 80-channel log mel-filter bank features from WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization) is then applied to each example. The texts are lowercased and tokenized using a character-based SentencePiece vocabulary.
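Utterance-level CMVN can be sketched in a few lines. The example below is an illustration in plain Python so that it runs without dependencies; in practice this would be done with NumPy or torch. The function name utterance_cmvn and the eps guard are assumptions made for the sketch, not part of the original pipeline.

```python
import math


def utterance_cmvn(frames, eps=1e-10):
    """Normalize each feature channel to zero mean and unit variance,
    using statistics computed from this utterance only.

    frames: list of T frames, each a list of C floats (C = 80 for this model).
    eps: hypothetical guard against division by zero for constant channels.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    normalized = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        column = [frame[c] for frame in frames]
        mean = sum(column) / num_frames
        variance = sum((x - mean) ** 2 for x in column) / num_frames
        std = math.sqrt(variance) + eps
        for t in range(num_frames):
            normalized[t][c] = (frames[t][c] - mean) / std
    return normalized


# Toy utterance: 4 frames with 2 channels instead of 80.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = utterance_cmvn(toy)
```

Because the statistics come from a single utterance, both toy channels end up identical after normalization even though their original scales differ by a factor of ten.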

Training

The model is trained with standard autoregressive cross-entropy loss and SpecAugment. The encoder receives speech features, and the decoder generates the transcripts autoregressively. To accelerate model training and improve performance, the encoder is pre-trained for English ASR.
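The masking part of SpecAugment can be sketched as follows. This is a simplified illustration, not the training code: it applies one frequency mask and one time mask, omits time warping, and the function name spec_augment and the default mask widths are hypothetical values, not the settings used to train this model.

```python
import random


def spec_augment(features, max_freq_width=27, max_time_width=100, seed=None):
    """Return a copy of `features` with one frequency band and one time span zeroed.

    features: list of T frames, each a list of F filter bank values.
    max_freq_width, max_time_width: illustrative upper bounds on mask sizes.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    num_bins = len(features[0])
    augmented = [row[:] for row in features]  # leave the input untouched

    # Frequency mask: zero a random band of channels in every frame.
    width = rng.randint(0, min(max_freq_width, num_bins))
    start = rng.randint(0, num_bins - width)
    for row in augmented:
        for j in range(start, start + width):
            row[j] = 0.0

    # Time mask: zero a random run of consecutive frames.
    width = rng.randint(0, min(max_time_width, num_frames))
    start = rng.randint(0, num_frames - width)
    for i in range(start, start + width):
        augmented[i] = [0.0] * num_bins

    return augmented


clean = [[1.0] * 80 for _ in range(200)]
augmented = spec_augment(clean, seed=0)
```

After CMVN a value of zero corresponds to the channel mean, which is why zero is a reasonable fill value in this sketch.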

Evaluation results

CoVoST2 test results for en-fa (BLEU score): 11.43

🔧 Technical Details

Model Architecture: Transformer-based seq2seq (encoder-decoder) model.
Pre-processing: Extracts Kaldi-compliant 80-channel log mel-filter bank features and applies utterance-level CMVN.
Training Loss: Standard autoregressive cross-entropy loss.
Augmentation: Uses SpecAugment.
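The training loss listed above can be written out explicitly. The sketch below is a minimal illustration under teacher forcing, where the decoder scores at step t are assumed to have been computed from the reference prefix; the function name autoregressive_cross_entropy and the toy inputs are hypothetical.

```python
import math


def autoregressive_cross_entropy(step_logits, target_ids):
    """Mean token-level cross-entropy under teacher forcing.

    step_logits[t]: unnormalized decoder scores over the vocabulary at step t,
                    assumed to be conditioned on the reference prefix target_ids[:t].
    target_ids[t]:  index of the reference token at step t.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the reference token.
        total += log_norm - logits[target]
    return total / len(target_ids)


# Toy example: vocabulary of 4 characters, 3 decoding steps, uniform scores.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = autoregressive_cross_entropy(uniform, [2, 0, 1])
```

With uniform scores the loss equals the natural log of the vocabulary size, a useful sanity check for an untrained model.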

📄 License

This model is released under the MIT license.

BibTeX entry and citation info

@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

Property	Details
Model Type	Speech to Text Transformer (S2T)
Training Data	English - Farsi subset of CoVoST2
