s2t-large-librispeech-asr Open Source Model - Free to Use for Precise Automatic Speech Recognition

S2t Large Librispeech Asr

Developed by facebook

An end-to-end sequence-to-sequence transformer model for automatic speech recognition (ASR), trained on the LibriSpeech dataset

Speech Recognition

Transformers

English

Open Source License: MIT

#End-to-end speech recognition #High-precision WER #English speech-to-text

Downloads 422

Release Time : 3/2/2022

Model Overview

This model is a speech-to-text transformer (S2T) trained using standard autoregressive cross-entropy loss, capable of converting speech signals into corresponding text transcriptions

Model Features

End-to-end model

Directly generates text transcriptions from speech signals without intermediate processing steps

High performance

Achieves WER scores of 3.3 (clean) and 7.5 (other) on the LibriSpeech test set

Transformer-based architecture

Utilizes modern transformer architecture for sequence modeling

Model Capabilities

English speech recognition

Real-time speech-to-text

Long audio processing

Use Cases

Speech transcription

Meeting minutes

Automatically convert meeting recordings into text transcripts

Highly accurate transcription results

Podcast transcription

Convert English podcast content into text

Supports long audio processing

Assistive technology

Hearing assistance

Provide real-time captions for hearing-impaired individuals

Low-latency speech recognition

🚀 S2T-LARGE-LIBRISPEECH-ASR

s2t-large-librispeech-asr is a Speech to Text Transformer (S2T) model for automatic speech recognition (ASR). It converts English speech to text with a transformer encoder-decoder.

✨ Features

End-to-end Transformer Model: S2T is an end-to-end sequence-to-sequence transformer model, trained with standard autoregressive cross-entropy loss to generate transcripts autoregressively.
Versatile Use: Can be used for end-to-end speech recognition (ASR). You can explore other S2T checkpoints on the model hub.

📦 Installation

Note: The Speech2TextProcessor object uses torchaudio to extract the filter bank features. Make sure to install the torchaudio package before running this example.

You can either install these as extra speech dependencies with pip install "transformers[speech,sentencepiece]" or install the packages separately with pip install torchaudio sentencepiece.
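Before running the examples, it can help to confirm that the optional packages are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of transformers.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages the Speech2TextProcessor relies on, per the note above.
required = ["torchaudio", "sentencepiece"]
missing = missing_packages(required)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All speech dependencies are available.")
```

The check only looks up each package on the import path, so it does not trigger any heavy imports itself.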

💻 Usage Examples

Basic Usage

As this is a standard sequence-to-sequence transformer model, you can generate transcripts by passing the speech features to the model's generate method.

import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
import soundfile as sf

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-large-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-large-librispeech-asr")

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)
ds = ds.map(map_to_array)

input_features = processor(
    ds["speech"][0],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features  # Batch size 1
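# Pass the features positionally: Speech2Text's main input is `input_features`, not `input_ids`
generated_ids = model.generate(input_features)

# Drop special tokens such as the end-of-sequence marker from the decoded text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)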

Advanced Usage

The following script shows how to evaluate this model on the LibriSpeech "clean" and "other" test sets.

from datasets import load_dataset
import evaluate  # `load_metric` was removed from recent `datasets` releases; needs `pip install evaluate jiwer`
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
import soundfile as sf

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for the other test set
wer = evaluate.load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-large-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-large-librispeech-asr", do_upper_case=True)

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    # Pass the features positionally: Speech2Text's main input is `input_features`, not `input_ids`
    gen_tokens = model.generate(input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["speech"])

# The metric object is not callable; use its `compute` method
print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))

Result (WER):

"clean"	"other"
3.3	7.5
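For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and its reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition for a single utterance, not the implementation used by the evaluation script; the function name word_error_rate is hypothetical. It returns a fraction, whereas the table above presumably reports percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A corpus-level score would typically sum the edits over all utterances before dividing by the total number of reference words, rather than averaging per-utterance rates.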

📚 Documentation

Model description

S2T is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.

Intended uses & limitations

This model can be used for end-to-end speech recognition (ASR). See the model hub to look for other S2T checkpoints.

Training data

s2t-large-librispeech-asr is trained on the LibriSpeech ASR Corpus, a dataset consisting of approximately 1000 hours of 16 kHz read English speech.

Training procedure

Preprocessing

The speech data is pre-processed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from WAV/FLAC audio files via PyKaldi or torchaudio. Further, utterance-level CMVN (cepstral mean and variance normalization) is applied to each example.

The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.
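The utterance-level CMVN step mentioned above can be illustrated with a small sketch: each feature dimension is shifted to zero mean and scaled to unit variance using statistics computed over the frames of that one utterance. This is an assumption-laden illustration in plain Python, not the code used by the processor; the function name utterance_cmvn, the eps stabilizer, and the use of the population standard deviation are choices made for this example. Real pipelines would use array operations instead of nested lists.

```python
import math


def utterance_cmvn(features, eps=1e-8):
    """Normalize a (frames x dims) list of feature vectors per dimension.

    Statistics are computed over the frames of this single utterance only.
    """
    num_frames = len(features)
    if num_frames == 0:
        return []
    dim = len(features[0])

    # Per-dimension mean over time.
    means = [sum(frame[d] for frame in features) / num_frames for d in range(dim)]
    # Per-dimension (population) standard deviation over time.
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in features) / num_frames)
        for d in range(dim)
    ]
    # eps (an assumption of this sketch) avoids division by zero for constant dimensions.
    return [
        [(frame[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for frame in features
    ]


# Two frames with two feature dimensions each (toy values, not real filter banks).
normalized = utterance_cmvn([[1.0, 10.0], [3.0, 30.0]])
print(normalized)
```

In the real setting each frame would have 80 dimensions, one per mel filter bank channel.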

Training

The model is trained with standard autoregressive cross-entropy loss and using SpecAugment. The encoder receives speech features, and the decoder generates the transcripts autoregressively.
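As a rough illustration of the autoregressive cross-entropy objective, the sketch below averages the negative log-probability that the decoder assigns to each reference token, given one vector of logits per output position. It is a toy in plain Python under stated assumptions: the names log_softmax and autoregressive_cross_entropy are hypothetical, and the logits are supplied by hand rather than produced by a decoder conditioned on the previous reference tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def autoregressive_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target tokens.

    step_logits[t] is the decoder's score vector at position t and
    target_ids[t] is the reference token id at that position.
    """
    if not target_ids or len(step_logits) != len(target_ids):
        raise ValueError("need one non-empty logit vector per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


# A decoder that is uniformly unsure over a 4-token vocabulary scores log(4) per token.
loss = autoregressive_cross_entropy([[0.0, 0.0, 0.0, 0.0]] * 3, [2, 0, 1])
print(loss, math.log(4))
```

Raising the logit of the correct token at each step lowers the loss, which is what training optimizes.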

BibTeX entry and citation info

@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

📄 License

This model is licensed under the MIT license.

📊 Model Index

Property	Details
Model Type	s2t-large-librispeech-asr
Training Data	LibriSpeech ASR Corpus
Results	Task: Automatic Speech Recognition. LibriSpeech (clean): test WER = 3.3. LibriSpeech (other): test WER = 7.5.
