# 🚀 S2T-SMALL-LIBRISPEECH-ASR

`s2t-small-librispeech-asr` is a Speech to Text Transformer (S2T) model for automatic speech recognition (ASR). It converts spoken language into text end to end using a Transformer encoder-decoder architecture.
## 🚀 Quick Start

`s2t-small-librispeech-asr` is trained for automatic speech recognition (ASR) on LibriSpeech. The S2T model was proposed in [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) and released in the [fairseq](https://github.com/facebookresearch/fairseq) repository.
## ✨ Features

- End-to-end ASR: S2T is an end-to-end sequence-to-sequence Transformer model. It is trained with standard autoregressive cross-entropy loss and generates transcripts autoregressively.
- Multiple dataset support: the model can be evaluated on different LibriSpeech subsets, such as "clean" and "other".
## 📦 Installation

To use this model, you need to install a few dependencies. You can either install them as extra speech dependencies with `pip install "transformers[speech,sentencepiece]"`, or install the packages separately with `pip install torchaudio sentencepiece`.
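If the installation succeeded, a quick sanity check (a minimal sketch; it simply imports the classes and downloads the checkpoint used throughout this card) is:

```python
# Sanity check: verify the speech extras are installed and the checkpoint loads.
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
print(type(model).__name__, "loaded OK")
```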
## 💻 Usage Examples

### Basic Usage
```python
from datasets import load_dataset
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

# Load the pretrained model and its processor (feature extractor + tokenizer).
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# A small dummy split of LibriSpeech, for demonstration purposes.
ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)

# Extract log-mel filter bank features from the raw 16 kHz waveform.
input_features = processor(
    ds[0]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features

# Generate token IDs autoregressively and decode them into text.
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
```
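The processor expects 16 kHz mono audio. If your audio lives in a local file at a different sampling rate, a sketch like the following resamples it first with torchaudio (the path `speech.wav` is a hypothetical example):

```python
import torchaudio
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# Hypothetical local file; any WAV/FLAC readable by torchaudio works.
waveform, sample_rate = torchaudio.load("speech.wav")
# Downmix to mono and resample to the 16 kHz the model was trained on.
waveform = waveform.mean(dim=0)
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(input_features=inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```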
### Advanced Usage

The following script shows how to evaluate this model on the LibriSpeech "clean" and "other" test sets.
```python
from datasets import load_dataset
from evaluate import load
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
wer = load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
# LibriSpeech reference transcripts are upper-case, so decode with do_upper_case=True.
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)

def map_to_pred(batch):
    # Extract features for one example and move them to the GPU.
    features = processor(batch["audio"]["array"], sampling_rate=16_000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")
    # Generate and decode the predicted transcript.
    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))
```
Result (WER):
## 📚 Documentation

### Intended uses & limitations

This model can be used for end-to-end speech recognition (ASR). See the model hub to look for other S2T checkpoints.
### Model Information

| Property | Details |
|----------|---------|
| Model Type | Speech to Text Transformer (S2T) for automatic speech recognition |
| Training Data | LibriSpeech ASR Corpus: approximately 1000 hours of 16 kHz read English speech |
### Training procedure

#### Preprocessing

The speech data is preprocessed by extracting Kaldi-compliant 80-channel log mel-filter bank features automatically from WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization) is then applied to each example. The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.
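As a rough illustration of this feature pipeline, here is a minimal sketch using torchaudio's Kaldi-compliance module (the exact fairseq preprocessing may differ in details such as dithering and windowing, and `speech.wav` is a hypothetical input):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load a waveform; torchaudio returns a tensor of shape [channels, time].
# This sketch assumes mono 16 kHz audio.
waveform, sample_rate = torchaudio.load("speech.wav")

# Kaldi-compliant 80-channel log mel-filter bank features, shape [frames, 80].
fbank = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sample_rate)

# Utterance-level CMVN: normalize each feature dimension over the utterance.
fbank = (fbank - fbank.mean(dim=0)) / (fbank.std(dim=0) + 1e-10)
```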
#### Training

The model is trained with standard autoregressive cross-entropy loss, using SpecAugment for data augmentation. The encoder receives speech features, and the decoder generates the transcripts autoregressively.
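In Transformers terms, this objective corresponds to passing `labels` to the model's forward pass, which computes the cross-entropy loss with teacher forcing. A minimal sketch with dummy data follows; a real training loop would add batching, an optimizer schedule, and SpecAugment:

```python
import torch
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# Dummy one-second 16 kHz waveform and reference transcript (illustrative only).
audio = torch.randn(16_000).numpy()
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer("a dummy transcript", return_tensors="pt").input_ids

# The forward pass computes the autoregressive cross-entropy loss internally.
loss = model(input_features=inputs.input_features, labels=labels).loss
loss.backward()
```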
### BibTeX entry and citation info

```bibtex
@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}
```
## 📄 License

This project is licensed under the MIT license.