Assignment 1 - Joane Open-source Speech-to-Text Converter - Achieve Automatic Speech Recognition for Free

Assignment1 Joane

Developed by Classroom-workshop

A speech-to-text (S2T) model for automatic speech recognition (ASR)

Speech Recognition

Transformers

English | Open Source License: MIT | #End-to-end speech recognition #High-precision WER #English speech transcription

Downloads 22

Release Time : 6/2/2022

Model Overview

This model is an end-to-end sequence-to-sequence transformer trained with standard autoregressive cross-entropy loss and generates transcriptions autoregressively.

Model Features

End-to-end model

Generates text directly from speech features without intermediate processing steps

High accuracy

Achieves a word error rate (WER) of 4.3 on the LibriSpeech "clean" test set and 9.0 on the "other" test set

Autoregressive generation

Generates transcriptions one token at a time, with each token conditioned on the tokens generated before it

Model Capabilities

English speech recognition

End-to-end speech-to-text

Real-time speech transcription

Use Cases

Speech transcription

Meeting minutes

Automatically convert meeting recordings into text transcripts

Highly accurate transcripts

Voice notes

Convert voice memos into searchable text

Easily retrievable and organized text content

Assistive technology

Hearing assistance

Provide real-time captions for the hearing impaired

Improved accessibility

🚀 S2T-SMALL-LIBRISPEECH-ASR

s2t-small-librispeech-asr is a Speech to Text Transformer (S2T) model designed for automatic speech recognition (ASR). It offers an efficient solution for converting speech into text, leveraging the power of transformers.

🚀 Quick Start

How to use

As this is a standard sequence-to-sequence transformer model, you can use the generate method to produce transcripts by passing the speech features to the model.

Note: The Speech2TextProcessor feature extractor uses torchaudio to extract the filter bank features, and the tokenizer depends on sentencepiece, so install both packages before running the examples.

You can either install them as extra speech dependencies with pip install "transformers[speech,sentencepiece]" or install the packages separately with pip install torchaudio sentencepiece.
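Before running the examples, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the original model card; the helper name missing_packages is hypothetical, and it only checks that the packages can be found, not that their versions are compatible.

```python
import importlib.util

# Packages named in the notes above; "transformers" is included because
# the examples import from it.
REQUIRED = ("transformers", "torchaudio", "sentencepiece")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All speech dependencies are installed.")
```

If anything is reported as missing, install it with one of the pip commands above.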

import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

ds = load_dataset(
    "patrickvonplaten/librispeech_asr_dummy",
    "clean",
    split="validation"
)

input_features = processor(
    ds[0]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt"
).input_features  # Batch size 1
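# Speech2Text takes its inputs under the name input_features, not input_ids
generated_ids = model.generate(input_features=input_features)

# Decode the token ids to text, dropping special tokens
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)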

Evaluation on LibriSpeech Test

The following script shows how to evaluate this model on the LibriSpeech "clean" and "other" test sets.

from datasets import load_dataset
from evaluate import load  # load_metric is no longer available in recent datasets releases
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for other test dataset
wer = load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr").to("cuda")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr", do_upper_case=True)

def map_to_pred(batch):
    # With batched=True, batch["audio"] is a list of dicts, one per example
    arrays = [audio["array"] for audio in batch["audio"]]
    features = processor(arrays, sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["audio"])

# compute() returns WER as a fraction; multiply by 100 to compare with the reported results
print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))

Result (WER):

"clean"	"other"
4.3	9.0
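For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is an illustrative pure-Python version for a single utterance, not the implementation used by the evaluation script; the function name word_error_rate is hypothetical. It returns a fraction, whereas reported figures such as 4.3 are conventionally percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that a corpus-level score is usually computed as total errors over total reference words across all utterances, which is not the same as averaging per-utterance scores.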

✨ Features

End-to-End ASR: The S2T model is an end-to-end sequence-to-sequence transformer model, trained with standard autoregressive cross-entropy loss and generating transcripts autoregressively.
Trained on LibriSpeech: It is trained on the LibriSpeech ASR Corpus, a dataset with approximately 1000 hours of 16kHz read English speech.

📚 Documentation

Model description

S2T is an end-to-end sequence-to-sequence transformer model. It is trained with standard autoregressive cross-entropy loss and generates the transcripts autoregressively.
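To make "generates the transcripts autoregressively" concrete, here is a toy sketch of greedy autoregressive decoding. Everything in it is an assumption for illustration: toy_scores stands in for the real decoder, the EOS string is hypothetical, and the actual generate method may use a different search strategy such as beam search.

```python
from typing import Callable, Dict, List, Sequence

EOS = "</s>"  # hypothetical end-of-sequence token for this toy example


def greedy_decode(
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    features: Sequence[float],
    max_len: int = 20,
) -> List[str]:
    """Pick the highest-scoring token at each step, feeding the prefix back in."""
    tokens: List[str] = []
    for _ in range(max_len):
        # The scores depend on the input features AND on all tokens so far.
        scores = next_token_scores(features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


def toy_scores(features: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a decoder: emits a fixed phrase, then the EOS token."""
    target = ["hello", "world"]
    nxt = target[len(prefix)] if len(prefix) < len(target) else EOS
    return {tok: (1.0 if tok == nxt else 0.0) for tok in target + [EOS]}
```

Calling greedy_decode(toy_scores, [0.1, 0.2]) walks through the loop twice before the stand-in decoder emits the end-of-sequence token.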

Intended uses & limitations

This model can be used for end-to-end speech recognition (ASR). See the model hub to look for other S2T checkpoints.

🔧 Technical Details

Training data

S2T-SMALL-LIBRISPEECH-ASR is trained on the LibriSpeech ASR Corpus, a dataset consisting of approximately 1000 hours of 16 kHz read English speech.

Training procedure

Preprocessing

The speech data is pre-processed by automatically extracting Kaldi-compliant 80-channel log mel filter bank features from WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization) is then applied to each example.
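As an illustration of the normalization step, utterance-level CMVN shifts and scales each feature channel so that, over the frames of a single utterance, it has zero mean and unit variance. The sketch below is a standard-library version under stated assumptions: the function name utterance_cmvn is hypothetical, the use of the population standard deviation and the placement of eps are choices made here rather than details confirmed by the source, and a real pipeline would typically use array operations on 80-channel features.

```python
from statistics import fmean, pstdev
from typing import List


def utterance_cmvn(frames: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize each channel to zero mean and unit variance over one utterance.

    `frames` has shape [num_frames][num_channels].
    """
    if not frames:
        return []
    # Transpose so that each entry holds one channel's values over time.
    channels = list(zip(*frames))
    means = [fmean(channel) for channel in channels]
    stds = [pstdev(channel) for channel in channels]
    # eps guards against division by zero for constant channels.
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]
```

Because the statistics are computed per utterance, no global statistics from the training set are needed at inference time.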

The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 10,000.

Training

The model is trained with standard autoregressive cross-entropy loss and SpecAugment. The encoder receives speech features, and the decoder generates the transcripts autoregressively.
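For reference, the autoregressive cross-entropy objective is commonly written as below. The notation is an assumption introduced here, not taken from the source: X denotes the speech features passed to the encoder, y_1, ..., y_T the target tokens, y_{<t} the tokens before position t, and θ the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t}, X\right)
```

Minimizing this loss trains the decoder to predict each token given the speech features and the preceding ground-truth tokens.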

BibTeX entry and citation info

@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

📄 License

This model is released under the MIT license.

📦 Additional Information

Property	Details
Model Type	Speech to Text Transformer (S2T)
Training Data	LibriSpeech ASR Corpus
Task	Automatic Speech Recognition
Metrics	Test WER (clean: 4.3, other: 9.0)
Tags	speech, audio, automatic-speech-recognition, hf-asr-leaderboard
Pipeline Tag	automatic-speech-recognition

Model Index

Name: s2t-small-librispeech-asr
- Results:
  - Task: Automatic Speech Recognition
    - Dataset: LibriSpeech (clean)
    - Metrics: Test WER = 4.3
  - Task: Automatic Speech Recognition
    - Dataset: LibriSpeech (other)
    - Metrics: Test WER = 9.0
