# speech-text Open-Source Automatic Speech Recognition Model - Free Deployment with Support for English Speech-to-Text

Speech Text

Developed by abidlabs

An automatic speech recognition model based on facebook/wav2vec2-large-xlsr-53 and fine-tuned on the English Common Voice dataset. It expects English speech input sampled at 16 kHz.

Speech Recognition | English | Open Source License: Apache-2.0 | #English Speech Recognition #Low Word Error Rate #XLSR Fine-tuning


Release Time: 3/7/2022

Model Overview

This is an English Automatic Speech Recognition (ASR) model, fine-tuned from the pretrained XLSR-53 checkpoint, that converts English speech to text.

Model Features

High-Performance English Speech Recognition

Achieves a Word Error Rate (WER) of 19.06% and a Character Error Rate (CER) of 7.69% on the Common Voice English test set.
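
As a rough illustration of what these two metrics measure, the sketch below computes WER and CER from a plain Levenshtein edit distance. It is a minimal sketch, not the evaluation code used for the reported numbers; the function names are hypothetical, and real evaluations usually normalize case and punctuation first. The example pair is one of the reference/prediction rows shown later on this page, with the trailing period removed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER"
prediction = "SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER"
print(f"WER: {wer(reference, prediction):.2%}")  # 2 of 9 words are wrong
print(f"CER: {cer(reference, prediction):.2%}")
```

Because a single wrong character makes the whole word count as an error, WER is typically much higher than CER for the same output.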

Language Model Enhancement Support

When combined with a language model, the Word Error Rate can be reduced to 14.81% and the Character Error Rate to 6.84%.
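
To put the language model gain in perspective, the relative error reduction can be worked out from the figures quoted on this page. This is plain arithmetic on those figures; the helper name is hypothetical.

```python
def relative_reduction(before, after):
    """Fraction of the original error removed."""
    return (before - after) / before


wer_gain = relative_reduction(19.06, 14.81)  # WER without vs. with a language model
cer_gain = relative_reduction(7.69, 6.84)    # CER without vs. with a language model

print(f"WER relative reduction: {wer_gain:.1%}")  # about 22.3%
print(f"CER relative reduction: {cer_gain:.1%}")  # about 11.1%
```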

16kHz Sampling Rate Support

Expects speech input sampled at 16 kHz.

Model Capabilities

English Speech Recognition

Speech-to-Text

Automatic Speech Transcription

Use Cases

Speech Transcription

Meeting Minutes Transcription

Automatically convert English meeting recordings into text transcripts

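Expected word-level accuracy is roughly 81-85%, derived from the WER of 14.81-19.06% on the Common Voice English test set; accuracy on real meeting audio may differ.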

Podcast Content Transcription

Automatically generate text transcripts for English podcasts

Voice Interface

Voice Assistant

Provide speech recognition capabilities for English voice assistants

🚀 Wav2Vec2-Large-XLSR-53-English

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on English using the Common Voice dataset. It is designed for automatic speech recognition tasks.

🚀 Quick Start

This is facebook/wav2vec2-large-xlsr-53 fine-tuned on English using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
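
One way to check the sampling rate before transcription is to read the WAV header with Python's standard `wave` module. This is a minimal sketch that only handles uncompressed PCM WAV files (not MP3); the helper names are hypothetical. If a file is not at 16 kHz, it needs to be resampled first, for example by loading it with `librosa.load(path, sr=16_000)`, as the inference script further down does.

```python
import wave

EXPECTED_RATE = 16_000  # the rate this model expects


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file must be resampled before being passed to the model."""
    return wav_sample_rate(path) != expected_rate
```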

This model was fine-tuned thanks to the GPU credits generously given by OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Features

Datasets: Utilizes common_voice and mozilla-foundation/common_voice_6_0 for training.
Metrics: Evaluated using wer (Word Error Rate) and cer (Character Error Rate).
Tags: Associated with audio, automatic speech recognition, and other relevant areas.

Property	Details
Model Type	XLSR Wav2Vec2 English by Jonatas Grosman
Training Data	common_voice, mozilla-foundation/common_voice_6_0

📦 Installation

No specific installation steps are provided in the original README. If you want to use the model with the HuggingSound library, you can install it via pip install huggingsound.

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

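# Run the model without tracking gradients (inference only)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token for each frame,
# then let the processor turn the token ids into text
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)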

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Here is a comparison of reference and prediction results:

Reference	Prediction
"SHE'LL BE ALL RIGHT."	SHE'LL BE ALL RIGHT
SIX	SIX
"ALL'S WELL THAT ENDS WELL."	ALL AS WELL THAT ENDS WELL
DO YOU MEAN IT?	DO YOU MEAN IT
THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS.	THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESSION
HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE?	HOW IS MOSLILLAR GOING TO HANDLE ANDBEWOOTH HIS LIKE Q AND Q
"I GUESS YOU MUST THINK I'M KINDA BATTY."	RUSTIAN WASTIN PAN ONTE BATTLY
NO ONE NEAR THE REMOTE MACHINE YOU COULD RING?	NO ONE NEAR THE REMOTE MACHINE YOU COULD RING
SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER.	SAUCE FOR THE GUICE IS SAUCE FOR THE GONDER
GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD.	GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD
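
The predictions above come from the `torch.argmax` plus `batch_decode` step of the inference script, which amounts to greedy CTC decoding: repeated frame-level tokens are collapsed, and blank tokens are dropped. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The toy vocabulary, the blank id, and the function name are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    # Step 1: merge runs of identical consecutive ids into one
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only separates repeated characters
    chars = [id_to_char[t] for t in collapsed if t != blank_id]
    # Step 3: turn the word delimiter token into spaces
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real model has its own vocabulary file
toy_vocab = {0: "<pad>", 1: "|", 2: "S", 3: "I", 4: "X", 5: "L", 6: "A"}

print(ctc_greedy_decode([0, 2, 2, 0, 3, 3, 3, 4, 0, 0], toy_vocab))  # SIX
print(ctc_greedy_decode([6, 6, 5, 0, 5, 5], toy_vocab))              # ALL
```

The second call shows why the blank token matters: without the blank between the two runs of `L`, the repeated letter would collapse into a single `L`.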

📚 Documentation

Evaluation

To evaluate on mozilla-foundation/common_voice_6_0 with the test split:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset mozilla-foundation/common_voice_6_0 --config en --split test

To evaluate on speech-recognition-community-v2/dev_data with the validation split:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-english --dataset speech-recognition-community-v2/dev_data --config en --split validation --chunk_length_s 5.0 --stride_length_s 1.0
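
The `--chunk_length_s` and `--stride_length_s` flags suggest that long recordings are split into overlapping chunks before inference. The `eval.py` script itself is not shown here, so the sketch below is only one plausible reading: it assumes each chunk is 5.0 s long, that the 1.0 s stride applies on both sides of a chunk, and that consecutive chunks therefore start 3.0 s apart. The function name is hypothetical.

```python
def chunk_spans(total_s, chunk_s=5.0, stride_s=1.0):
    """Return (start, end) times in seconds of overlapping chunks over an audio clip."""
    step = chunk_s - 2 * stride_s  # assumed: the stride overlaps on the left and right
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")
    spans = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break  # the last chunk reached the end of the audio
        start += step
    return spans


# A 12-second clip under the assumptions above
for start, end in chunk_spans(12.0):
    print(f"{start:4.1f}s - {end:4.1f}s")
```

The overlap gives the model some context on each side of a chunk, so words that straddle a chunk boundary are less likely to be cut in half.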

📄 License

This project is licensed under the Apache-2.0 license.

Citation

If you want to cite this model you can use this:

@misc{grosman2021wav2vec2-large-xlsr-53-english,
  title={XLSR Wav2Vec2 English by Jonatas Grosman},
  author={Grosman, Jonatas},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english}},
  year={2021}
}
