Wav2vec2 Large English

Developed by jonatasgrosman

An automatic speech recognition model for English, fine-tuned from facebook/wav2vec2-large on the Common Voice 6.1 dataset

Speech Recognition

Transformers

Language: English

License: Apache-2.0 (open source)

Tags: English speech recognition, low word error rate, universal speech adaptation


Release Time: 3/2/2022

Model Overview

A large wav2vec2 model fine-tuned for English speech recognition. It expects speech input sampled at 16kHz.

Model Features

High-performance English recognition

Achieves 21.53% WER and 9.66% CER on the Common Voice English test set

Based on large pre-trained model

Fine-tuned from the facebook/wav2vec2-large model, which supplies the pre-trained speech feature extractor

16kHz sampling rate

Expects speech input sampled at 16kHz; audio recorded at another rate should be resampled before inference

Model Capabilities

English speech recognition

Audio to text

Automatic speech transcription

Use Cases

Speech transcription

Automatic meeting minutes transcription

Automatically convert English meeting recordings into text transcripts

Roughly 78% of words transcribed correctly on the Common Voice English test set (estimated as 100% minus the reported 21.53% WER)

Podcast content transcription

Automatically convert English podcast episodes into text content

Voice assistants

English voice command recognition

Recognizing English voice commands on smart devices

language: en
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Wav2Vec2 English by Jonatas Grosman
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice en
      type: common_voice
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 21.53
    - name: Test CER
      type: cer
      value: 9.66

Fine-tuned wav2vec2 large model for speech recognition in English

Fine-tuned facebook/wav2vec2-large on English using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz.
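
A quick way to check the 16kHz requirement before running inference is to read the sample rate from the file header. The sketch below uses only Python's standard `wave` module, so it covers uncompressed WAV files only. The helper names (`wav_sample_rate`, `needs_resampling`, `write_silence`) are illustrative assumptions and are not part of this model or of any library. The inference scripts in this card already resample on load through `librosa.load(..., sr=16_000)`, so a check like this mainly matters when audio is loaded some other way.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sample rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sample rate stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate


def write_silence(path, sample_rate, seconds=1):
    """Hypothetical helper: write mono 16-bit silence so the demo is self-contained."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * sample_rate * seconds)


with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, "ok.wav")
    bad_path = os.path.join(tmp_dir, "bad.wav")
    write_silence(ok_path, 16_000)
    write_silence(bad_path, 44_100)
    results = {
        "ok.wav": needs_resampling(ok_path),
        "bad.wav": needs_resampling(bad_path),
    }

print(results)
```

Files flagged by `needs_resampling` should be converted to 16kHz before they are passed to the processor.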

This model has been fine-tuned thanks to the GPU credits generously given by OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

Usage

The model can be used directly (without a language model) as follows...

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Reference	Prediction
"SHE'LL BE ALL RIGHT."	SHELL BE ALL RIGHT
SIX	SIX
"ALL'S WELL THAT ENDS WELL."	ALLAS WELL THAT ENDS WELL
DO YOU MEAN IT?	W MEAN IT
THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS.	THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESTION
HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE?	HOW IS MOSILLA GOING TO BANDL AND BE WHIT IS LIKE QU AND QU
"I GUESS YOU MUST THINK I'M KINDA BATTY."	RUSTION AS HAME AK AN THE POT
NO ONE NEAR THE REMOTE MACHINE YOU COULD RING?	NO ONE NEAR THE REMOTE MACHINE YOU COULD RING
SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER.	SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER
GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD.	GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD
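
The `torch.argmax` plus `processor.batch_decode` step in the script above amounts to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below reproduces that idea in plain Python. The toy vocabulary, the blank id of 0, and the function name `greedy_ctc_decode` are assumptions made for illustration; the real token ids come from the model's own tokenizer.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text (greedy CTC decoding)."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (assumption): id 0 is the CTC blank, "|" marks word boundaries.
toy_vocab = {0: "<pad>", 1: "|", 2: "S", 3: "I", 4: "X"}

# One id per audio frame, as produced by argmax over the logits.
frames = [0, 2, 2, 0, 3, 3, 3, 0, 4, 0, 1]
print(greedy_ctc_decode(frames, toy_vocab))
```

Because no language model rescoring is involved, each frame is decided independently, which is one reason for near-miss spellings such as REGRESTION in the table above.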

Evaluation

The model can be evaluated as follows on the English (en) test data of Common Voice.

import torch
import re
import warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We transcribe each batch and store the decoded predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
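
The reference normalization done in `speech_file_to_array_fn` (strip ignored characters, then uppercase) can be tried in isolation. The sketch below uses a shortened ignore list as an assumption to keep the example small; the evaluation script above uses the full `CHARS_TO_IGNORE` list. The function name `normalize` is illustrative.

```python
import re

# Shortened list for illustration only (assumption); the full list is in the script above.
CHARS_TO_IGNORE = [",", "?", ".", "!", ";", ":", '"']

# Build a character class that matches any single ignored character.
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"


def normalize(sentence):
    """Remove ignored punctuation and uppercase, mirroring the reference preprocessing."""
    return re.sub(chars_to_ignore_regex, "", sentence).upper()


print(normalize('"She\'ll be all right."'))
print(normalize("Do you mean it?"))
```

The straight apostrophe is not in the ignore list, so contractions such as SHE'LL keep it in the reference text.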

Test Result:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-06-17). Note that the table below may show results that differ from those already reported elsewhere; this may be due to specifics of the other evaluation scripts used.

Model	WER	CER
jonatasgrosman/wav2vec2-large-xlsr-53-english	18.98%	8.29%
jonatasgrosman/wav2vec2-large-english	21.53%	9.66%
facebook/wav2vec2-large-960h-lv60-self	22.03%	10.39%
facebook/wav2vec2-large-960h-lv60	23.97%	11.14%
boris/xlsr-en-punctuation	29.10%	10.75%
facebook/wav2vec2-large-960h	32.79%	16.03%
facebook/wav2vec2-base-960h	39.86%	19.89%
facebook/wav2vec2-base-100h	51.06%	25.06%
elgeish/wav2vec2-large-lv60-timit-asr	59.96%	34.28%
facebook/wav2vec2-base-10k-voxpopuli-ft-en	66.41%	36.76%
elgeish/wav2vec2-base-timit-asr	68.78%	36.81%
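
For readers unfamiliar with the two metrics: both are edit-distance ratios, counted over words for WER and over characters for CER. The sketch below is a minimal per-sentence version written for illustration. It is not the `wer.py` or `cer.py` used for the table, which may aggregate errors across sentences differently, and it assumes a non-empty reference. The function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (lists or strings)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1      # reference item missing from hypothesis
            insertion = current[j - 1] + 1  # extra item in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Sample pair taken from the prediction table earlier in this card.
reference = "SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER"
hypothesis = "SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER"

print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
print(f"CER: {char_error_rate(reference, hypothesis):.2%}")
```

In this sample two of the nine reference words are wrong, so the sentence-level WER is about 22%, while the CER is lower because most characters in the misrecognized words are still correct.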

Citation

If you want to cite this model you can use this:

@misc{grosman2021wav2vec2-large-english,
  title={Fine-tuned wav2vec2 large model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-english}},
  year={2021}
}
