wav2vec2-base-100h: Open-Source English Speech Recognition Model Trained on 100 Hours of LibriSpeech Data

Wav2vec2 Base 100h

Developed by vuiseng9

Wav2Vec2 base version speech recognition model trained on 100 hours of LibriSpeech data

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#English Speech Recognition #Low Word Error Rate #LibriSpeech Adaptation

Downloads 26

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model based on the Wav2Vec2 architecture, trained on 100 hours of English speech data from the LibriSpeech dataset, suitable for English speech-to-text tasks.

Model Features

Efficient Speech Recognition

Achieves a word error rate (WER) of 6.1% on the LibriSpeech "clean" test set and 13.5% on the "other" test set

Lightweight Base Model

Compared to larger-scale models, this 100-hour trained base version is more suitable for resource-constrained environments

Strong Compatibility

Evaluation verified with transformers v4.15.0 and datasets v1.18.0

Model Capabilities

English Speech Recognition

Audio to Text Conversion

Batch Speech Processing

Use Cases

Speech Transcription

Meeting Minutes Transcription

Automatically convert English meeting recordings into text transcripts

Reference figure: 6.1% word error rate on the LibriSpeech "clean" test set

Educational Content Transcription

Convert English educational audio content into text

Reference figure: 13.5% word error rate on the LibriSpeech "other" test set

🚀 Wav2Vec2-Base-100h

This is a fork of facebook/wav2vec2-base-100h, which focuses on audio and automatic speech recognition tasks using the LibriSpeech ASR dataset.

✨ Features

Documents a reproducible evaluation with newer transformers and datasets versions.
Uses a batch size of 1 to reproduce the reported results.
Validated with transformers v4.15.0 and datasets v1.18.0.
The Python packages librosa and jiwer may need to be installed manually.

📦 Installation

You may need to install the following Python packages manually:

pip install librosa jiwer
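
Since the evaluation is reported as validated with specific package versions, it can help to check the local environment before running it. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `environment_report` are hypothetical and not part of the model card, and a version mismatch does not necessarily mean the evaluation will fail.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions the model card reports as validated.
VALIDATED = {"transformers": "4.15.0", "datasets": "1.18.0"}
# Extra packages the card says may need manual installation.
EXTRAS = ("librosa", "jiwer")


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def environment_report():
    """Map each package name to (installed version, whether it matches expectations)."""
    report = {}
    for name, expected in VALIDATED.items():
        found = installed_version(name)
        report[name] = (found, found == expected)
    for name in EXTRAS:
        found = installed_version(name)
        report[name] = (found, found is not None)
    return report


for name, (found, ok) in environment_report().items():
    print(f"{name}: {found or 'not installed'} ({'ok' if ok else 'check'})")
```

A "check" result for transformers or datasets only means the installed version differs from the validated one.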

💻 Usage Examples

Basic Usage

The following code snippet shows how to evaluate facebook/wav2vec2-base-100h on LibriSpeech's "clean" and "other" test data.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# Load the LibriSpeech "clean" test split; switch to the commented line for "other".
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
# librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")

def map_to_array(batch):
    # The "audio" column already holds the decoded waveform, so the file
    # no longer needs to be read with soundfile.
    batch["speech"] = batch["audio"]["array"]
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    # Convert the raw waveforms into padded model inputs.
    input_values = processor(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    # Greedy decoding: pick the most likely token per frame, then decode to text.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

# A batch size of 1 is used to reproduce the reported results.
result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))
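
The argmax followed by batch_decode in the snippet above amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates only that collapse rule in plain Python. The toy vocabulary, the blank id of 0, and the function names are assumptions made for illustration; the real processor uses the model's own vocabulary and special tokens.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only; id 0 plays the role of the CTC blank.
TOY_VOCAB = {0: "", 1: "h", 2: "e", 3: "l", 4: "o", 5: " "}
BLANK_ID = 0


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeated ids, then drop blank ids."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in collapsed if token_id != blank_id]


def ids_to_text(token_ids, vocab=TOY_VOCAB):
    """Join the characters of the collapsed ids into a string."""
    return "".join(vocab[token_id] for token_id in token_ids)


# Per-frame argmax ids; the blank between the two 3s keeps the double "l".
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0, 0]
decoded = ids_to_text(ctc_greedy_collapse(frames))
print(decoded)  # hello
```

The blank token is what allows a genuinely doubled letter to survive the merge step: without it, the two runs of "l" would collapse into one.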

Results

Test set	WER (%)
LibriSpeech "clean" test	6.1
LibriSpeech "other" test	13.5
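
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following pure-Python sketch shows that definition on made-up sentences. The function names and example sentences are hypothetical, and jiwer's own implementation may differ in details such as text normalization, so this is not a substitute for the evaluation script.

```python
def word_edit_distance(reference_words, hypothesis_words):
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    previous = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def simple_wer(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        reference_words = reference.split()
        total_errors += word_edit_distance(reference_words, hypothesis.split())
        total_words += len(reference_words)
    return total_errors / total_words


# Made-up sentences: one deleted word and one substituted word over 8 reference words.
references = ["the cat sat on the mat", "hello world"]
hypotheses = ["the cat sat on mat", "hello word"]
print(f"WER: {simple_wer(references, hypotheses):.2%}")  # WER: 25.00%
```

The evaluation script prints the value as a fraction, so a printed 0.061 corresponds to the 6.1 shown in the table.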

📄 License

This project is licensed under the Apache 2.0 license.
