Wav2Vec2 Large 960h Lv60 Self with Wikipedia LM
Developed by gxbag
An automatic speech recognition (ASR) system based on Facebook's wav2vec2-large-960h-lv60-self model, augmented with a language model trained on Wikipedia text
Downloads: 15
Release Time: 4/20/2022
Model Overview
This model combines Facebook's wav2vec2 speech recognition architecture with a 5-gram language model trained on Wikipedia text, enhancing the accuracy of speech-to-text conversion.
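Decoding with the attached language model requires the pyctcdecode and kenlm packages alongside transformers. The sketch below shows a minimal way to load such a checkpoint and transcribe a short clip; the Hub id and audio file name are assumptions made for illustration, so check the actual repository id before running it.

```python
# Minimal inference sketch. The model id below is an assumption based on the
# developer and model name; replace it with the real Hub repository id.
import torch
import librosa
from transformers import Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC

model_id = "gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm"  # assumed id
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load audio and resample to the 16 kHz rate the acoustic model expects.
speech, _ = librosa.load("sample.wav", sr=16_000)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs CTC beam-search decoding with the KenLM language model.
transcription = processor.batch_decode(logits.numpy()).text[0]
print(transcription)
```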
Model Features
Enhanced language model
Uses a 5-gram KenLM language model trained on the full Wikipedia text, improving recognition accuracy
Large-scale training
Trained on 960 hours of speech data and over 8 million words of text data
Optimized processing
The Wikipedia training text was cleaned by removing non-article content such as references and external links
Efficient pruning
Singleton n-grams of order 3 and higher were pruned from the language model to keep it compact (see the build sketch after this list)
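For reference, the sketch below shows one way a 5-gram KenLM model with this kind of pruning could be built from cleaned Wikipedia text using KenLM's command-line tools. It is not the author's exact recipe; the file names are placeholders, and it assumes the lmplz and build_binary binaries are installed.

```python
# Rough sketch: build a pruned 5-gram KenLM model from a cleaned text dump.
import subprocess

with open("wiki_clean.txt", "rb") as text, open("wiki_5gram.arpa", "wb") as arpa:
    subprocess.run(
        [
            "lmplz",
            "-o", "5",                 # 5-gram model
            "--prune", "0", "0", "1",  # keep all 1/2-grams, drop singleton 3-grams and above
        ],
        stdin=text,
        stdout=arpa,
        check=True,
    )

# Convert the ARPA file to KenLM's compact binary format for faster loading.
subprocess.run(["build_binary", "wiki_5gram.arpa", "wiki_5gram.bin"], check=True)
```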
Model Capabilities
English speech recognition
Long audio processing (supports chunked inference; see the example after this list)
High-accuracy transcription
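For long recordings, the transformers ASR pipeline can split the audio into overlapping chunks and stitch the transcriptions back together. A minimal sketch, again assuming the same hypothetical Hub id and a placeholder file name:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="gxbag/wav2vec2-large-960h-lv60-self-with-wikipedia-lm",  # assumed id
)

# chunk_length_s splits the audio into 10-second windows; stride_length_s keeps
# overlapping context on each side so words at chunk boundaries are not cut off.
result = asr("long_recording.wav", chunk_length_s=10, stride_length_s=(4, 2))
print(result["text"])
```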
Use Cases
Speech transcription
Meeting minutes
Automatically convert meeting recordings into text transcripts
Improves meeting documentation efficiency and facilitates later retrieval
Podcast transcription
Convert podcast content into text versions
Facilitates content indexing and SEO optimization
Assistive technology
Real-time caption generation
Generate real-time captions for videos or live streams
Enhances content accessibility