data2vec-audio-large-960h: an open-source large audio model optimized for automatic speech recognition

Data2vec Audio Large 960h

Developed by facebook

Data2Vec is a general self-supervised learning framework applicable to speech, vision, and language tasks. This large audio model is pre-trained and fine-tuned on 960 hours of LibriSpeech data, specifically optimized for automatic speech recognition tasks.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#High-precision speech recognition #Self-supervised learning #Multimodal unified framework

Downloads 2,531

Release Time: 4/2/2022

Model Overview

A speech recognition model based on the Data2Vec framework. It is pre-trained with self-supervised learning and then fine-tuned on the LibriSpeech dataset to convert English speech to text.

Model Features

General self-supervised learning framework

Uses the unified data2vec framework to handle tasks across modalities by predicting latent representations of the full input rather than local, modality-specific targets

High-performance speech recognition

Achieves a word error rate (WER) of 1.89% on the LibriSpeech "clean" test set and 4.07% on the "other" test set

Large-scale training data

Trained on 960 hours of LibriSpeech audio data

Model Capabilities

English speech recognition

Audio-to-text conversion

16kHz sampling rate audio processing

Use Cases

Speech transcription

Meeting transcription

Automatically converts meeting recordings into text transcripts

Podcast content indexing

Creates searchable text indexes for podcast audio

Assistive technology

Hearing assistance

Provides real-time speech-to-text services for the hearing impaired
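For long recordings such as meetings or podcasts, one common approach is to split the audio into overlapping windows and transcribe each window separately. The sketch below only computes window boundaries in samples. The function name `chunk_bounds` and the default window and overlap lengths are illustrative assumptions, not values taken from the model card.

```python
def chunk_bounds(num_samples, chunk_seconds=20.0, overlap_seconds=2.0, rate=16_000):
    """Return (start, end) sample indices of overlapping windows.

    The 20 s window and 2 s overlap are hypothetical defaults chosen
    for illustration; tune them for your own recordings.
    """
    chunk = int(chunk_seconds * rate)
    step = chunk - int(overlap_seconds * rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# A 50-second recording at 16 kHz is covered by three windows.
for start, end in chunk_bounds(50 * 16_000):
    print(start, end)
```

Each window can then be passed to the model on its own; how the overlapping transcripts are merged is left open here.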

🚀 Data2Vec-Audio-Large-960h

A large model pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using this model, make sure that your speech input is also sampled at 16kHz.

Paper: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language (Facebook)
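A quick way to check the 16kHz requirement before transcription is to read the sampling rate from the file header. The sketch below uses only Python's standard `wave` module, so it applies to uncompressed WAV input only. The helper names are illustrative, and `make_silent_wav` exists only to build demo input.

```python
import io
import wave

EXPECTED_RATE = 16_000  # sampling rate stated by the model card


def wav_sample_rate(source) -> int:
    """Return the sampling rate (Hz) stored in a WAV file header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate: int = EXPECTED_RATE) -> bool:
    """True when the file's rate differs from what the model expects."""
    return wav_sample_rate(source) != expected_rate


def make_silent_wav(rate: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a tiny in-memory mono 16-bit WAV, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


print(needs_resampling(make_silent_wav(44_100)))  # a 44.1 kHz file must be resampled
print(needs_resampling(make_silent_wav(16_000)))  # a 16 kHz file can be used directly
```

Files that fail the check need to be resampled with an audio library of your choice before they are passed to the processor.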

Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

📚 Documentation

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
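The self-distillation setup in the abstract can be illustrated with three small building blocks. This is a simplified sketch based on the data2vec paper rather than on this model card: a teacher whose weights are an exponential moving average (EMA) of the student's, targets built by averaging the teacher's top layers, and a regression loss at masked time steps. The function names, the plain-list data layout, and the use of a plain L2 loss are assumptions made for illustration; the target normalization described in the paper is omitted.

```python
def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student: t <- tau*t + (1-tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def average_top_layers(layer_outputs, k):
    """Build targets by averaging the last k layer outputs per time step.

    layer_outputs[layer][time][dim] holds the teacher's hidden states
    for the full, unmasked input.
    """
    top = layer_outputs[-k:]
    steps, dims = len(top[0]), len(top[0][0])
    return [
        [sum(layer[t][d] for layer in top) / k for d in range(dims)]
        for t in range(steps)
    ]


def masked_l2_loss(predictions, targets, mask):
    """Mean squared error computed only at masked time steps."""
    errors = [
        (p - y) ** 2
        for pred, tgt, is_masked in zip(predictions, targets, mask) if is_masked
        for p, y in zip(pred, tgt)
    ]
    return sum(errors) / len(errors)


# Toy example: three layers, two time steps, one feature dimension.
teacher_layers = [[[0.0], [0.0]], [[1.0], [2.0]], [[3.0], [4.0]]]
targets = average_top_layers(teacher_layers, k=2)
student_predictions = [[1.0], [5.0]]
print(targets)
print(masked_l2_loss(student_predictions, targets, mask=[False, True]))
print(ema_update([0.0, 1.0], [1.0, 1.0], tau=0.5))
```

The student only sees the masked input, while the targets come from the teacher's view of the full input, which is what makes them contextualized.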

The original model can be found under https://github.com/pytorch/fairseq/tree/main/examples/data2vec.

🔧 Technical Details

Pre-Training method

[Figure: model image illustrating the pre-training method]

For more information, please take a look at the official paper.

💻 Usage Examples

Basic Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

 from transformers import Wav2Vec2Processor, Data2VecAudioForCTC
 from datasets import load_dataset
 import torch
 
 # load model and processor
 processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-large-960h")
 model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-large-960h")
     
 # load dummy dataset and read soundfiles
 ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
 
 # tokenize
 sample = ds[0]["audio"]
 input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values  # Batch size 1
 
 # retrieve logits (no gradients are needed for inference)
 with torch.no_grad():
     logits = model(input_values).logits
 
 # take argmax and decode
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
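The "take argmax and decode" step relies on CTC decoding: consecutive repeated ids are collapsed, blank ids are dropped, and the remaining ids are mapped to characters. The sketch below shows that rule in plain Python. The toy vocabulary, the blank id of 0, and the `|` word separator are assumptions for illustration; the real mapping ships with the processor.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_sep="|"):
    """Collapse repeated ids, drop blanks, and map ids to text."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous:  # keep only the first id of each run
            collapsed.append(token_id)
        previous = token_id
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_sep, " ").strip()


# Hypothetical toy vocabulary, not the model's actual one.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "T"}

# One id per audio frame, as produced by an argmax over the logits.
frames = [2, 2, 0, 3, 3, 1, 1, 0, 3, 0, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # prints "HI IT"
```

A blank between two identical ids keeps them as separate characters, which is how doubled letters survive the collapsing step.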

Advanced Usage

This code snippet shows how to evaluate facebook/data2vec-audio-large-960h on LibriSpeech's "clean" and "other" test data.

 from transformers import Wav2Vec2Processor, Data2VecAudioForCTC
 from datasets import load_dataset
 import torch
 from jiwer import wer
 
 # load model and processor; only the model is moved to the GPU
 processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-large-960h")
 model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-large-960h").to("cuda")
 
 # use the "other" config instead of "clean" to evaluate on the "other" test data
 librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
 
 def map_to_pred(batch):
     # with batched=True, batch["audio"] is a list of decoded audio dicts
     arrays = [audio["array"] for audio in batch["audio"]]
     input_values = processor(arrays, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
     with torch.no_grad():
         logits = model(input_values.to("cuda")).logits
 
     # move predictions back to the CPU before decoding
     predicted_ids = torch.argmax(logits, dim=-1).cpu()
     batch["transcription"] = processor.batch_decode(predicted_ids)
     return batch
 
 result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
 
 # jiwer returns a fraction; multiply by 100 for a percentage
 print("WER:", wer(result["text"], result["transcription"]))

Result (WER, in percent):

"clean"	"other"
1.89	4.07
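For readers who want to see what the WER figure measures, the sketch below computes it from scratch: the word-level edit distance (substitutions, deletions, insertions) summed over all utterances, divided by the total number of reference words. It is a minimal illustration with hypothetical function names, not the `jiwer` implementation, and it skips any text normalization.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words, using a single rolling row."""
    row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        diagonal, row[0] = row[0], i
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            diagonal, row[j] = row[j], min(
                row[j] + 1,       # deletion of a reference word
                row[j - 1] + 1,   # insertion of a hypothesis word
                diagonal + cost,  # substitution or match
            )
    return row[len(hyp_words)]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words, hyp_words = reference.split(), hypothesis.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


references = ["the cat sat on the mat"]
hypotheses = ["the cat sit on mat"]
# One substitution and one deletion over six reference words.
print(f"WER: {100 * word_error_rate(references, hypotheses):.2f}%")
```

Multiplying the fraction by 100 gives the percentage form used in the table above.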

📄 License

This project is licensed under the Apache-2.0 license.

📦 Metadata

Property	Details
Datasets	librispeech_asr
Tags	speech, hf-asr-leaderboard
Widget Examples	Librispeech sample 1; Librispeech sample 2
Model Index	Name: data2vec-audio-large-960h; Task: Automatic Speech Recognition; Results: LibriSpeech (clean) test WER 1.89, LibriSpeech (other) test WER 4.07
