Open-source data2vec-audio-large-100h Speech Model - Built on the Multi-domain Data2Vec Framework, Pre-trained and Fine-tuned on 100 Hours of Librispeech

Data2vec Audio Large 100h

Developed by facebook

Data2Vec is a general self-supervised learning framework applicable to speech, natural language processing, and computer vision tasks. This is the large audio variant, pre-trained and fine-tuned on 100 hours of Librispeech audio data.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Self-supervised speech recognition #Multimodal unified framework #16kHz audio adaptation

Downloads 46

Release Time: 4/2/2022

Model Overview

Data2Vec-Audio-Large-100h is a self-supervised learning-based speech recognition model capable of processing 16kHz sampled audio inputs and outputting corresponding text transcriptions.

Model Features

General self-supervised learning framework

The Data2Vec framework can handle speech, natural language processing, and computer vision tasks with the same learning approach, achieving unified cross-modal learning.

Self-distillation setup

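Using a standard Transformer architecture in a self-distillation setup, the model predicts latent representations of the full input from a masked view of that input. The targets are contextualized representations that carry information from the entire input, rather than modality-specific local targets such as words, visual tokens, or units of speech.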
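The self-distillation idea can be sketched with a toy example. This is a minimal illustration, not the fairseq implementation: `toy_encoder` is a hypothetical stand-in for the Transformer, the weights and frames are arbitrary numbers, and a plain squared error is assumed as the regression loss. The exponential-moving-average (EMA) teacher follows the description in the data2vec paper.

```python
def toy_encoder(weights, frames):
    """Hypothetical stand-in for a Transformer: mixes each frame with the mean of all frames."""
    w_local, w_context = weights
    context = sum(frames) / len(frames)
    return [w_local * x + w_context * context for x in frames]


def mask_input(frames, masked_positions, mask_value=0.0):
    """Replace the frames at the masked positions with a placeholder value."""
    return [mask_value if i in masked_positions else x for i, x in enumerate(frames)]


def masked_regression_loss(student_out, teacher_out, masked_positions):
    """Mean squared error, computed only at the masked positions."""
    errors = [(student_out[i] - teacher_out[i]) ** 2 for i in masked_positions]
    return sum(errors) / len(errors)


def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


frames = [1.0, 2.0, 3.0, 4.0]   # arbitrary toy input
masked = {1, 3}                 # positions hidden from the student
student = [0.5, 0.5]            # arbitrary toy weights
teacher = [0.75, 0.25]

targets = toy_encoder(teacher, frames)                     # teacher encodes the full input
preds = toy_encoder(student, mask_input(frames, masked))   # student encodes the masked view
loss = masked_regression_loss(preds, targets, masked)      # compared at masked positions only
teacher = ema_update(teacher, student, decay=0.9)          # teacher slowly tracks the student
```

In real training the student would take a gradient step on `loss` before each teacher update; that step is left out here to keep the sketch dependency-free.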

High performance

On major benchmarks for speech recognition, image classification, and natural language understanding, the method sets a new state of the art or performs competitively with predominant approaches.

Model Capabilities

Speech recognition

Audio transcription

Use Cases

Speech transcription

Audio file transcription

Transcribe 16kHz sampled speech audio files into text.

Highly accurate text output

🚀 Data2Vec-Audio-Large-100h

A large model pre-trained and fine-tuned on 100 hours of Librispeech audio sampled at 16kHz.

🚀 Quick Start

The model was trained on Librispeech audio sampled at 16kHz, so make sure your speech input is sampled at 16kHz as well.
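One way to check the sampling rate before transcription is to read it from the file header. The sketch below assumes the input is an uncompressed WAV file and uses only Python's standard `wave` module; the helper names `wav_sample_rate` and `is_model_ready` are illustrative and not part of any library.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_model_ready(path, expected_rate=EXPECTED_RATE):
    """True if the file is already sampled at the expected rate."""
    return wav_sample_rate(path) == expected_rate
```

A file that fails this check needs to be resampled to 16kHz with an audio library of your choice before it is passed to the processor.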

✨ Features

General Self-Supervised Learning: Based on Facebook's Data2Vec, it uses the same learning method for speech, NLP, or computer vision.
State-of-the-Art Performance: Experiments on major benchmarks of speech recognition, image classification, and natural language understanding show state-of-the-art or competitive performance.

📚 Documentation

Paper

Authors

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.

Pre-Training Method

[Figure: model image illustrating the pre-training method]

For more information, please take a look at the official paper.

💻 Usage Examples

Basic Usage

 from transformers import Wav2Vec2Processor, Data2VecAudioForCTC
 from datasets import load_dataset
 import torch
 
 # load model and processor (the CTC class for this checkpoint is Data2VecAudioForCTC)
 processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-large-100h")
 model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-large-100h")
 
 # load dummy dataset and read soundfiles
 ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
 
 # convert the raw 16kHz waveform into model inputs
 audio = ds[0]["audio"]["array"]
 input_values = processor(audio, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1
 
 # retrieve logits (no gradients are needed for inference)
 with torch.no_grad():
     logits = model(input_values).logits
 
 # take the most likely token per frame and decode to text
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
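The final "take argmax and decode" step is greedy CTC decoding: consecutive repeated ids are merged and the blank token is dropped before ids are mapped to characters. The sketch below shows roughly what that involves. It is an illustration only: `TOY_VOCAB`, the blank id of 0, and both helper functions are hypothetical, and the real vocabulary and special tokens come from the checkpoint's tokenizer.

```python
# Hypothetical toy vocabulary; the real one ships with the model checkpoint.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "I"}


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding)."""
    collapsed = []
    previous = None
    for token_id in ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


def ids_to_text(ids, vocab=TOY_VOCAB, blank_id=0, word_delimiter="|"):
    """Map collapsed ids to characters and turn word delimiters into spaces."""
    chars = [vocab[i] for i in ctc_greedy_collapse(ids, blank_id)]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Per-frame argmax ids: a blank between two identical ids keeps both letters.
frame_ids = [2, 2, 0, 3, 3, 3, 1, 2, 0, 2]
text = ids_to_text(frame_ids)
```

The blank token is what lets CTC emit a doubled letter: `2, 0, 2` yields two characters, while `2, 2` yields one.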

📄 License

This project is licensed under the Apache-2.0 license.

📦 Additional Information

Property	Details
Datasets	librispeech_asr
Tags	speech
Original Model	https://github.com/pytorch/fairseq/tree/main/examples/data2vec
