data2vec-audio-large-10m: an open-source large audio model for speech tasks

Data2vec Audio Large 10m

Developed by facebook

Data2Vec is a general self-supervised learning framework applicable to speech, vision, and language tasks. This large audio model is pre-trained and fine-tuned on 10 minutes of Librispeech data and expects speech audio sampled at 16 kHz.

Speech Recognition

Transformers

Language: English
Open Source License: Apache-2.0
Tags: #Self-supervised speech recognition #Multimodal unified framework #16kHz audio processing

Downloads 19

Release Time: 4/2/2022

Model Overview

Data2Vec-Audio-Large-10m is a self-supervised speech processing model used primarily for speech recognition. It is trained with data2vec, a framework that applies the same learning method to different data modalities and learns by predicting latent representations of the complete input from a masked view of it.

Model Features

Unified self-supervised learning framework

Uses the same learning approach for speech, natural language processing, and computer vision tasks, achieving cross-modal unified learning.

Context-aware latent representation prediction

Instead of predicting modality-specific local targets such as words, visual tokens, or speech units, the model predicts contextualized latent representations that carry information from the entire input.

High performance

Achieves state-of-the-art or competitive performance on major benchmarks for speech recognition, image classification, and natural language understanding.

Model Capabilities

Speech recognition

Audio feature extraction

Use Cases

Speech processing

Speech-to-text

Convert speech audio into text content

High-accuracy speech recognition results

🚀 Data2Vec-Audio-Large-10m

This is a large model that has been pre-trained and fine-tuned on 10 minutes of Librispeech with 16 kHz sampled speech audio. It is based on data2vec, a unified self-supervised learning approach across speech, NLP, and computer vision.

🚀 Quick Start

Because this large model was pre-trained and fine-tuned on 10 minutes of Librispeech with 16 kHz sampled speech audio, make sure your speech input is also sampled at 16 kHz.
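
If a recording was captured at a different rate, it has to be resampled before it reaches the model. The sketch below is a minimal, dependency-free illustration of that step; the helper name `resample_linear` is hypothetical, and plain linear interpolation is an assumption chosen for brevity, not what the model authors use. It applies no anti-aliasing filter, so a dedicated audio library is likely the better choice for real recordings.

```python
def resample_linear(samples, orig_sr, target_sr=16_000):
    """Naively resample a mono waveform by linear interpolation.

    Illustrative only (hypothetical helper): no anti-aliasing filter is
    applied, so quality suffers when downsampling real audio.
    """
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_sr == target_sr or len(samples) < 2:
        return list(samples)

    n_out = int(round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of a ramp recorded at 8 kHz becomes one second at 16 kHz.
ramp_8k = [i / 8000 for i in range(8000)]
ramp_16k = resample_linear(ramp_8k, orig_sr=8000)
print(len(ramp_16k))  # 16000
```

The resampled list can then be passed to the processor in place of the original waveform.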

✨ Features

Unified Framework: Based on Facebook's Data2Vec, it uses the same learning method for speech, NLP, or computer vision.
State-of-the-art Performance: Experiments on major benchmarks of speech recognition, image classification, and natural language understanding show new state-of-the-art or competitive performance.

📚 Documentation

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.

The original model can be found under https://github.com/pytorch/fairseq/tree/main/examples/data2vec.

Pre-Training Method

model image

For more information, please take a look at the official paper.
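
The idea stated in the abstract (a student sees a masked input and predicts the latent representations a teacher produces from the full input) can be sketched in a few lines. This is a toy illustration, not the fairseq implementation: the exponential-moving-average teacher update, the averaging of the top `k` teacher layers, the Smooth L1 loss, and all function names and default values are assumptions made for this sketch, and plain Python lists stand in for tensors.

```python
def ema_update(teacher, student, tau=0.999):
    """Move each teacher weight toward the student: t <- tau*t + (1-tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def average_top_layers(layer_outputs, k):
    """Average the last k layer outputs element-wise into one target vector."""
    top = layer_outputs[-k:]
    dim = len(top[0])
    return [sum(layer[d] for layer in top) / len(top) for d in range(dim)]


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance between two vectors (beta must be > 0)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff <= beta else diff - 0.5 * beta
    return total / len(pred)


def masked_regression_loss(student_preds, teacher_layers, mask, k=2, beta=1.0):
    """Average the loss over masked time steps only.

    student_preds[t]  : vector predicted by the student from the masked input
    teacher_layers[t] : per-layer teacher outputs computed from the full input
    mask[t]           : True where the student's input was masked
    """
    losses = []
    for pred, layers, is_masked in zip(student_preds, teacher_layers, mask):
        if not is_masked:
            continue
        target = average_top_layers(layers, k)
        losses.append(smooth_l1(pred, target, beta))
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: 2 time steps, 3 teacher layers, 2-dimensional representations.
teacher_layers = [
    [[0.0, 0.0], [1.0, 1.0], [3.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0], [4.0, 2.0]],
]
student_preds = [[2.5, 1.0], [0.0, 0.0]]
mask = [True, False]  # only the first time step was masked

loss = masked_regression_loss(student_preds, teacher_layers, mask)
print(loss)
```

Only the masked time step contributes to the loss, and its target is built from several teacher layers rather than from a local unit such as a word or a speech token, which is the contrast the abstract draws.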

Paper and Authors

Paper
Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, Data2VecAudioForCTC
from datasets import load_dataset
import torch

# load model and processor (the CTC head for audio is Data2VecAudioForCTC)
processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-large-10m")
model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-large-10m")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

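# convert the raw waveform into model inputs
audio = ds[0]["audio"]
input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits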

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
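
The last two lines above pick the most likely token for every audio frame and then turn that sequence of IDs into text. A rough, standalone picture of what greedy CTC decoding does with those IDs is sketched below. The function name, the toy vocabulary, the blank ID of 0, and the `|` word delimiter are assumptions for illustration only; the real mapping comes from the model's own tokenizer.

```python
def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated IDs, drop blanks, and join the remaining tokens."""
    collapsed = []
    previous = None
    for idx in ids:
        if idx != previous:  # keep only the first ID of each run
            collapsed.append(idx)
        previous = idx
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical vocabulary; ID 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}

# Per-frame argmax IDs: the blank between the two runs of 4 keeps both L's.
frame_ids = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 1, 2, 3, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # HELLO HI
```

Repeated frames collapse into one character, while a blank between two identical IDs preserves a genuine double letter.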

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Datasets	librispeech_asr
Tags	speech
