data2vec-audio-large Open-Source Speech Model - Free to Use for Speech Recognition and Other Tasks

Data2vec Audio Large

Developed by facebook

Data2Vec-Audio-Large is a large model pre-trained on 16kHz sampled speech audio using a self-supervised learning framework, suitable for tasks such as speech recognition.

Speech Recognition

Transformers

English | Open Source License: Apache-2.0 | #Self-supervised learning #Speech representation learning #Multimodal unified framework

Downloads 97

Release Time: 4/2/2022

Model Overview

This model is the audio version of Facebook's Data2Vec framework. It learns latent representations of speech through self-distillation and can serve as a foundation for tasks such as speech recognition.

Model Features

Unified self-supervised learning framework

Uses the Data2Vec framework, which applies the same learning method to speech, NLP, and computer vision.

Contextual latent representation prediction

Instead of predicting modality-specific local targets, the model predicts contextualized latent representations that contain information from the entire input.

16kHz audio support

Pretrained on speech audio sampled at 16kHz; input audio must use the same sampling rate.

Model Capabilities

Speech feature extraction

Self-supervised learning

Speech recognition foundation model

Use Cases

Speech processing

Speech recognition system

Used as a foundation model for building speech recognition systems

The data2vec paper reports a new state of the art, or performance competitive with predominant approaches, on major speech recognition benchmarks

Speech feature extraction

Extracts high-level feature representations of speech
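To make "high-level feature representations" concrete, the sketch below estimates how many feature vectors the model emits for a clip of a given length. It assumes the convolutional feature encoder uses the default kernel sizes (10, 3, 3, 3, 3, 2, 2) and strides (5, 2, 2, 2, 2, 2, 2) of Transformers' `Data2VecAudioConfig`, which match wav2vec 2.0; check the checkpoint's `config.json` before relying on it. `output_frames` is a hypothetical helper, not a library function.

```python
# Assumed conv feature-encoder layout (Transformers' Data2VecAudioConfig defaults);
# verify against the checkpoint's config.json before relying on these numbers.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLING_RATE = 16_000  # the model expects 16 kHz input


def output_frames(num_samples: int) -> int:
    """Hypothetical helper: feature vectors produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Each layer is an unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    for seconds in (1, 5, 10):
        n = seconds * SAMPLING_RATE
        print(f"{seconds:>2} s of audio -> {output_frames(n)} feature vectors")
```

Under these assumptions the encoder's total stride is 5 × 2⁶ = 320 samples, roughly one feature vector every 20 ms at 16kHz, so one second of audio yields 49 vectors and the shortest usable input is 400 samples (25 ms).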

🚀 Data2Vec-Audio-Large

A large model pretrained on 16kHz sampled speech audio, based on Facebook's Data2Vec framework.

This is a large model pretrained on speech audio sampled at 16kHz. When using this model, make sure your speech input is also sampled at 16kHz.
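If your recordings use a different sampling rate, resample them before passing them to the model. The sketch below is a deliberately crude NumPy linear-interpolation resampler, shown only to make the requirement concrete. It applies no anti-aliasing filter, so for real data a dedicated resampler (for example from torchaudio or librosa) is the safer choice. `resample_linear` is a hypothetical helper, not part of any library.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the model was pretrained on


def resample_linear(waveform: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Hypothetical helper: crude linear-interpolation resampling of a mono waveform.

    No anti-aliasing filter is applied, so downsampling can alias;
    prefer a proper resampler for real recordings.
    """
    if waveform.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    if orig_sr == target_sr:
        return waveform.astype(np.float32)
    duration = len(waveform) / orig_sr
    n_out = int(round(duration * target_sr))
    # Timestamps (in seconds) of the input samples and of the output samples
    t_in = np.arange(len(waveform)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, waveform).astype(np.float32)


if __name__ == "__main__":
    sr = 44_100
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone at 44.1 kHz
    resampled = resample_linear(tone, sr)
    print(resampled.shape, resampled.dtype)
```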

⚠️ Important Note

This model does not come with a tokenizer because it was pretrained on audio alone. To use it for speech recognition, you need to create a tokenizer and fine-tune the model on labeled data (audio paired with text transcriptions). Refer to this blog for a more detailed explanation of how to fine-tune the model.
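One common way to create such a tokenizer for CTC fine-tuning is a character-level vocabulary built from the training transcripts. The sketch below assumes the convention used in Hugging Face's wav2vec 2.0 fine-tuning material: the space character is replaced by `|` as a word delimiter, and `[UNK]` and `[PAD]` tokens are appended. The normalization rule (lowercase letters, apostrophes, and spaces only) and the function names are hypothetical; adapt them to your dataset.

```python
import json
import re

# Hypothetical normalization: keep only lowercase letters, apostrophes, and spaces.
DISALLOWED_CHARS = re.compile(r"[^a-z' ]")


def normalize(text: str) -> str:
    """Lowercase a transcript and strip characters outside the allowed set."""
    return DISALLOWED_CHARS.sub("", text.lower())


def build_vocab(transcripts):
    """Map every character seen in the normalized transcripts to a contiguous id."""
    chars = sorted({c for t in transcripts for c in normalize(t)})
    vocab = {c: i for i, c in enumerate(chars)}
    # Use "|" as the word delimiter instead of a literal space (keeps the space's id)
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Special tokens go at the end so the ids stay contiguous
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


if __name__ == "__main__":
    transcripts = ["Hello world", "It's OK!"]
    print(json.dumps(build_vocab(transcripts), indent=2))
```

The resulting dictionary can be saved as `vocab.json` and, assuming a recent version of Transformers, loaded with `Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")`.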

Property	Details
Model Type	Data2Vec-Audio-Large
Training Data	librispeech_asr
License	apache-2.0

✨ Features

General Framework: Based on Facebook's Data2Vec, it uses the same learning method for speech, NLP, or computer vision.
Contextualized Prediction: Instead of predicting modality-specific local targets, it predicts contextualized latent representations containing information from the entire input.

📚 Documentation

Paper

Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
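The self-distillation idea in the abstract can be sketched numerically. This is a toy NumPy illustration based on a reading of the paper, not the fairseq implementation: a teacher whose weights are an exponential moving average (EMA) of the student's produces targets by averaging the instance-normalized outputs of its top-K Transformer blocks, and the student regresses those targets only at masked time steps with a Smooth L1 loss. The layer outputs are random arrays standing in for real activations, and the function names and the `top_k`, `tau`, and `beta` values are illustrative placeholders.

```python
import numpy as np


def ema_update(teacher, student, tau):
    """Teacher weights track the student: teacher <- tau * teacher + (1 - tau) * student."""
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}


def instance_norm(x, eps=1e-5):
    """Normalize each feature channel over the time axis; x has shape (T, D)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def build_targets(teacher_layer_outputs, top_k):
    """Average of the instance-normalized outputs of the top-K teacher blocks."""
    top = teacher_layer_outputs[-top_k:]
    return np.mean([instance_norm(h) for h in top], axis=0)


def smooth_l1(pred, target, beta=0.25):
    """Smooth L1: quadratic for |diff| < beta, linear beyond (beta is a placeholder)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()


def masked_loss(student_pred, targets, mask, beta=0.25):
    """Regress targets only at masked time steps; mask is a boolean array of shape (T,)."""
    return smooth_l1(student_pred[mask], targets[mask], beta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, num_layers = 50, 16, 12  # toy sizes, not the real model's dimensions
    teacher_layers = [rng.standard_normal((T, D)) for _ in range(num_layers)]
    targets = build_targets(teacher_layers, top_k=8)
    student_pred = rng.standard_normal((T, D))  # stand-in for the student's predictions
    mask = rng.random(T) < 0.5  # time steps that were masked in the student's input
    print("masked loss:", masked_loss(student_pred, targets, mask))

    teacher = {"w": np.ones(3)}
    student = {"w": np.zeros(3)}
    print("EMA teacher weights:", ema_update(teacher, student, tau=0.999)["w"])
```

Only the student is trained by gradient descent; the teacher's weights follow it through the EMA update, which is what makes the setup self-distillation rather than training against fixed labels.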

Pre-Training Method


For more information, please refer to the official paper.

Usage

See [this notebook](https://colab.research.google.com/drive/1FjTsqbYKphl9kL-eILgUc-bl4zVThL8F?usp=sharing) for more information on how to fine-tune the model.
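After fine-tuning with a CTC head (in Transformers, `Data2VecAudioForCTC`), the model outputs one vector of logits per audio frame, and the transcript is recovered by CTC decoding. The sketch below shows greedy (best-path) decoding: take the most likely token per frame, merge consecutive repeats, then drop blanks. The tiny vocabulary is hypothetical, and treating `[PAD]` as the blank token follows the `Wav2Vec2CTCTokenizer` convention; in practice you would use the tokenizer's or processor's `batch_decode` rather than hand-rolling this.

```python
import numpy as np

# Hypothetical character vocabulary; "[PAD]" doubles as the CTC blank token,
# following the convention of Transformers' Wav2Vec2CTCTokenizer.
ID_TO_TOKEN = ["|", "a", "c", "t", "[UNK]", "[PAD]"]
BLANK_ID = ID_TO_TOKEN.index("[PAD]")


def ctc_greedy_decode(logits: np.ndarray) -> str:
    """Best-path CTC decoding of per-frame logits with shape (frames, vocab_size)."""
    best_path = logits.argmax(axis=-1).tolist()
    tokens = []
    previous = None
    for idx in best_path:
        # Merge consecutive repeats first, then drop blanks
        if idx != previous and idx != BLANK_ID:
            tokens.append(ID_TO_TOKEN[idx])
        previous = idx
    # "|" marks word boundaries
    return "".join(tokens).replace("|", " ").strip()


if __name__ == "__main__":
    # One-hot "logits" for the frame path: c c _ a t t | c a a t  (_ = blank)
    path = [2, 2, 5, 1, 3, 3, 0, 2, 1, 1, 3]
    logits = np.eye(len(ID_TO_TOKEN))[path]
    print(ctc_greedy_decode(logits))  # prints: cat cat
```

A blank between two identical ids (for example `a _ a`) is what lets CTC emit a genuine double letter, since repeats are merged only when they are adjacent.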

The original model can be found under https://github.com/pytorch/fairseq/tree/main/examples/data2vec.
