Data2vec Audio Base
The base audio model of data2vec, a general self-supervised learning framework developed by Facebook AI that applies the same learning method to speech, text, and vision
Downloads: 5,694
Release date: 3/2/2022
Model Overview
A self-supervised model pre-trained on speech audio sampled at 16 kHz. It uses the unified data2vec framework, which works across modalities by predicting contextualized latent representations of the full input rather than modality-specific targets; a minimal feature-extraction sketch is shown below.
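The sketch below shows one way to extract latent speech features with the Hugging Face transformers library. The checkpoint id facebook/data2vec-audio-base and the 768-dimensional output are assumptions based on the model name and the base architecture.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioModel

# Checkpoint id assumed from the model name; verify it on the Hugging Face Hub.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base")
model = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base")

# One second of 16 kHz audio; replace with a real waveform loaded at 16 kHz.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextualized latent representations: one 768-dim vector per ~20 ms frame.
print(outputs.last_hidden_state.shape)
```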
Model Features
Multi-modal unified framework
The first framework to apply the same self-supervised learning method to speech, NLP, and computer vision
Global representation prediction
Predicts latent representations that carry global contextual information, rather than modality-specific local targets such as words or visual tokens
Self-distillation architecture
A student network receives a masked view of the input and predicts the latent representations an EMA teacher network produces for the complete input, a form of self-distillation (illustrated in the sketch after this list)
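As a toy illustration of this objective (not the original fairseq implementation), the sketch below uses a small PyTorch transformer: the EMA teacher encodes the full input, the student encodes a masked view, and the student regresses the teacher's average of the top-K layer outputs at the masked positions. All sizes, the mask rate, and other details are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, LAYERS, TOP_K, EMA_DECAY = 256, 6, 4, 0.999  # illustrative hyperparameters

layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
student = nn.TransformerEncoder(layer, num_layers=LAYERS)
teacher = copy.deepcopy(student)            # EMA copy, never updated by gradients
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)
mask_emb = nn.Parameter(torch.zeros(DIM))   # learned embedding for masked frames

def layer_outputs(encoder, x):
    """Return the output of every transformer layer."""
    outs = []
    for blk in encoder.layers:
        x = blk(x)
        outs.append(x)
    return outs

def data2vec_loss(frames, mask):
    # Teacher target: average of the top-K layer outputs on the *unmasked* input,
    # instance-normalized over time.
    with torch.no_grad():
        target = torch.stack(layer_outputs(teacher, frames)[-TOP_K:]).mean(0)
        target = F.instance_norm(target.transpose(1, 2)).transpose(1, 2)
    # Student sees a masked view of the same input.
    masked = torch.where(mask.unsqueeze(-1), mask_emb.expand_as(frames), frames)
    pred = layer_outputs(student, masked)[-1]
    # Regress the targets only at the masked positions.
    return F.smooth_l1_loss(pred[mask], target[mask])

@torch.no_grad()
def ema_update():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(EMA_DECAY).add_(p_s, alpha=1 - EMA_DECAY)

# One illustrative step on random "feature frames" standing in for CNN-encoder output.
frames = torch.randn(2, 50, DIM)      # (batch, time, dim)
mask = torch.rand(2, 50) < 0.5        # random mask, just for the demo
data2vec_loss(frames, mask).backward()
ema_update()
```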
Model Capabilities
Speech feature extraction
Cross-modal representation learning
Speech recognition base model (requires fine-tuning)
Use Cases
Speech processing
Speech recognition system
Used as a base model that is fine-tuned for ASR tasks (a minimal fine-tuning sketch follows this section)
The data2vec paper reports state-of-the-art results on the LibriSpeech benchmark
Speech content analysis
Extracts deep semantic representations of speech for content understanding
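Assuming the checkpoint is published as facebook/data2vec-audio-base on the Hugging Face Hub, the sketch below shows the general shape of CTC fine-tuning for ASR with the transformers library; it is a minimal illustration with dummy data, not a full training recipe, and the vocabulary size and optimizer settings are assumptions.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioForCTC

# Checkpoint id, vocabulary size, and optimizer settings are illustrative assumptions.
model = Data2VecAudioForCTC.from_pretrained(
    "facebook/data2vec-audio-base",
    vocab_size=32,                 # size of your character vocabulary
    ctc_loss_reduction="mean",
    pad_token_id=0,
)
model.freeze_feature_encoder()     # common practice: keep the CNN frontend frozen

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One illustrative step on dummy data; real fine-tuning uses transcribed 16 kHz
# speech, a character tokenizer, and many such batches.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.randint(1, 32, (1, 20))   # dummy character ids

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```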