
Data2vec Audio Base

Developed by facebook
A general self-supervised learning framework developed by Facebook that applies the same learning method to speech, text, and vision
Downloads 5,694
Release Time: 3/2/2022

Model Overview

A general self-supervised learning model pre-trained on speech audio sampled at 16 kHz. It is built on a unified framework that uses the same learning method across modalities: the model predicts latent representations of the input rather than modality-specific targets.
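To make the 16 kHz input requirement concrete, the sketch below estimates how many feature frames a waveform yields. It assumes a wav2vec 2.0 style convolutional feature encoder with the kernel sizes and strides listed in the code; those values are not stated on this page and should be checked against the checkpoint's configuration. The helper names `num_frames` and `total_stride` are illustrative.

```python
# Assumption (not stated in the source): a wav2vec 2.0 style convolutional
# feature encoder with these kernel sizes and strides.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # from the model overview: 16 kHz speech audio


def num_frames(num_samples: int) -> int:
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0  # input too short for this convolution layer
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Product of all strides: the hop between frames, in input samples."""
    product = 1
    for stride in CONV_STRIDES:
        product *= stride
    return product


if __name__ == "__main__":
    one_second = SAMPLE_RATE
    print(num_frames(one_second))                 # frames for 1 s of audio
    print(1000 * total_stride() / SAMPLE_RATE)    # frame hop in milliseconds
```

Under these assumptions, audio recorded at another sample rate would need resampling to 16 kHz first, since the frame hop is defined in samples rather than in seconds.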

Model Features

Multi-modal unified framework
Described as the first unified self-supervised learning architecture covering speech, natural language processing (NLP), and computer vision (CV)
Global representation prediction
Predicts contextualized latent representations that carry information from the entire input, rather than traditional local targets (e.g., words or visual tokens)
Self-distillation architecture
Given a masked view of the input, predicts the latent representations of the complete input, so the model effectively distills knowledge from itself
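The features above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released training code: it assumes the teacher is an exponential moving average (EMA) of the student, that targets are the average of the teacher's top k layer outputs on the unmasked input, and that the loss is a mean squared error over masked time steps only. The function names, the choice of loss, and the values of `tau` and `k` are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def ema_update(teacher: Vector, student: Vector, tau: float) -> Vector:
    """Move teacher weights toward student weights (assumed EMA teacher)."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def average_top_k(layer_outputs: Sequence[Sequence[Vector]], k: int) -> List[Vector]:
    """Average the last k layers; layer_outputs[layer][time] is one vector."""
    top = layer_outputs[-k:]
    num_steps = len(top[0])
    dim = len(top[0][0])
    return [
        [sum(layer[t][d] for layer in top) / len(top) for d in range(dim)]
        for t in range(num_steps)
    ]


def masked_mse(predictions: Sequence[Vector],
               targets: Sequence[Vector],
               mask: Sequence[bool]) -> float:
    """Mean squared error computed only at masked time steps."""
    total, count = 0.0, 0
    for pred, tgt, is_masked in zip(predictions, targets, mask):
        if not is_masked:
            continue  # unmasked steps do not contribute to the loss
        total += sum((p - q) ** 2 for p, q in zip(pred, tgt))
        count += len(pred)
    return total / count if count else 0.0


if __name__ == "__main__":
    # Teacher sees the full input; targets average its top 2 layers.
    teacher_layers = [[[0.0], [0.0]], [[2.0], [1.0]], [[4.0], [3.0]]]
    targets = average_top_k(teacher_layers, k=2)
    # Student sees a masked view and predicts targets at masked steps.
    student_predictions = [[2.5], [0.0]]
    print(masked_mse(student_predictions, targets, mask=[True, False]))
```

Because the targets come from whole-sequence layer outputs rather than from a fixed vocabulary of words or tokens, the same recipe can in principle be reused across modalities, which is the point the feature list makes.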

Model Capabilities

Speech feature extraction
Cross-modal representation learning
Speech recognition base model (requires fine-tuning)

Use Cases

Speech processing
Speech recognition system
Used as a base model for fine-tuning in ASR tasks
The paper reports state-of-the-art or competitive speech recognition results on the LibriSpeech benchmark
Speech content analysis
Extracts deep semantic representations of speech for content understanding
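For content analysis, frame-level features are often reduced to one vector per utterance and then compared. The sketch below shows one simple way to do that, using mean pooling over time followed by cosine similarity. Mean pooling is an assumption made for illustration; this page does not say how utterance-level representations should be derived, and the toy vectors stand in for real model outputs.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level feature vectors over time into one vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy stand-ins for per-frame features of two utterances.
    utterance_a = [[1.0, 2.0], [3.0, 4.0]]
    utterance_b = [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]
    print(cosine_similarity(mean_pool(utterance_a), mean_pool(utterance_b)))
```

Utterances of different lengths pool to vectors of the same dimension, which is what makes the comparison possible.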