Data2vec Audio Base
The base audio model of data2vec, a general self-supervised learning framework developed by Facebook AI that applies the same learning method to speech, text, and vision
Downloads: 5,694
Release date: 3/2/2022
Model Overview
A self-supervised model pre-trained on speech audio sampled at 16 kHz. It uses the unified data2vec framework, which works across modalities by predicting contextualized latent representations of the full input rather than modality-specific targets; a minimal feature-extraction sketch is shown below.
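The sketch below shows one way to extract latent speech features with the Hugging Face transformers library. The checkpoint id facebook/data2vec-audio-base and the 768-dimensional output are assumptions based on the model name and the base architecture.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioModel

# Checkpoint id assumed from the model name; verify it on the Hugging Face Hub.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base")
model = Data2VecAudioModel.from_pretrained("facebook/data2vec-audio-base")

# One second of 16 kHz audio; replace with a real waveform loaded at 16 kHz.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextualized latent representations: one 768-dim vector per ~20 ms frame.
print(outputs.last_hidden_state.shape)
```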
Model Features
Multi-modal unified framework
The first framework to apply the same self-supervised learning method to speech, NLP, and computer vision
Global representation prediction
Predicts latent representations that carry global contextual information, rather than modality-specific local targets such as words or visual tokens
Self-distillation architecture
A student network receives a masked view of the input and predicts the latent representations an EMA teacher network produces for the complete input, a form of self-distillation (illustrated in the sketch after this list)
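As a toy illustration of this objective (not the original fairseq implementation), the sketch below uses a small PyTorch transformer: the EMA teacher encodes the full input, the student encodes a masked view, and the student regresses the teacher's average of the top-K layer outputs at the masked positions. All sizes, the mask rate, and other details are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, LAYERS, TOP_K, EMA_DECAY = 256, 6, 4, 0.999  # illustrative hyperparameters

layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
student = nn.TransformerEncoder(layer, num_layers=LAYERS)
teacher = copy.deepcopy(student)            # EMA copy, never updated by gradients
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)
mask_emb = nn.Parameter(torch.zeros(DIM))   # learned embedding for masked frames

def layer_outputs(encoder, x):
    """Return the output of every transformer layer."""
    outs = []
    for blk in encoder.layers:
        x = blk(x)
        outs.append(x)
    return outs

def data2vec_loss(frames, mask):
    # Teacher target: average of the top-K layer outputs on the *unmasked* input,
    # instance-normalized over time.
    with torch.no_grad():
        target = torch.stack(layer_outputs(teacher, frames)[-TOP_K:]).mean(0)
        target = F.instance_norm(target.transpose(1, 2)).transpose(1, 2)
    # Student sees a masked view of the same input.
    masked = torch.where(mask.unsqueeze(-1), mask_emb.expand_as(frames), frames)
    pred = layer_outputs(student, masked)[-1]
    # Regress the targets only at the masked positions.
    return F.smooth_l1_loss(pred[mask], target[mask])

@torch.no_grad()
def ema_update():
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(EMA_DECAY).add_(p_s, alpha=1 - EMA_DECAY)

# One illustrative step on random "feature frames" standing in for CNN-encoder output.
frames = torch.randn(2, 50, DIM)      # (batch, time, dim)
mask = torch.rand(2, 50) < 0.5        # random mask, just for the demo
data2vec_loss(frames, mask).backward()
ema_update()
```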
Model Capabilities
Speech feature extraction
Cross-modal representation learning
Speech recognition base model (requires fine-tuning)
Use Cases
Speech processing
Speech recognition system
Used as a base model that is fine-tuned for ASR tasks (a minimal fine-tuning sketch follows this section)
The data2vec paper reports state-of-the-art results on the LibriSpeech benchmark
Speech content analysis
Extracts deep semantic representations of speech for content understanding
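Assuming the checkpoint is published as facebook/data2vec-audio-base on the Hugging Face Hub, the sketch below shows the general shape of CTC fine-tuning for ASR with the transformers library; it is a minimal illustration with dummy data, not a full training recipe, and the vocabulary size and optimizer settings are assumptions.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioForCTC

# Checkpoint id, vocabulary size, and optimizer settings are illustrative assumptions.
model = Data2VecAudioForCTC.from_pretrained(
    "facebook/data2vec-audio-base",
    vocab_size=32,                 # size of your character vocabulary
    ctc_loss_reduction="mean",
    pad_token_id=0,
)
model.freeze_feature_encoder()     # common practice: keep the CNN frontend frozen

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One illustrative step on dummy data; real fine-tuning uses transcribed 16 kHz
# speech, a character tokenizer, and many such batches.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.randint(1, 32, (1, 20))   # dummy character ids

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```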