AV HuBERT MuAViC Ru

Developed by nguyenvulebinh

AV-HuBERT is an audio-visual speech recognition model trained on the MuAViC multilingual audio-visual corpus, combining audio and visual modalities for robust performance.

Audio-to-Text

Transformers

#Audio-Visual Speech Recognition #Multilingual Support #Self-Supervised Learning

Downloads 91

Release Time : 3/6/2025

Model Overview

AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, achieving robust performance by integrating audio and visual modalities, especially excelling in noisy environments.

Model Features

Multilingual Support

Supports multiple languages including Arabic, German, Greek, English, Spanish, French, Italian, Portuguese, and Russian.

Audio-Visual Integration

Combines audio and visual modalities to enhance speech recognition performance in noisy environments.

Pre-trained Model

Provides pre-trained models fine-tuned on the MuAViC dataset for quick deployment.

Model Capabilities

Audio-Visual Speech Recognition

Multilingual Speech Recognition

Speech Recognition in Noisy Environments

Use Cases

Speech Recognition

Multilingual Speech Transcription

Convert speech in multiple languages to text

Speech Recognition in Noisy Environments

Perform speech recognition in environments with significant background noise

Improves recognition accuracy by incorporating visual information

library_name: transformers tags: []

Huggingface Implementation of AV-HuBERT on the MuAViC Dataset

This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, specifically trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, leveraging both audio and visual modalities to achieve robust performance, especially in noisy environments.

Key features of this repository include:

Pre-trained Models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained model been exported from MuAViC repository.
Inference scripts: Easily pipelines using Huggingface’s interface.
Data preprocessing scripts: Including normalize frame rate, extract lips and audio.

Inference code

git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py

from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer
import torch

if __name__ == "__main__":
    # Choose language to run example
    AVAILABEL_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABEL_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABEL_LANGUAGES}"
    
    
    # Load model and tokenizer
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    
    model = model.cuda().eval()
    
    # Load example video and audio
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        print(f"WARNING: Example video and audio for {language} is not available english will be used instead")
        video_example = f"./example/video_processed/en_lip_movement.mp4"
        audio_example = f"./example/video_processed/en_audio.wav"
    
    # Load and process example
    sample = load_feature(
        video_example,
        audio_example
    )
    
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()
    
    # Generate text
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )

    print(tokenizer.batch_decode(output, skip_special_tokens=True))

Data preprocessing scripts

mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy .
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat .

# raw video only support 4:3 ratio now
cp raw_video.mp4 ./example/ 

python src/dataset/video_to_audio_lips.py

Pretrained AVSR model

Languages	Huggingface
Arabic	Checkpoint-AR
German	Checkpoint-DE
Greek	Checkpoint-EL
English	Checkpoint-EN
Spanish	Checkpoint-ES
French	Checkpoint-FR
Italian	Checkpoint-IT
Portuguese	Checkpoint-PT
Russian	Checkpoint-RU
Multilingual	Checkpoint-ar_de_el_es_fr_it_pt_ru

Acknowledgments

AV-HuBERT: A significant portion of the codebase in this repository has been adapted from the original AV-HuBERT implementation.

MuAViC Repository: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project

License

CC-BY-NC 4.0

Citation

@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご