AV-HuBERT: An Open-Source Multilingual Audio-Visual Speech Recognition Model - Combining Audio and Visual Modalities for More Robust Performance

Developed by nguyenvulebinh

A multilingual audio-visual speech recognition model based on the MuAViC dataset, combining audio and visual modalities for robust performance

Audio-to-Text

Transformers

#Audio-Visual Speech Recognition #Multimodal Fusion #Multilingual Support

Release Time : 8/30/2024

Model Overview

AV-HuBERT is a self-supervised model for audio-visual speech recognition. By integrating audio and visual modalities, it achieves robust performance, particularly in noisy environments.

Model Features

Multimodal Fusion

Processes both audio and video inputs simultaneously, leveraging lip movement information to enhance speech recognition

Multilingual Support

Supports Arabic, German, Greek, English, Spanish, French, Italian, Portuguese, and Russian, with a separate multilingual checkpoint

Noise Robustness

Improves recognition accuracy in noisy environments by supplementing audio signals with visual information

Model Capabilities

Audio-Visual Speech Recognition

Multilingual Speech-to-Text

Speech Processing in Noisy Environments

Use Cases

Speech Recognition

Meeting Transcription

Automatically generates transcripts during video conferences

Improves recognition accuracy in noisy environments

Accessibility Applications

Provides real-time captioning services for people who are deaf or hard of hearing

Enhances comprehension by incorporating lip movement information

Education

Language Learning

Helps learners improve pronunciation by observing lip movements

Provides more accurate pronunciation feedback

🚀 Hugging Face Implementation of AV-HuBERT on the MuAViC Dataset

This repository offers a Hugging Face implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model for audio-visual speech recognition that uses both audio and visual modalities to achieve strong performance, especially in noisy environments.

✨ Features

Pre-trained models: AV-HuBERT models fine-tuned on the MuAViC dataset, exported from the MuAViC repository.
Inference scripts: Set up inference pipelines through Hugging Face's interface.
Data preprocessing scripts: Normalize frame rates and extract the lip region and the audio track from raw video.

📦 Installation

First, clone the repository and set up the environment:

git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt

💻 Usage Examples

Basic Usage

Here is the code to run an example:

python run_example.py

import os  # needed for the os.path.exists checks below

from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer
import torch

if __name__ == "__main__":
    # Choose language to run example
    AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABLE_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABLE_LANGUAGES}"
    
    
    # Load model and tokenizer
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    
    model = model.cuda().eval()
    
    # Load example video and audio
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        print(f"WARNING: Example video and audio for {language} are not available; English will be used instead")
        video_example = "./example/video_processed/en_lip_movement.mp4"
        audio_example = "./example/video_processed/en_audio.wav"
    
    # Load and process example
    sample = load_feature(
        video_example,
        audio_example
    )
    
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()
    
    # Generate text
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )

    print(tokenizer.batch_decode(output, skip_special_tokens=True))
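
The example above calls `.cuda()` unconditionally, so it fails on a machine without a CUDA-capable GPU. A minimal sketch of a device fallback follows. The helper name `pick_device` is hypothetical and not part of the repository, and whether CPU inference is practical for these checkpoints is an assumption that has not been verified here.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Selected device: {device}")
    # In the example script, the hard-coded .cuda() calls could then become:
    #   model = model.to(device).eval()
    #   audio_feats = sample["audio_source"].to(device)
    #   video_feats = sample["video_source"].to(device)
```

With this in place, the attention mask would also need to be created on the same device as `audio_feats` rather than moved with `.cuda()`.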

Advanced Usage - Data Preprocessing

# Download the mean-face template and the 68-point face landmark predictor
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..

# Raw video currently supports only a 4:3 aspect ratio
cp raw_video.mp4 ./example/

python src/dataset/video_to_audio_lips.py
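
Because the preprocessing step is stated to support only 4:3 video, it may help to check a file before running the script. The sketch below is not part of the repository: `is_four_by_three` and `probe_resolution` are hypothetical helpers, and `probe_resolution` assumes the `ffprobe` binary from FFmpeg is on `PATH`. It compares stored pixel dimensions only, so a video with non-square pixels could be misjudged.

```python
import subprocess
from fractions import Fraction
from typing import Tuple


def is_four_by_three(width: int, height: int) -> bool:
    """Return True when width:height reduces exactly to 4:3."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return Fraction(width, height) == Fraction(4, 3)


def probe_resolution(path: str) -> Tuple[int, int]:
    """Read (width, height) of the first video stream using ffprobe."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=width,height",
            "-of", "csv=p=0",
            path,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # ffprobe prints a line such as "640,480" for the selected stream
    width, height = result.stdout.strip().splitlines()[0].split(",")[:2]
    return int(width), int(height)


if __name__ == "__main__":
    # Pure check on known dimensions; no video file is needed for this part
    print(is_four_by_three(640, 480))
    print(is_four_by_three(1920, 1080))
```

For a real file, `is_four_by_three(*probe_resolution("./example/raw_video.mp4"))` would combine the two helpers.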

📚 Documentation

Pretrained AVSR Model

Languages	Huggingface
Arabic	Checkpoint - AR
German	Checkpoint - DE
Greek	Checkpoint - EL
English	Checkpoint - EN
Spanish	Checkpoint - ES
French	Checkpoint - FR
Italian	Checkpoint - IT
Portuguese	Checkpoint - PT
Russian	Checkpoint - RU
Multilingual	Checkpoint - ar_de_el_es_fr_it_pt_ru
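
The usage example builds each model ID as `nguyenvulebinh/AV-HuBERT-MuAViC-{language}`, using the language codes from its own list. Assuming the table rows map to those codes in the obvious way, a small lookup can turn a language name into a model ID. The names `LANGUAGE_CODES` and `checkpoint_id` are hypothetical and exist only for this sketch.

```python
# Language names from the table, mapped to the codes used in the usage example
LANGUAGE_CODES = {
    "Arabic": "ar",
    "German": "de",
    "Greek": "el",
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "Italian": "it",
    "Portuguese": "pt",
    "Russian": "ru",
    "Multilingual": "multilingual",
}


def checkpoint_id(language: str) -> str:
    """Build the model ID for a language name listed in the table."""
    try:
        code = LANGUAGE_CODES[language]
    except KeyError:
        choices = ", ".join(LANGUAGE_CODES)
        raise ValueError(
            f"Unknown language {language!r}; choose one of: {choices}"
        ) from None
    return f"nguyenvulebinh/AV-HuBERT-MuAViC-{code}"


if __name__ == "__main__":
    print(checkpoint_id("Russian"))
```

The returned string can be passed as `model_name_or_path` to the `from_pretrained` calls shown in the usage example.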

📄 License

This project is licensed under CC-BY-NC 4.0.

📚 Acknowledgments

AV-HuBERT: A large part of the codebase in this repository is adapted from the original AV-HuBERT implementation.
MuAViC Repository: We thank the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project.

📖 Citation

@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}
