AV-HuBERT-MuAViC-multilingual: Open-Source Audio-Visual Speech Recognition Model

AV HuBERT MuAViC Multilingual

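Developed by nguyenvulebinh

An audio-visual speech recognition model trained on the MuAViC dataset. It combines the audio and visual modalities to improve recognition in noisy environments.

Audio-to-text

Transformers

#MultimodalSpeechRecognition #AudioVisualFusion #MultilingualSupport

Downloads: 165

Release date: 3/6/2025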

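Model Overview

AV-HuBERT is a self-supervised model for audio-visual speech recognition. It uses both the audio and visual modalities for robust recognition and performs especially well in noisy environments.

Model Features

Multimodal fusion

Uses audio and visual (lip movement) information together for speech recognition.

Multilingual support

Recognizes 9 languages: Arabic, German, Greek, English, Spanish, French, Italian, Portuguese, and Russian.

Noise robustness

Maintains relatively high recognition accuracy in noisy environments.

Pretrained models

Provides pretrained models fine-tuned on the MuAViC dataset.

Model Capabilities

Audio-visual speech recognition

Multilingual speech transcription

Speech processing in noisy environments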

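Use Cases

Speech recognition

Meeting transcription

Accurately record what is said in noisy meeting environments

Use visual information to improve recognition accuracy

Video subtitle generation

Automatically generate subtitles for video content

Use lip movement information to improve transcription quality

Assistive technology

Hearing assistance

Help people with hearing impairments understand spoken content

Supplement audio information with visual information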

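🚀 Hugging Face Implementation of AV-HuBERT on the MuAViC Dataset

This repository contains a Hugging Face implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model for audio-visual speech recognition that uses both the audio and visual modalities to achieve robust performance, especially in noisy environments.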

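✨ Key Features

Pretrained models: pretrained AV-HuBERT models fine-tuned on the MuAViC dataset are available. The pretrained models are exported from the MuAViC repository.
Inference script: run inference easily through the Hugging Face interface.
Data preprocessing scripts: cover frame-rate normalization and extraction of the lip region and audio.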
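
Two of the preprocessing steps listed above, frame-rate normalization and audio extraction, can be sketched with ffmpeg. This is a minimal sketch, not the repository's script: the 25 fps and 16 kHz mono targets, the function names, and the file names are assumptions chosen for illustration. Lip-region extraction is left out because it depends on the face-landmark models.

```python
import shlex
import subprocess

# Assumed targets; check src/dataset/video_to_audio_lips.py for the values
# the repository actually uses.
TARGET_FPS = 25
TARGET_SAMPLE_RATE = 16000


def normalize_fps_cmd(src, dst, fps=TARGET_FPS):
    """Build an ffmpeg command that re-encodes the video at a fixed frame rate, dropping audio."""
    return ["ffmpeg", "-y", "-i", src, "-r", str(fps), "-an", dst]


def extract_audio_cmd(src, dst, sample_rate=TARGET_SAMPLE_RATE):
    """Build an ffmpeg command that writes the audio track as a mono WAV file."""
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", str(sample_rate), dst]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs ffmpeg on PATH)."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed as a dry run.
    run(normalize_fps_cmd("raw_video.mp4", "video_25fps.mp4"))
    run(extract_audio_cmd("raw_video.mp4", "audio.wav"))
```

Building the commands as lists keeps file names with spaces safe and lets you inspect them before anything is executed.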

📦 Installation

Clone the repository and install the dependencies:

git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
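
After installation, a quick check can confirm that the libraries the usage example imports are visible in the active environment. This is a minimal sketch: `missing_modules` is a hypothetical helper, not part of the repository, and the module list is limited to the two third-party packages that the example imports directly.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    missing = missing_modules(["torch", "transformers"])
    print("Missing modules:", ", ".join(missing) if missing else "none")
```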

💻 Usage Examples

Basic usage

Run the example script with:

python run_example.py

The Python example code is as follows:

import os

import torch
from transformers import Speech2TextTokenizer

from src.dataset.load_data import load_feature
from src.model.avhubert2text import AV2TextForConditionalGeneration

if __name__ == "__main__":
    # Choose the language to run the example with
    AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABLE_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABLE_LANGUAGES}"

    # Load model and tokenizer (downloaded into ./model-bin on first use)
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')

    # The example requires a CUDA-capable GPU
    model = model.cuda().eval()

    # Load example video and audio, falling back to the English sample if needed
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        print(f"WARNING: Example video and audio for {language} are not available; English will be used instead")
        video_example = "./example/video_processed/en_lip_movement.mp4"
        audio_example = "./example/video_processed/en_audio.wav"

    # Load and process example
    sample = load_feature(
        video_example,
        audio_example
    )

    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    # Boolean mask of shape (batch, last audio dimension), filled with False
    attention_mask = torch.zeros(
        audio_feats.size(0), audio_feats.size(-1), dtype=torch.bool, device=audio_feats.device
    )

    # Generate text
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )

    print(tokenizer.batch_decode(output, skip_special_tokens=True))
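
The example hard-codes `language` and falls back to the English sample when the files for the chosen language are missing. The same selection logic can be factored into small helpers and driven from the command line. This is a minimal sketch using only the standard library; `parse_language` and `example_paths` are hypothetical names, not functions from the repository.

```python
import argparse
from pathlib import Path

LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]


def parse_language(argv=None):
    """Read --language from the command line, restricted to the supported codes."""
    parser = argparse.ArgumentParser(description="Pick a language for the AV-HuBERT example")
    parser.add_argument("--language", choices=LANGUAGES, default="ru")
    return parser.parse_args(argv).language


def example_paths(language, root="./example/video_processed"):
    """Return (video, audio) paths, falling back to the English sample if either file is missing."""
    root = Path(root)
    video = root / f"{language}_lip_movement.mp4"
    audio = root / f"{language}_audio.wav"
    if not (video.exists() and audio.exists()):
        video = root / "en_lip_movement.mp4"
        audio = root / "en_audio.wav"
    return video, audio


if __name__ == "__main__":
    language = parse_language()
    video, audio = example_paths(language)
    print(language, video, audio)
```

With `choices=LANGUAGES`, argparse rejects an unsupported code before any model is downloaded, which replaces the `assert` in the example.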

Data preprocessing script

Run the following commands to preprocess the data:

mkdir -p model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..

# Raw video currently supports only a 4:3 aspect ratio
cp raw_video.mp4 ./example/

python src/dataset/video_to_audio_lips.py
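
Because the preprocessing script currently expects 4:3 input, it can help to check a clip's dimensions before copying it into `./example/`. This is a minimal sketch: `is_4_3` is a hypothetical helper, and the width and height are assumed to come from whatever tool you already use to inspect the video (for example ffprobe or OpenCV).

```python
from fractions import Fraction


def is_4_3(width, height):
    """Return True when the frame dimensions reduce exactly to a 4:3 aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return Fraction(width, height) == Fraction(4, 3)


if __name__ == "__main__":
    for size in [(640, 480), (1280, 720)]:
        print(size, "4:3" if is_4_3(*size) else "not 4:3")
```

Comparing exact fractions avoids floating-point rounding, at the cost of rejecting dimensions that are only approximately 4:3.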

📚 Documentation

Pretrained AVSR models

Pretrained model links for each language:

Language	Hugging Face link
Arabic	Checkpoint-AR
German	Checkpoint-DE
Greek	Checkpoint-EL
English	Checkpoint-EN
Spanish	Checkpoint-ES
French	Checkpoint-FR
Italian	Checkpoint-IT
Portuguese	Checkpoint-PT
Russian	Checkpoint-RU
Multilingual	Checkpoint-ar_de_el_es_fr_it_pt_ru
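
The table can also be expressed as a lookup from language name to Hub repository ID. This is a minimal sketch: the ID pattern `nguyenvulebinh/AV-HuBERT-MuAViC-{code}` is taken from the usage example, and `checkpoint_id` is a hypothetical helper. Confirm each ID on the Hub before relying on it.

```python
# Language names from the table, codes from the usage example.
LANGUAGE_CODES = {
    "Arabic": "ar",
    "German": "de",
    "Greek": "el",
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "Italian": "it",
    "Portuguese": "pt",
    "Russian": "ru",
    "Multilingual": "multilingual",
}


def checkpoint_id(language):
    """Map a language name (case-insensitive) or code to the assumed Hub repository ID."""
    by_name = {name.lower(): code for name, code in LANGUAGE_CODES.items()}
    key = language.strip().lower()
    code = by_name.get(key, key)
    if code not in by_name.values():
        raise KeyError(f"Unsupported language: {language!r}")
    return f"nguyenvulebinh/AV-HuBERT-MuAViC-{code}"


if __name__ == "__main__":
    for name in ("Russian", "fr", "Multilingual"):
        print(name, "->", checkpoint_id(name))
```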

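Acknowledgments

AV-HuBERT: most of the code in this repository is adapted from the original AV-HuBERT implementation.
MuAViC repository: we thank the creators of the MuAViC dataset and repository, which provided the pretrained models for this project.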

📄 License

This project is licensed under CC-BY-NC 4.0.

Citation

If you use this project, please cite the following paper:

@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}