AV-HuBERT-MuAViC-multilingual: Open-Source Audio-Visual Speech Recognition Model

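Developed by nguyenvulebinh

An audio-visual speech recognition model trained on the MuAViC dataset. It combines the audio and visual modalities to improve recognition in noisy environments.

Task: audio-to-text

Library: Transformers

Tags: #multimodal-speech-recognition #audio-visual-fusion #multilingual

Downloads: 165

Released: 3/6/2025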

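Model Overview

AV-HuBERT is a self-supervised model for audio-visual speech recognition. It uses both the audio and visual modalities, which makes it particularly robust in noisy environments.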

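Model Features

Multimodal fusion

Uses audio and visual (lip movement) information together for speech recognition.

Multilingual support

Recognizes 9 languages, including English, French, and Russian.

Noise robustness

Maintains relatively high recognition accuracy in noisy environments.

Pretrained models

Provides pretrained models fine-tuned on the MuAViC dataset.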

Model Capabilities

Audio-visual speech recognition

Multilingual speech transcription

Speech processing in noisy environments

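Use Cases

Speech recognition

Meeting transcription

Accurately records what speakers say in noisy meeting environments.

Uses visual information to improve recognition accuracy.

Video subtitle generation

Automatically generates subtitles for video content.

Uses lip movement information to improve transcription quality.

Assistive technology

Hearing assistance

Helps people with hearing impairments understand spoken content.

Supplements the audio signal with visual information.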

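🚀 Hugging Face Implementation of AV-HuBERT on the MuAViC Dataset

This repository contains a Hugging Face implementation of AV-HuBERT (Audio-Visual Hidden Unit BERT), trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model for audio-visual speech recognition. It uses both the audio and visual modalities to achieve strong performance, especially in noisy environments.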

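✨ Key Features

Pretrained models: AV-HuBERT models fine-tuned on the MuAViC dataset are available. The pretrained models were exported from the MuAViC repository.
Inference script: run inference easily through the Hugging Face interface.
Data preprocessing scripts: normalize the frame rate and extract the lip region and the audio track.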

📦 Installation

Clone the repository and install the dependencies:

git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt

💻 Usage Examples

Basic Usage

Run the example script with:

python run_example.py

The Python code of the example is:

import os

import torch
from transformers import Speech2TextTokenizer

from src.dataset.load_data import load_feature
from src.model.avhubert2text import AV2TextForConditionalGeneration

if __name__ == "__main__":
    # Choose language to run example
    AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABLE_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABLE_LANGUAGES}"
    
    
    # Load model and tokenizer
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    
    model = model.cuda().eval()
    
    # Load example video and audio
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        # Fall back to the English sample when no sample exists for the chosen language
        print(f"WARNING: Example video and audio for {language} are not available; English will be used instead")
        video_example = "./example/video_processed/en_lip_movement.mp4"
        audio_example = "./example/video_processed/en_audio.wav"
    
    # Load and process example
    sample = load_feature(
        video_example,
        audio_example
    )
    
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
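    # All-False boolean mask of shape (batch, time), created on the same device as the audio features
    attention_mask = torch.zeros(audio_feats.size(0), audio_feats.size(-1), dtype=torch.bool, device=audio_feats.device)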
    
    # Generate text
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )

    print(tokenizer.batch_decode(output, skip_special_tokens=True))
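
The fallback step in the example above (use the English sample files when the chosen language has none) can be factored into a small helper. The following is a minimal sketch: the function name `pick_example_files` is hypothetical and not part of the repository, and the directory layout is assumed to match the paths used in the example.

```python
from pathlib import Path


def pick_example_files(language, root="./example/video_processed", fallback="en"):
    """Return (video_path, audio_path) for `language`, or for `fallback` if either file is missing.

    Hypothetical helper: it mirrors the file naming used in the example
    ({language}_lip_movement.mp4 and {language}_audio.wav).
    """
    root = Path(root)
    video = root / f"{language}_lip_movement.mp4"
    audio = root / f"{language}_audio.wav"
    if video.exists() and audio.exists():
        return video, audio
    # Like the original example, the fallback paths are returned
    # without checking that they exist.
    return root / f"{fallback}_lip_movement.mp4", root / f"{fallback}_audio.wav"
```

The returned paths can be converted with `str()` and passed to `load_feature` in place of `video_example` and `audio_example`.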

Data Preprocessing Script

Run the following commands to preprocess your data:

mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..

# Raw video currently must have a 4:3 aspect ratio
cp raw_video.mp4 ./example/

python src/dataset/video_to_audio_lips.py
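
Because the preprocessing script currently expects 4:3 input, it can help to check a video's dimensions before running it. The following is a minimal sketch, assuming the frame width and height are already known (for example, from a media inspection tool). `is_four_by_three` is a hypothetical helper, not part of the repository.

```python
from fractions import Fraction


def is_four_by_three(width, height):
    """Return True if width:height reduces exactly to 4:3."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Fraction compares exactly, so there are no floating-point rounding issues.
    return Fraction(width, height) == Fraction(4, 3)
```

For instance, a 640x480 video passes this check, while a 1920x1080 (16:9) video does not and would need to be cropped or re-encoded first.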

📚 Documentation

Pretrained AVSR Models

Links to the pretrained models for each language:

Language	Hugging Face link
Arabic	Checkpoint-AR
German	Checkpoint-DE
Greek	Checkpoint-EL
English	Checkpoint-EN
Spanish	Checkpoint-ES
French	Checkpoint-FR
Italian	Checkpoint-IT
Portuguese	Checkpoint-PT
Russian	Checkpoint-RU
Multilingual	Checkpoint-ar_de_el_es_fr_it_pt_ru
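
The checkpoint names can also be built in code. This sketch assumes that every checkpoint follows the `nguyenvulebinh/AV-HuBERT-MuAViC-{language}` pattern used in the usage example. `model_id_for` is a hypothetical helper, not part of the repository.

```python
# Language codes taken from the usage example.
AVAILABLE_LANGUAGES = ("ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual")


def model_id_for(language):
    """Build the Hugging Face model ID for a language code (assumed naming pattern)."""
    if language not in AVAILABLE_LANGUAGES:
        raise ValueError(
            f"Language {language!r} is not available; choose one of {AVAILABLE_LANGUAGES}"
        )
    return f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
```

The resulting string is what the usage example passes to `from_pretrained` as `model_name_or_path`.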

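Acknowledgments

AV-HuBERT: Most of the code in this repository is adapted from the original AV-HuBERT implementation.
MuAViC repository: We thank the creators of the MuAViC dataset and repository, which provided the pretrained models used in this project.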

📄 License

This project is licensed under CC-BY-NC 4.0.

Citation

If you use this project, please cite the following paper:

@article{anwar2023muavic,
  title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
  author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
  journal={arXiv preprint arXiv:2303.00628},
  year={2023}
}