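🚀 AV-HuBERT on the MuAViC Dataset, Implemented with Hugging Face
This repository contains a Hugging Face implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model for audio-visual speech recognition. It combines the audio and visual modalities to achieve strong performance, particularly in noisy environments.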
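✨ Key Features
- Pretrained models: AV-HuBERT models fine-tuned on the MuAViC dataset are available. The checkpoints are exported from the MuAViC repository.
- Inference script: run inference easily through the Hugging Face interface.
- Data preprocessing script: normalizes the frame rate and extracts the lip region and the audio track.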
📦 Installation
Clone the repository and install the dependencies:
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
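After installation, a quick sanity check can confirm that the environment is usable. The sketch below is not part of the repository. It assumes only that torch and transformers must be importable (both are imported by the usage example) and that the interpreter should match the Python 3.9 environment created above. The helper names are hypothetical.

```python
import importlib.util
import sys

# Packages imported by the usage example; the full list lives in requirements.txt.
REQUIRED_PACKAGES = ("torch", "transformers")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(expected=(3, 9), version_info=None):
    """Check that the interpreter's major.minor version matches `expected`."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) == tuple(expected)


if __name__ == "__main__":
    if not python_version_ok():
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        print(f"Note: running Python {current}; the conda environment above uses 3.9")
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the script reports missing packages, re-running pip install -r requirements.txt inside the activated conda environment is the first thing to try.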
💻 Usage
Basic usage
Run the example script with:
python run_example.py
The example script looks like this:
import os

import torch
from transformers import Speech2TextTokenizer

from src.dataset.load_data import load_feature
from src.model.avhubert2text import AV2TextForConditionalGeneration

if __name__ == "__main__":
    # Choose the language of the model to load
    AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
    language = "ru"
    assert language in AVAILABLE_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABLE_LANGUAGES}"

    # Load the model and tokenizer, caching the downloaded files in ./model-bin
    model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
    model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
    model = model.cuda().eval()

    # Pick the example inputs, falling back to the English sample if needed
    video_example = f"./example/video_processed/{language}_lip_movement.mp4"
    audio_example = f"./example/video_processed/{language}_audio.wav"
    if not os.path.exists(video_example) or not os.path.exists(audio_example):
        print(f"WARNING: Example video and audio for {language} are not available; English will be used instead")
        video_example = "./example/video_processed/en_lip_movement.mp4"
        audio_example = "./example/video_processed/en_audio.wav"

    # Load the audio and video features and move them to the GPU
    sample = load_feature(
        video_example,
        audio_example
    )
    audio_feats = sample['audio_source'].cuda()
    video_feats = sample['video_source'].cuda()
    # Mask sized from the first and last dimensions of audio_feats, filled with False
    attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()

    # Generate token IDs and decode them to text
    output = model.generate(
        audio_feats,
        attention_mask=attention_mask,
        video=video_feats,
        max_length=1024,
    )
    print(tokenizer.batch_decode(output, skip_special_tokens=True))
Data preprocessing script
Run the following commands to preprocess your own video:
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat
cd ..
cp raw_video.mp4 ./example/
python src/dataset/video_to_audio_lips.py
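The internals of the preprocessing script are not shown in this README. As a rough illustration of two of the steps it is described as performing (frame-rate normalization and audio extraction), the sketch below builds the corresponding ffmpeg commands. It is not the repository's implementation. It assumes ffmpeg is installed and on the PATH, the 25 fps and 16 kHz mono targets are assumed defaults rather than values confirmed by the repository, and the function names and output file names are hypothetical. Lip-region cropping is not covered.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_commands(raw_video, out_dir, fps=25, sample_rate=16000):
    """Build ffmpeg argument lists for frame-rate normalization and audio extraction.

    The fps and sample_rate defaults are assumptions, not values taken from the repository.
    """
    raw_video = Path(raw_video)
    out_dir = Path(out_dir)
    video_out = out_dir / f"{raw_video.stem}_{fps}fps.mp4"
    audio_out = out_dir / f"{raw_video.stem}_audio.wav"

    # Re-encode the video at a fixed output frame rate and drop the audio track (-an).
    video_cmd = ["ffmpeg", "-y", "-i", str(raw_video), "-r", str(fps), "-an", str(video_out)]
    # Drop the video stream (-vn) and write mono audio (-ac 1) at the chosen sample rate (-ar).
    audio_cmd = [
        "ffmpeg", "-y", "-i", str(raw_video),
        "-vn", "-ac", "1", "-ar", str(sample_rate), str(audio_out),
    ]
    return video_cmd, audio_cmd


def run_commands(commands):
    """Run each command in order, raising CalledProcessError if ffmpeg fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_commands(build_ffmpeg_commands("example/raw_video.mp4", "example")) would then write the two output files next to the raw video. Keeping command construction separate from execution makes the commands easy to inspect or log before anything is run.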
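📚 Documentation
Pretrained AVSR models
Pretrained models for each language are hosted on the Hugging Face Hub under identifiers of the form nguyenvulebinh/AV-HuBERT-MuAViC-{language}, as in the usage example.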
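The usage example builds the Hub identifier from a language code. A minimal sketch of the same pattern, wrapped in a helper that can enumerate every checkpoint, might look like this. The language codes and the identifier pattern are taken from the usage example. The helper name model_id is hypothetical.

```python
# Language codes taken from the usage example in this README.
AVAILABLE_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]


def model_id(language):
    """Return the Hub repository ID that the usage example loads for a language code."""
    if language not in AVAILABLE_LANGUAGES:
        raise ValueError(
            f"Language {language!r} is not available, please choose one of {AVAILABLE_LANGUAGES}"
        )
    return f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"


if __name__ == "__main__":
    # Print one identifier per supported language code.
    for lang in AVAILABLE_LANGUAGES:
        print(f"{lang}: {model_id(lang)}")
```

The returned string can be passed as model_name_or_path to the from_pretrained calls shown in the usage example.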
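Acknowledgments
- AV-HuBERT: most of the code in this repository is adapted from the original AV-HuBERT implementation.
- MuAViC repository: we thank the creators of the MuAViC dataset and repository, who provided the pretrained models used in this project.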
📄 License
This project is licensed under CC-BY-NC 4.0.
Citation
If you use this project, please cite the following paper:
@article{anwar2023muavic,
title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
journal={arXiv preprint arXiv:2303.00628},
year={2023}
}