iwslt-asr-wav2vec-large-4500h: an open-source English speech recognition model with accurate decoding for efficient speech processing


Iwslt Asr Wav2vec Large 4500h

Developed by nguyenvulebinh

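A large English automatic speech recognition (ASR) model based on the Wav2Vec2 architecture. It is fine-tuned on 4,500 hours of speech data from multiple sources and supports decoding with a language model.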

Speech recognition

Transformers

English #multi-dataset training #high-accuracy speech recognition #language model support

Downloads: 27

Released: 3/23/2022

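Model overview

This model is an English automatic speech recognition system fine-tuned from Facebook's Wav2Vec2 architecture. It integrates a language model to improve transcription accuracy and is suited to speech-to-text tasks across a range of English accents.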

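Model features

Multi-source training data

Trained on speech datasets from 7 different sources, totaling more than 4,500 hours

Language model integration

Ships a processor with a language model, which substantially lowers the word error rate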

High-accuracy transcription

Reaches a 1.1% word error rate on the LibriSpeech test set (with the language model)
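The 1.1% and 5.4% figures on this page are word error rates (WER). WER is the word-level edit distance between the model output and a reference transcript, counted as substitutions S, deletions D, and insertions I, divided by the number of reference words N: WER = (S + D + I) / N. The sketch below is a minimal illustration with made-up sentences; real evaluations usually normalize casing and punctuation first, and this is not necessarily the exact scoring script used for the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed with a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match when cost == 0)
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the")
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")  # 0.333
```

A 1.1% WER therefore means roughly one word error per 90 reference words.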

Capabilities

English speech recognition

Speech decoding with a language model

Handling of English across multiple accents
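`Wav2Vec2ProcessorWithLM` delegates language-model decoding to the pyctcdecode beam-search decoder, which combines scores in a shallow-fusion style: score = log P_acoustic + α · log P_LM + β · (number of words), where α weights the language model and β is a word insertion bonus. The sketch below is a simplified illustration that reranks complete hypotheses; a real CTC beam search applies these terms incrementally to partial prefixes. The `Hypothesis` class, the candidate strings, their log-probabilities, and the α/β values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # hypothetical log-probability from the CTC acoustic model
    lm_logp: float        # hypothetical log-probability from the language model


def fused_score(h: Hypothesis, alpha: float, beta: float) -> float:
    """Shallow fusion: acoustic score + weighted LM score + word insertion bonus."""
    num_words = len(h.text.split())
    return h.acoustic_logp + alpha * h.lm_logp + beta * num_words


def rerank(hyps: list[Hypothesis], alpha: float = 0.5, beta: float = 1.5) -> list[Hypothesis]:
    """Sort hypotheses from best to worst fused score."""
    return sorted(hyps, key=lambda h: fused_score(h, alpha, beta), reverse=True)


# Made-up n-best list: the acoustic model slightly prefers "there's",
# while the language model strongly prefers "there are".
nbest = [
    Hypothesis("there's teams", acoustic_logp=-2.0, lm_logp=-9.0),
    Hypothesis("there are teams", acoustic_logp=-2.5, lm_logp=-6.0),
]
print(rerank(nbest)[0].text)                        # there are teams
print(rerank(nbest, alpha=0.0, beta=0.0)[0].text)   # there's teams
```

Setting α and β to zero falls back to the acoustic ranking alone, which is why the LM-free and LM-based transcripts can differ.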

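Use cases

Speech transcription

Meeting transcription

Automatically converts English meeting recordings into text

Word error rate of only 1.1% on the LibriSpeech test set

Educational content transcription

Converts English teaching videos and audio into text

Word error rate of 5.4% on TED talk data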

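🚀 Fine-tuning the Wav2Vec2 large model for English automatic speech recognition

This project fine-tunes the Wav2Vec2 large model for English automatic speech recognition (ASR) on several public datasets. It reports evaluation results and provides a usage example and license information.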

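🚀 Quick start

The example below loads the model and processor, transcribes a sample recording, and prints the transcript with and without the language model: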

from importlib.util import module_from_spec, spec_from_file_location
from huggingface_hub import hf_hub_download
from transformers import Wav2Vec2ProcessorWithLM
import torchaudio
import torch

# Load the model and processor
model_name = "nguyenvulebinh/iwslt-asr-wav2vec-large-4500h"
# The repository ships its own model_handling.py, which defines the Wav2Vec2ForCTC class used here
spec = spec_from_file_location("model_handling", hf_hub_download(model_name, filename="model_handling.py"))
model_handling = module_from_spec(spec)
spec.loader.exec_module(model_handling)
model = model_handling.Wav2Vec2ForCTC.from_pretrained(model_name).eval()
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Load the sample audio; the model expects 16 kHz input
audio, sample_rate = torchaudio.load(hf_hub_download(model_name, filename="tst_2010_sample.wav"))
if sample_rate != 16000:
    audio = torchaudio.functional.resample(audio, sample_rate, 16000)
input_data = processor.feature_extractor(audio[0].numpy(), sampling_rate=16000, return_tensors="pt")

# Inference (no gradients needed)
with torch.no_grad():
    output = model(**input_data)

# Transcription without the language model (greedy CTC decoding)
print(processor.tokenizer.decode(output.logits.argmax(dim=-1)[0].cpu().numpy()))
# and of course there's teams that have a lot more tada structures and among the best are recent graduates of kindergarten

# Transcription with the language model (beam search, beam width 100)
print(processor.decode(output.logits.cpu().numpy()[0], beam_width=100).text)
# and of course there are teams that have a lot more ta da structures and among the best are recent graduates of kindergarten
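The transcription without the language model uses greedy (best-path) CTC decoding: take the highest-scoring token in each frame, merge consecutive repeats, then drop the CTC blank. The Wav2Vec2 tokenizer performs this grouping and blank removal when decoding the argmax ids. The sketch below shows the same idea on a tiny hypothetical character vocabulary (`_` as blank, `|` as word delimiter); it is not the tokenizer's actual code, and the per-frame scores are synthetic.

```python
from itertools import groupby

BLANK = "_"       # hypothetical blank symbol (Wav2Vec2 uses its pad token as the CTC blank)
WORD_DELIM = "|"  # word delimiter, as in Wav2Vec2 character vocabularies


def greedy_ctc_decode(frame_scores: list[list[float]], vocab: list[str]) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, then drop blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    merged = [token_id for token_id, _ in groupby(best_ids)]
    chars = [vocab[i] for i in merged if vocab[i] != BLANK]
    return "".join(chars).replace(WORD_DELIM, " ").strip()


def frames_from_ids(ids: list[int], vocab_size: int) -> list[list[float]]:
    """Build synthetic per-frame scores whose argmax is the given token id."""
    return [[0.0 if k == i else -10.0 for k in range(vocab_size)] for i in ids]


vocab = [BLANK, WORD_DELIM, "t", "a", "d"]
# Frames t t a _ d a  ->  merge repeats: t a _ d a  ->  drop blank: "tada"
print(greedy_ctc_decode(frames_from_ids([2, 2, 3, 0, 4, 3], len(vocab)), vocab))  # tada
# Frames t a | | d d a  ->  "ta|da"  ->  "ta da"
print(greedy_ctc_decode(frames_from_ids([2, 3, 1, 1, 4, 4, 3], len(vocab)), vocab))  # ta da
```

Because merging happens before blanks are removed, a blank between two identical characters (`t _ t`) is what lets CTC emit a genuine double letter.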

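✨ Key features

Multi-dataset fine-tuning: the Wav2Vec2 large model is fine-tuned on several public datasets (such as Common Voice and LibriSpeech) to improve English ASR performance.
Evaluation results: word error rates (WER) are reported on the LibriSpeech and TED-LIUM datasets.
Code example: a complete usage example is provided to help you get started quickly.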

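📦 Installation

The model card does not list installation steps. Running the example requires `transformers`, `torchaudio`, and `torch`; language-model decoding with `Wav2Vec2ProcessorWithLM` additionally requires `pyctcdecode` and `kenlm`.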
