iwslt-asr-wav2vec-large-4500h: an open-source English speech recognition model with accurate decoding for efficient speech processing

Iwslt Asr Wav2vec Large 4500h

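Developed by nguyenvulebinh

A large-scale English automatic speech recognition (ASR) model based on the Wav2Vec2 architecture, fine-tuned on 4,500 hours of multi-source speech data, with support for language-model decoding.

Speech Recognition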

Transformers

English #Multi-dataset training #High-accuracy speech recognition #Language model support

Released: 3/23/2022

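Model Overview

This model is an English automatic speech recognition system fine-tuned from Facebook's Wav2Vec2 architecture. It integrates a language model to improve transcription accuracy and is suited to speech-to-text tasks across a range of English accents.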

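Model Features

Multi-source training data

Trained on 7 speech datasets from different sources, totaling more than 4,500 hours.

Language model integration

Ships a processor with a language model, which substantially lowers the word error rate.

High-accuracy transcription

Reaches a 1.1% word error rate on the LibriSpeech test set (with the language model).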

Model Capabilities

English speech recognition

Speech decoding with a language model

Handling of multiple English accents

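Use Cases

Speech transcription

Meeting minutes

Automatically convert recordings of English-language meetings into text.

Word error rate of only 1.1% on the LibriSpeech test set.

Educational content transcription

Convert English teaching videos and audio into text.

Word error rate of 5.4% on TED talk data.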
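
Meeting and lecture recordings are usually much longer than a short sample clip, so one common approach is to transcribe them window by window. The sketch below is not part of the original model card: `chunk_waveform` and its default window and overlap lengths are hypothetical, chosen only for illustration. It works on any 1-D sequence that supports `len()` and slicing (a list, a NumPy array, or a 1-D tensor).

```python
from typing import Iterator, Sequence, Tuple

SAMPLE_RATE = 16_000  # assumption: audio has already been resampled to 16 kHz


def chunk_waveform(
    samples: Sequence,
    chunk_seconds: float = 20.0,    # hypothetical default, tune for your hardware
    overlap_seconds: float = 1.0,   # hypothetical default
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Tuple[int, Sequence]]:
    """Yield (start_index, window) pairs that cover `samples` with a fixed overlap."""
    chunk_len = int(chunk_seconds * sample_rate)
    step = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and longer than the overlap")

    start = 0
    while start < len(samples):
        yield start, samples[start:start + chunk_len]
        # Stop once the current window has reached the end of the audio.
        if start + chunk_len >= len(samples):
            break
        start += step
```

Each window can then be passed through the feature extractor and model on its own. How the overlapping transcripts are merged is left open here, since the source does not describe a long-audio strategy.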

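🚀 Fine-tuning the Wav2Vec2 Large Model for English Automatic Speech Recognition

This project fine-tunes the Wav2Vec2 large model for English automatic speech recognition (ASR). The model is fine-tuned on several public datasets, and this page reports the evaluation results and provides a usage example and license information.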

🚀 Quick Start

The example code below can also be run in Colab:

import importlib.util

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import Wav2Vec2ProcessorWithLM

# Load the model and processor.
# hf_hub_download stands in for cached_path/hf_bucket_url, which recent
# transformers releases no longer provide.
model_name = "nguyenvulebinh/iwslt-asr-wav2vec-large-4500h"
handler_path = hf_hub_download(repo_id=model_name, filename="model_handling.py")
spec = importlib.util.spec_from_file_location("model_handling", handler_path)
model_handling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(model_handling)  # the repository ships its own Wav2Vec2ForCTC class
model = model_handling.Wav2Vec2ForCTC.from_pretrained(model_name)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Load the sample audio (16 kHz)
audio, sample_rate = torchaudio.load(hf_hub_download(repo_id=model_name, filename="tst_2010_sample.wav"))
input_data = processor.feature_extractor(audio[0], sampling_rate=sample_rate, return_tensors="pt")

# Inference (no gradients are needed)
with torch.no_grad():
    output = model(**input_data)

# Print the transcription without the language model (greedy decoding)
print(processor.tokenizer.decode(output.logits.argmax(dim=-1)[0].cpu().numpy()))
# and of course there's teams that have a lot more tada structures and among the best are recent graduates of kindergarten

# Print the transcription with the language model (beam search)
print(processor.decode(output.logits[0].cpu().numpy(), beam_width=100).text)
# and of course there are teams that have a lot more ta da structures and among the best are recent graduates of kindergarten
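
The no-language-model path above takes the `argmax` over the logits at every frame and lets the tokenizer turn the resulting ids into text. As a rough illustration of what that greedy CTC step does, the sketch below collapses repeated frame ids and drops the blank symbol. The toy vocabulary, the blank id of 0, and the `|` word delimiter are assumptions for illustration, not values read from this model's tokenizer.

```python
from itertools import groupby
from typing import Dict, List, Sequence

# Toy vocabulary, hypothetical: the real ids come from the model's tokenizer.
TOY_VOCAB: Dict[int, str] = {0: "<pad>", 1: "|", 2: "t", 3: "a", 4: "d"}
BLANK_ID = 0  # assumption: the padding token doubles as the CTC blank


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = BLANK_ID) -> List[int]:
    """Merge runs of identical frame ids, then remove blanks."""
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


def ids_to_text(token_ids: Sequence[int], vocab: Dict[int, str] = TOY_VOCAB) -> str:
    """Map ids to characters and turn the word delimiter into spaces."""
    return "".join(vocab[i] for i in token_ids).replace("|", " ").strip()
```

A blank between two identical ids keeps them as separate characters, which is how CTC can emit doubled letters. The language-model path instead scores many candidate sequences with beam search, which is why the two transcripts in the example differ.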

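✨ Key Features

Multi-dataset fine-tuning: the Wav2Vec2 large model is fine-tuned on several public datasets (such as Common Voice and Librispeech) to improve English ASR performance.
Evaluation results: results on the Librispeech and Tedlium datasets are provided, including word error rate (WER).
Code example: a complete usage example is provided so that users can get started quickly.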
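
For readers unfamiliar with the metric mentioned above, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The function below is a minimal sketch of that definition and is not the evaluation script behind the reported numbers. The name `word_error_rate` is hypothetical, and the sketch applies no text normalization beyond whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)
```

For instance, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. Applying it to the two sample transcripts from the usage example measures only how far the two decoding modes diverge, since neither of them is a ground-truth reference.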

