🚀 Bangla Automatic Speech Recognition (Bangla ASR)
This project provides an automatic speech recognition (ASR) model for Bangla, obtained by fine-tuning Whisper on roughly 400 hours of speech from the Bangla portion of the Mozilla Common Voice dataset. The model achieves a low word error rate and offers a practical solution for Bangla speech processing.
🚀 Quick Start
The basic steps for running speech recognition with the model:
```python
import io
import urllib.request

import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# torchaudio.load does not read URLs directly, so fetch the file into memory first.
audio_bytes = io.BytesIO(urllib.request.urlopen(mp3_path).read())
speech_array, sampling_rate = torchaudio.load(audio_bytes, format="mp3")
speech_array = speech_array[0].numpy()

# Whisper expects 16 kHz mono audio.
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
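The resampling step above matters because Whisper's feature extractor assumes 16 kHz input. As a rough illustration of what resampling does (librosa uses a much higher-quality filter than this), here is a minimal linear-interpolation sketch; `resample_linear` is a hypothetical helper for illustration, not part of this project:

```python
import numpy as np

def resample_linear(wave, orig_sr, target_sr):
    """Naive linear-interpolation resampler (librosa.resample is far higher quality)."""
    n_out = int(round(len(wave) * target_sr / orig_sr))
    # Map both signals onto the same normalized time axis and interpolate.
    x_old = np.linspace(0.0, 1.0, num=len(wave), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, wave).astype(np.float32)

# One second of a 440 Hz tone at 44.1 kHz becomes one second at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100).astype(np.float32)
out = resample_linear(tone, 44100, 16000)
print(len(out))  # 16000
```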
💻 Usage Examples
Basic usage
The code above shows how to transcribe a single audio file: point the script at an audio file and the model path, and it prints the transcription.
Advanced usage
In practice you may need to transcribe several audio files. The following example loops over a list of files:
```python
import io
import urllib.request

import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

audio_files = [
    "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3",
    "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3",
    "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3",
]

for mp3_path in audio_files:
    # Fetch the remote file into memory; torchaudio.load does not read URLs directly.
    audio_bytes = io.BytesIO(urllib.request.urlopen(mp3_path).read())
    speech_array, sampling_rate = torchaudio.load(audio_bytes, format="mp3")
    speech_array = speech_array[0].numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)
    input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
    predicted_ids = model.generate(inputs=input_features.to(device))[0]
    transcription = processor.decode(predicted_ids, skip_special_tokens=True)
    print(f"Audio file: {mp3_path}")
    print(f"Transcription: {transcription}")
    print("-" * 50)
```
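The loop above transcribes clips one at a time. Whisper's feature extractor can also take a list of waveforms and pad them to a common length internally (30 s at 16 kHz) so the model sees one batch. The padding idea itself looks roughly like this standalone NumPy sketch; `pad_batch` is a hypothetical helper for illustration, not the extractor's actual implementation:

```python
import numpy as np

def pad_batch(waveforms, target_len=None):
    """Zero-pad variable-length 1-D waveforms into one (batch, time) array."""
    if target_len is None:
        target_len = max(len(w) for w in waveforms)
    batch = np.zeros((len(waveforms), target_len), dtype=np.float32)
    for i, w in enumerate(waveforms):
        # Copy each clip into its row; longer clips would be truncated.
        batch[i, : min(len(w), target_len)] = w[:target_len]
    return batch

clips = [np.ones(8, dtype=np.float32), np.ones(5, dtype=np.float32)]
batch = pad_batch(clips)
print(batch.shape)  # (2, 8)
```

Once padded into a single array, all clips can go through `model.generate` in one forward pass, which is usually faster than a Python loop on GPU.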
📚 Documentation
Dataset
The model was trained on the Mozilla Common Voice dataset: about 400 hours of audio, with roughly 40k training samples and 7k validation samples, all in MP3 format.
For more information about the dataset, see here.
Trained model information

| Size   | Layers | Width | Heads | Parameters | Bangla-only | Trained |
|--------|--------|-------|-------|------------|-------------|---------|
| tiny   | 4      | 384   | 6     | 39 M       | No          | No      |
| base   | 6      | 512   | 8     | 74 M       | No          | No      |
| small  | 12     | 768   | 12    | 244 M      | Yes         | Yes     |
| medium | 24     | 1024  | 16    | 769 M      | No          | No      |
| large  | 32     | 1280  | 20    | 1550 M     | No          | No      |
Evaluation
The model achieves a word error rate (WER) of 4.58%.
For more evaluation details, see the GitHub repository.
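For reference, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. A minimal self-contained sketch of the metric (`wer` here is an illustrative helper; the project's own evaluation may use a dedicated library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```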
Citation

```bibtex
@misc{BanglaASR,
  title        = {Transformer Based Whisper Bangla ASR Model},
  author       = {Md Saiful Islam},
  howpublished = {},
  year         = {2023}
}
```
📄 License
This project is released under the MIT License.