# 🚀 Bangla Automatic Speech Recognition (Bangla ASR)

This project provides an automatic speech recognition (ASR) model for Bangla, trained on the Bangla subset of the Mozilla Common Voice dataset. It fine-tunes the Whisper model on roughly 400 hours of Bangla speech, achieves a low word error rate, and offers a practical solution for Bangla speech processing.
## 🚀 Quick Start

The basic steps for transcribing speech with this model:
```python
import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Path to the audio file (a URL works only if your torchaudio backend supports streaming;
# otherwise download the file locally first).
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the MP3, take the first channel, and resample to the 16 kHz rate Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform to log-Mel input features, generate token IDs, and decode them to text.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
## 💻 Usage Examples

### Basic usage

The code above transcribes a single audio file: point it at an audio file and the model path, and it prints the transcription.

### Advanced usage

In practice you may need to handle several audio files. The following example loops over a list of files:
```python
import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

audio_files = [
    "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3",
    "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3",
    "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3",
]

for mp3_path in audio_files:
    # Load each file, resample to 16 kHz, and transcribe it as in the single-file example.
    speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
    speech_array = speech_array[0].numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)
    input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
    predicted_ids = model.generate(inputs=input_features.to(device))[0]
    transcription = processor.decode(predicted_ids, skip_special_tokens=True)
    print(f"Audio file: {mp3_path}")
    print(f"Transcription: {transcription}")
    print("-" * 50)
```
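The loop above processes one clip per forward pass. Because the Whisper feature extractor pads every clip to 30 seconds of log-Mel frames, a list of waveforms can also be packed into a single batch tensor and decoded together. A minimal sketch, assuming the clips are already resampled to 16 kHz (the silent dummy waveforms are placeholders, not real data):

```python
import numpy as np
from transformers import WhisperFeatureExtractor

model_path = "bangla-speech-processing/BanglaASR"
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)

# Two dummy clips of different lengths, already at 16 kHz.
waveforms = [np.zeros(16000, dtype=np.float32), np.zeros(32000, dtype=np.float32)]

# Every clip is padded to 30 s, so a list of clips becomes one (n_clips, 80, 3000)
# tensor that model.generate can consume in a single call.
input_features = feature_extractor(waveforms, sampling_rate=16000, return_tensors="pt").input_features
print(input_features.shape)  # torch.Size([2, 80, 3000])
```

The resulting batch can then be passed to `model.generate(...)` and decoded with `processor.batch_decode(...)` to get one transcription per clip.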
## 📚 Documentation

### Dataset

The model was trained on the Mozilla Common Voice dataset, which provides roughly 400 hours of Bangla audio: about 40k training samples and 7k validation samples, all in MP3 format.

For more information about the dataset, click here.
### Model variants

| Size   | Layers | Width | Heads | Parameters | Bangla-only | Trained |
|--------|--------|-------|-------|------------|-------------|---------|
| tiny   | 4      | 384   | 6     | 39 M       | No          | No      |
| base   | 6      | 512   | 8     | 74 M       | No          | No      |
| small  | 12     | 768   | 12    | 244 M      | Yes         | Yes     |
| medium | 24     | 1024  | 16    | 769 M      | No          | No      |
| large  | 32     | 1280  | 20    | 1550 M     | No          | No      |
### Evaluation

The model achieves a word error rate (WER) of 4.58%.

For more evaluation details, see the GitHub repository.
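Word error rate is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal pure-Python implementation for spot-checking transcriptions locally (the example strings are illustrative only):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of three reference words -> WER of 1/3.
print(word_error_rate("the cat sat", "the cat sit"))  # 0.3333333333333333
```

Libraries such as `jiwer` or the `evaluate` package provide the same metric with extra normalization options, which matters when comparing against published numbers.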
### Citation

```bibtex
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```
## 📄 License

This project is released under the MIT License.