# Whisper Small Egyptian Arabic
🚀 A Whisper small model fine-tuned for Egyptian Arabic automatic speech recognition

This project contains an openai/whisper-small model fine-tuned specifically for automatic speech recognition (ASR), targeting the Egyptian Arabic dialect. The model was fine-tuned with the SpeechBrain toolkit on the MAdel121/arabic-egy-cleaned dataset.
## 🚀 Quick Start

You can use the model directly with the `transformers` automatic speech recognition pipeline. Make sure you have `transformers` and `torch` installed (`pip install transformers torch`).
```python
from transformers import pipeline
import torch

# Make sure ffmpeg is installed for audio decoding
# pip install -U ffmpeg-python  # or install ffmpeg via your system package manager

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Replace "your-username/whisper-small-egyptian-arabic" with the actual model ID on the Hub
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",  # <<< replace this
    device=device
)

# Load your audio file (requires ffmpeg)
# For a local file:
audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)  # adjust batch_size to your GPU memory

# For audio from the datasets library:
# from datasets import load_dataset
# ds = load_dataset("MAdel121/arabic-egy-cleaned", "default", split="test")  # example
# sample = ds[0]["audio"]
# result = pipe(sample.copy())  # pass a copy to avoid modifying the original data

print(result["text"])
```
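For long-form audio, the pipeline can also return segment-level timestamps via the standard `return_timestamps` flag; a brief sketch reusing `pipe` and `audio_file` from above:

```python
# Chunked long-form transcription with segment timestamps.
result = pipe(audio_file, chunk_length_s=30, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```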
```python
# --- Using the processor and model classes directly ---
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# Load the processor and model (replace with your model ID)
model_id = "your-username/whisper-small-egyptian-arabic"  # <<< replace with your model ID on Hugging Face
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Load and preprocess the audio (resample to 16 kHz if necessary)
waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)

input_features = processor(waveform.squeeze().numpy(), sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt").input_features.to(device)

# Generate the transcription
# Set forced decoder IDs for Arabic transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
> ⚠️ **Important note**
>
> The original checkpoint was saved with SpeechBrain. This README assumes the model has been converted to the standard Hugging Face Transformers format so that it can be hosted and used with the `pipeline` or `AutoModel` classes. If you are working with the original `.ckpt` file, refer to the project's main `README.md` and the `infer_whisper_local.py` script for loading instructions.
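For reference, a minimal conversion sketch. It assumes the SpeechBrain checkpoint stores the wrapped Hugging Face Whisper weights as a plain PyTorch state dict; the checkpoint path and the `model.` key prefix below are hypothetical, and the project's `infer_whisper_local.py` remains the authoritative loading recipe:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Start from the base model, then overwrite its weights with the fine-tuned ones.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Hypothetical checkpoint path; SpeechBrain typically saves a torch state dict as model.ckpt.
state_dict = torch.load("results/CKPT/model.ckpt", map_location="cpu")

# SpeechBrain wrappers often prefix the wrapped module's keys; strip a "model." prefix if present.
state_dict = {k.removeprefix("model."): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)

# Save in the standard Transformers format, together with the processor.
out_dir = "whisper-small-egyptian-arabic"
model.save_pretrained(out_dir)
WhisperProcessor.from_pretrained("openai/whisper-small").save_pretrained(out_dir)
```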
## ✨ Key Features

- Fine-tuned specifically for the Egyptian Arabic dialect, improving speech recognition quality for that dialect.
- Fine-tuned with the SpeechBrain toolkit, in combination with the Hugging Face Transformers and Accelerate frameworks.
## 📦 Installation

Make sure the following dependencies are installed:

```bash
pip install transformers torch
pip install -U ffmpeg-python
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",
    device=device
)

audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)
print(result["text"])
```
### Advanced Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# `device` and `audio_file` are reused from the basic usage example above.
model_id = "your-username/whisper-small-egyptian-arabic"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Resample to the model's expected sampling rate (16 kHz) if necessary.
waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)

input_features = processor(waveform.squeeze().numpy(), sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt").input_features.to(device)

# Force Arabic transcription during decoding.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
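With recent `transformers` releases, the language and task can also be passed to `generate` directly instead of building `forced_decoder_ids` by hand (a sketch; exact behaviour depends on your `transformers` version):

```python
# Equivalent way to force Arabic transcription on recent transformers versions.
predicted_ids = model.generate(input_features, language="ar", task="transcribe")
```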
## 📚 Documentation

### Model Description

| Property | Details |
|---|---|
| Base model | openai/whisper-small |
| Language | Arabic (`ar`) |
| Task | Transcription |
| Fine-tuning framework | SpeechBrain |
| Dataset | MAdel121/arabic-egy-cleaned |
### Intended Use and Limitations

This model is intended for transcribing speech in the Egyptian Arabic dialect.

Limitations:
- Performance may degrade significantly on other Arabic dialects.
- Performance on noisy audio may vary, since only specific augmentation techniques (DropChunk, DropFreq, DropBitResolution) were used during training.
- The model may underperform on highly specialized domains or on topics not present in the fine-tuning dataset.
訓練數據
模型在Hugging Face Hub上的**MAdel121/arabic-egy-cleaned
**數據集上進行了微調。該數據集包含埃及阿拉伯語的清理音頻樣本和相應的轉錄。
### Training Procedure

- Framework: SpeechBrain (`speechbrain==1.0.3`) with Hugging Face Transformers (`transformers==4.51.3`) and Accelerate (`accelerate==0.25.0`)
- Base model: `openai/whisper-small`
- Dataset: `MAdel121/arabic-egy-cleaned`
- Epochs: 10
- Optimizer: AdamW (`lr=1e-5`, `weight_decay=0.05`)
- Learning rate scheduler: NewBob (`improvement_threshold=0.0025`, `annealing_factor=0.9`, `patient=0`)
- Warmup steps: 1000
- Batch size: 8 (fixed, no dynamic batching)
- Gradient accumulation: 2 steps (effective batch size: 16)
- Gradient clipping: max norm 5.0
- Mixed precision: not explicitly specified; assumed to be FP32 or handled by Accelerate/Trainer
- Data augmentation: enabled (`augment_prob_master=0.5`, `min_augmentations=1`, `max_augmentations=3`), randomly applying the following techniques (see the sketch after this list):
  - DropChunk (`length: 1600 - 4800 samples`, `count: 1 - 5`)
  - DropFreq (`count: 1 - 3`)
  - DropBitResolution
- Training environment: Modal Labs (`gpu=A100-40GB`)
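A minimal sketch of how such an augmentation pipeline can be wired up with SpeechBrain's time-domain augmentations. The parameter values mirror the hyperparameters listed above; class and argument names follow `speechbrain` 1.0.x, and the wiring is illustrative rather than the project's actual training recipe:

```python
import torch
from speechbrain.augment.augmenter import Augmenter
from speechbrain.augment.time_domain import DropBitResolution, DropChunk, DropFreq

augmenter = Augmenter(
    min_augmentations=1,
    max_augmentations=3,
    augment_prob=0.5,  # corresponds to augment_prob_master above
    augmentations=[
        DropChunk(drop_length_low=1600, drop_length_high=4800,
                  drop_count_low=1, drop_count_high=5),
        DropFreq(drop_freq_count_low=1, drop_freq_count_high=3),
        DropBitResolution(),
    ],
)

waveform = torch.randn(4, 16000)  # dummy batch of 1-second clips at 16 kHz
lengths = torch.ones(4)           # relative lengths (1.0 = full length)
augmented, aug_lengths = augmenter(waveform, lengths)
```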
### Evaluation Results

The model was evaluated on the test split of the `MAdel121/arabic-egy-cleaned` dataset.

| Metric | Value (%) |
|---|---|
| Word Error Rate (WER) | 22.69 |
| Character Error Rate (CER) | 16.70 |

Lower WER and CER are better.

Validation metrics at the end of training (epoch 10):
- Validation WER: 22.79%
- Validation CER: 16.76%
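For reference, a minimal sketch of how WER/CER figures like these can be computed with the Hugging Face `evaluate` library (an assumed choice; the reference/prediction strings are dummy examples, and any WER/CER implementation would do):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Dummy pairs; in practice, references come from the test split and
# predictions from the model's transcriptions.
references = ["السلام عليكم ورحمة الله"]
predictions = ["السلام عليكم ورحمه الله"]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
```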
### Citation

If you use this model, please consider citing the original Whisper paper and the dataset used:
```bibtex
@article{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2023}
}

@misc{adel_mohamed_2024_12860997,
  author    = {Adel Mohamed},
  title     = {MAdel121/arabic-egy-cleaned},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.12860997},
  url       = {https://doi.org/10.5281/zenodo.12860997}
}

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```
## 🔧 Technical Details

This model was fine-tuned from `openai/whisper-small` using the SpeechBrain toolkit together with the Hugging Face Transformers and Accelerate frameworks. Training used the AdamW optimizer with a NewBob learning rate scheduler, and data augmentation was applied to improve the model's robustness.
## 📄 License

This project is released under the MIT License.

## Model Card Authors

[Your Name/Organization]

(Based on training run `ceeu3g6c`)



