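Model overview

Whisper models perform speech recognition and speech translation across many languages and domains, and can be applied without fine-tuning.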

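Model features

Strong generalization

The model generalizes across datasets and domains and adapts to many tasks without fine-tuning.

Multiple pretrained sizes

Checkpoints are provided in several sizes to suit different application needs.

Chunked audio processing

The model itself handles audio segments of up to 30 seconds; a chunking algorithm extends this to transcription of audio of arbitrary length.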

Model capabilities

Automatic speech recognition

Speech translation (multilingual checkpoints only)

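Use cases

Speech transcription

Meeting notes

Transcribes meeting audio to text so it can be reviewed and shared afterwards.

Reduces the time spent taking notes and makes information easier to retrieve.

Voice assistants

Provides accurate speech recognition for voice assistants.

Improves the user experience and the assistant's responsiveness.

Speech translation

Multilingual meeting translation

Translates meeting audio in several languages into English text; this requires a multilingual checkpoint.

Helps attendees follow speakers of other languages.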

Language: en

Tags:
audio
automatic-speech-recognition
hf-asr-leaderboard

Widget examples:
Title: Librispeech sample 1, audio source: https://cdn-media.huggingface.co/speech_samples/sample1.flac
Title: Librispeech sample 2, audio source: https://cdn-media.huggingface.co/speech_samples/sample2.flac

Model index:
Name: whisper-base.en
Results:
- Task: automatic speech recognition (type: automatic-speech-recognition). Dataset: LibriSpeech (clean) (type: librispeech_asr, config: clean, split: test, language: en). Metric:
  - Name: Test WER, type: wer, value:
- Task: automatic speech recognition (type: automatic-speech-recognition). Dataset: LibriSpeech (other) (type: librispeech_asr, config: other, split: test, language: en). Metric:
  - Name: Test WER, type: wer, value: 12.803978669490565

Pipeline tag: automatic-speech-recognition
License: apache-2.0

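Whisper speech model

Whisper is a pretrained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labelled data, the Whisper models generalize to many datasets and domains without fine-tuning.

The model was proposed by Alec Radford et al. of OpenAI in the paper "Robust Speech Recognition via Large-Scale Weak Supervision". The original code is available in the GitHub repository.

Disclaimer: parts of this model card were written by the Hugging Face team, and parts were copied from the original model card.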

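Model details

Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model trained on 680,000 hours of speech labelled through large-scale weak supervision. Checkpoints come in English-only and multilingual variants: the English-only models support speech recognition only, while the multilingual models support both speech recognition and speech translation.

Pretrained weights are available in five sizes. The four smallest come in both English-only and multilingual versions; the largest is multilingual only and appears in the table as large and large-v2. All checkpoints are available on the Hugging Face Hub:

Size	Parameters	English-only	Multilingual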
tiny	39 M	✓	✓
base	74 M	✓	✓
small	244 M	✓	✓
medium	769 M	✓	✓
large	1550 M	x	✓
large-v2	1550 M	x	✓
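
The table can be turned into a small lookup for picking a checkpoint. The sketch below is illustrative only: `checkpoint_id` is a hypothetical helper, and the `openai/whisper-<size>` naming pattern is an assumption extrapolated from the `openai/whisper-base.en` identifier used elsewhere in this card.

```python
# Parameter counts in millions, as listed in the table.
PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
    "large-v2": 1550,
}

# Sizes for which the table lists an English-only checkpoint.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def checkpoint_id(size: str, english_only: bool = False) -> str:
    """Build a Hub repository id (hypothetical helper, assumed naming pattern)."""
    if size not in PARAMS_M:
        raise ValueError(f"unknown size: {size!r}")
    if english_only and size not in ENGLISH_ONLY_SIZES:
        raise ValueError(f"{size} has no English-only checkpoint")
    suffix = ".en" if english_only else ""
    return f"openai/whisper-{size}{suffix}"


print(checkpoint_id("base", english_only=True))  # openai/whisper-base.en
```

Rejecting an English-only request for the largest size mirrors the "x" entries in the table rather than failing later at download time.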

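Usage

This checkpoint is English-only. It is used together with WhisperProcessor, which handles:

Audio pre-processing (conversion to log-Mel spectrograms)
Output post-processing (conversion of tokens to text)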
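
To give a feel for what the pre-processing step produces, here is a minimal sketch of the input bookkeeping only; it does not compute a spectrogram and is not the WhisperProcessor implementation. The 30-second window comes from this card. The 16 kHz sample rate, 10 ms hop and 80 mel bins are assumptions taken from the Whisper paper, and `pad_or_trim` and `feature_shape` are hypothetical helper names.

```python
SAMPLE_RATE = 16_000   # Hz; assumption taken from the Whisper paper
CHUNK_SECONDS = 30     # maximum segment length stated in this card
HOP_SAMPLES = 160      # 10 ms hop at 16 kHz; assumption from the paper
N_MELS = 80            # number of mel bins; assumption from the paper


def pad_or_trim(samples: list, length: int = SAMPLE_RATE * CHUNK_SECONDS) -> list:
    """Zero-pad or truncate a mono waveform to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def feature_shape(num_samples: int) -> tuple:
    """Return (mel bins, frames) for a waveform of `num_samples` samples."""
    return N_MELS, num_samples // HOP_SAMPLES


audio = [0.1] * (SAMPLE_RATE * 5)  # five seconds of dummy audio
padded = pad_or_trim(audio)
print(len(padded), feature_shape(len(padded)))  # 480000 (80, 3000)
```

Under these assumptions, every input is brought to the same fixed length before feature extraction, which is why short clips and 30-second clips yield features of the same shape.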

Transcription example

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # Load the model and the processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")

>>> # Load a sample and convert it to input features
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # Decode the token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
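
The metadata of this card reports word error rate (WER) on LibriSpeech. As a rough illustration of what that metric measures, the following is a minimal sketch that computes WER as word-level edit distance divided by the number of reference words. It is not the evaluation script behind the reported figures: real evaluations typically normalize punctuation and casing first, whereas this sketch only lowercases and splits on whitespace. `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (sub/ins/del) divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```

Because the comparison is on raw tokens, "classes," and "classes" would count as different words here, which is one reason text normalization matters when comparing WER figures.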

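Long-form audio

Setting chunk_length_s=30 enables chunked processing, which allows audio of arbitrary length to be transcribed. Passing return_timestamps=True additionally returns timestamps:

>>> import torch
>>> from transformers import pipeline

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="openai/whisper-base.en",
...     chunk_length_s=30,
...     device=device,
... )
>>> # `sample` is the audio loaded in the transcription example
>>> prediction = pipe(sample, batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle...', 'timestamp': (0.0, 5.44)}]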
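
The idea behind chunking can be sketched as splitting a long recording into overlapping 30-second windows that are transcribed separately and then stitched together. The function below only computes window boundaries and is a simplified illustration, not the pipeline's actual algorithm; `chunk_bounds` and its `overlap_s` parameter are hypothetical.

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0) -> list:
    """Return (start, end) times in seconds for overlapping windows over the audio."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")

    step = chunk_s - overlap_s  # how far each window advances
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


print(chunk_bounds(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap gives neighbouring windows shared context, so a word cut off at the end of one window can be recovered from the next.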

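Training data

The model was trained on 680,000 hours of speech data collected from the internet, of which:

Roughly 65% (438,000 hours) is English audio with matching English transcripts
Roughly 18% (126,000 hours) is non-English audio paired with English transcripts
Roughly 17% (117,000 hours) is non-English audio with matching transcripts, covering 98 languages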
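
The percentages above are rounded. A quick check of the arithmetic, using only the hour counts stated in this card (the short category labels are ours):

```python
TOTAL_HOURS = 680_000  # total stated in this card

# Hours per category as stated in this card.
HOURS = {
    "english": 438_000,
    "non_english_to_english": 126_000,
    "non_english_same_language": 117_000,
}


def percentage_shares(hours: dict, total: int) -> dict:
    """Return each category's share of `total` in percent, rounded to one decimal."""
    return {name: round(100 * h / total, 1) for name, h in hours.items()}


print(percentage_shares(HOURS, TOTAL_HOURS))
# {'english': 64.4, 'non_english_to_english': 18.5, 'non_english_same_language': 17.2}
print(sum(HOURS.values()))  # 681000
```

The three categories sum to 681,000 hours, slightly above the stated total, which suggests the per-category hour counts are themselves rounded.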

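Limitations and caveats

May produce hallucinated text that is not present in the audio
Lower accuracy on low-resource languages
Performance varies across accents and dialects
The sequence-to-sequence architecture can lead to repetitive text
Recording individuals' speech without their consent raises ethical concerns
Not recommended for high-stakes decision-making

See the original paper for full technical details.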

Citation

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}