Whisper Tiny.en

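Developed by OpenAI

Whisper is a pre-trained automatic speech recognition (ASR) model. It was trained on 680k hours of labelled data and generalises well to new datasets and domains.

Speech recognition, English, License: Apache-2.0 #EnglishSpeechRecognition #ZeroShotLearning #HighRobustness

Downloads: 145.30k

Released: 9/26/2022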

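Model Overview

Whisper is a Transformer-based encoder-decoder model. This checkpoint, tiny.en, is trained for English speech recognition only.

Model Features

Large-scale training

Trained on 680k hours of labelled speech data, which gives it strong generalisation.

No fine-tuning required

Can be applied directly to many datasets and domains without fine-tuning.

Robustness

Holds up well against accents, background noise, and technical language.

Model Capabilities

English speech recognition

Long-form audio transcription (via chunking)

Use Cases

Speech transcription

Meeting notes

Automatically transcribe meeting recordings into written records.

Podcast transcription

Convert English podcast audio to text.

Assistive tools

Hearing assistance

Provide near-real-time speech-to-text for deaf and hard-of-hearing users.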

Language: English
Tags: audio, automatic-speech-recognition, hf-asr-leaderboard
Widget examples:
- Librispeech sample 1: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- Librispeech sample 2: https://cdn-media.huggingface.co/speech_samples/sample2.flac
Model index: whisper-tiny.en
Results:
- Task: Automatic Speech Recognition (automatic-speech-recognition). Dataset: LibriSpeech (clean), type librispeech_asr, config clean, split test, language en. Test WER: 8.4372112320138
- Task: Automatic Speech Recognition (automatic-speech-recognition). Dataset: LibriSpeech (other), type librispeech_asr, config other, split test, language en. Test WER: 14.857607503498355
Pipeline tag: automatic-speech-recognition
License: apache-2.0

Whisper

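Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models generalise well to many datasets and domains without fine-tuning.

Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code is available in OpenAI's Whisper repository.

Disclaimer: Parts of this model card were written by the Hugging Face team, and parts were copied from the original model card.

Model Details

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of speech data labelled through large-scale weak supervision.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the speech recognition task alone. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio; for speech translation, it predicts a transcription in a different language from the audio.

Whisper checkpoints come in five configurations of varying model sizes. The four smallest are available in English-only and multilingual versions; the largest checkpoints are multilingual only. All pre-trained checkpoints are available on the Hugging Face Hub and are summarised in the following table:

Size	Parameters	English-only	Multilingual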
tiny	39 M	✓	✓
base	74 M	✓	✓
small	244 M	✓	✓
medium	769 M	✓	✓
large	1550 M	x	✓
large-v2	1550 M	x	✓
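
For a rough sense of what the parameter counts in the table mean in practice, the sketch below estimates the memory needed just to hold the weights. The 4-byte (float32) and 2-byte (float16) sizes are assumptions about how a checkpoint is loaded, and activations and framework overhead are ignored.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}


def weight_memory_mb(params_millions, bytes_per_param=4):
    """Approximate weight storage in MB (10**6 bytes); weights only."""
    # (params_millions * 10**6 params) * bytes_per_param / 10**6 bytes per MB
    return params_millions * bytes_per_param


for name, params in PARAMS_M.items():
    fp32 = weight_memory_mb(params, 4)
    fp16 = weight_memory_mb(params, 2)
    print(f"{name:>6}: ~{fp32} MB in float32, ~{fp16} MB in float16")
```

By this estimate the tiny checkpoint needs roughly 156 MB for float32 weights, which is why it is the usual choice for constrained hardware.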

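Usage

This checkpoint is an English-only model, so it is suitable for English speech recognition. Multilingual speech recognition or speech translation requires a multilingual checkpoint.

To transcribe audio samples, the model has to be used together with a WhisperProcessor, which is used to:

Pre-process the audio inputs (converting them to log-Mel spectrograms)
Post-process the model outputs (converting them from tokens to text)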
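
To make the pre-processing step more concrete, here is a minimal sketch of the bookkeeping around the spectrogram: every clip is padded or trimmed to a fixed window, and the window length determines the feature shape. The constants (16 kHz sampling, 30-second window, 160-sample hop, 80 Mel bins) are assumed defaults that this card does not state, so check them against the checkpoint's preprocessor configuration. The helper names are hypothetical and the spectrogram itself is not computed.

```python
# Assumed defaults of the Whisper feature extractor (not stated in this card).
SAMPLE_RATE = 16_000                      # samples per second
CHUNK_SECONDS = 30                        # fixed input window in seconds
HOP_LENGTH = 160                          # samples between frames (10 ms)
N_MELS = 80                               # Mel frequency bins
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # samples in one padded window


def pad_or_trim(samples, target_len=N_SAMPLES):
    """Zero-pad or cut a list of samples to exactly target_len items."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


def feature_shape(num_samples=N_SAMPLES):
    """Return (mel bins, frames) for a window of num_samples samples."""
    return (N_MELS, num_samples // HOP_LENGTH)


five_seconds = [0.0] * (5 * SAMPLE_RATE)   # a silent 5-second clip
window = pad_or_trim(five_seconds)
print(len(window), feature_shape(len(window)))
```

Under these assumptions a 5-second clip and a 30-second clip produce features of the same shape, which is why longer recordings need the chunking described under long-form transcription.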

Transcription

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

>>> # load dummy dataset and read audio files
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']

Setting skip_special_tokens=True strips the special context tokens from the decoded transcription.

Evaluation

The following code shows how to evaluate Whisper tiny.en on LibriSpeech test-clean:

>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> # load the LibriSpeech test-clean split
>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en").to("cuda")

>>> def map_to_pred(batch):
...     audio = batch["audio"]
...     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
...     # normalise the reference the same way as the prediction
...     batch["reference"] = processor.tokenizer._normalize(batch['text'])
...     with torch.no_grad():
...         predicted_ids = model.generate(input_features.to("cuda"))[0]
...     transcription = processor.decode(predicted_ids)
...     batch["prediction"] = processor.tokenizer._normalize(transcription)
...     return batch

>>> result = librispeech_test_clean.map(map_to_pred)
>>> wer = load("wer")
>>> # word error rate as a percentage
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
5.655609406528749
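
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The function below is a minimal pure-Python sketch of that definition for a single pair of strings. It is not the `evaluate` implementation, which aggregates errors over the whole corpus, and the name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "we are glad to welcome his gospel"
hypothesis = "we are glad to welcome this gospel"
print(100 * word_error_rate(reference, hypothesis))  # one substitution in seven words
```

Both strings should go through the same text normalisation before scoring, as the evaluation code above does, otherwise casing and punctuation differences count as errors.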

Long-Form Transcription

Whisper natively handles audio samples of up to 30 seconds. With a chunking algorithm it can transcribe audio of arbitrary length. Chunking is enabled by setting chunk_length_s=30 when instantiating the Transformers pipeline. With chunking enabled, the pipeline can run batched inference, and it can also return timestamps when return_timestamps=True is passed:

>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...   "automatic-speech-recognition",
...   model="openai/whisper-tiny.en",
...   chunk_length_s=30,
...   device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]

See the blog post ASR Chunking for more details on the chunking algorithm.
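
The idea behind chunking can be sketched as overlapping windows: each window is at most 30 seconds long, and neighbouring windows share some audio so that words cut at a boundary appear whole in at least one of them. The function below only computes window boundaries. The 5-second stride on each side is an assumption chosen for illustration, `chunk_bounds` is a hypothetical name, and the pipeline's actual splitting and merging logic is more involved.

```python
def chunk_bounds(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds of overlapping windows."""
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    if total_s <= 0:
        return []

    # Each window advances by the chunk length minus the overlap kept
    # on its left and right edges.
    step = chunk_s - 2 * stride_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


for start, end in chunk_bounds(70.0):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

For a 70-second recording this gives three windows starting at 0, 20 and 40 seconds, so every boundary region is covered twice and the overlapping transcriptions can be reconciled afterwards.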

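Fine-Tuning

Although the pre-trained model already generalises well, fine-tuning can further improve its performance on specific languages and tasks. The blog post Fine-Tune Whisper with 🤗 Transformers provides a guide to fine-tuning with as little as 5 hours of labelled data.

Evaluated Use

The primary intended users are AI researchers studying model robustness, generalisation, and limitations. Whisper can also serve developers as an English speech recognition solution. The models show strong ASR results in about 10 languages, but tasks such as voice activity detection and speaker classification require additional evaluation.

In particular, do not transcribe recordings of individuals without their consent, and do not use the model for subjective classification. Use in high-risk domains such as decision-making contexts is not recommended, because accuracy flaws can lead to serious consequences. The model is designed only for speech transcription and translation; it is not suitable for inferring human attributes.

Training Data

The models are trained on 680,000 hours of audio and corresponding transcripts collected from the internet. Of this, 65% (438,000 hours) is English audio with English transcripts, 18% (126,000 hours) is non-English audio with English transcripts, and 17% (117,000 hours) is non-English audio with transcripts in the same language, covering 98 languages. As discussed in the paper, transcription performance in a given language correlates directly with the amount of training data in that language.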

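Performance and Limitations

Compared with existing ASR systems, the models are more robust to accents, background noise, and technical language, and their zero-shot translation is close to the state of the art. Because they were trained with weak supervision, however, they may hallucinate, that is, output text that was never spoken in the audio. Performance varies considerably across languages, with lower accuracy on low-resource languages, and it also differs across accents and dialects. Full evaluation results are given in the paper.

The sequence-to-sequence architecture also makes the models prone to generating repetitive text. Beam search and temperature scheduling mitigate this but do not eliminate it. Hallucination may be worse on low-resource languages.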
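
One way temperature scheduling can catch repetitive output is sketched below. Repetitive text compresses unusually well, so the loop re-decodes at a higher temperature whenever the compression ratio of the result exceeds a threshold. The temperature ladder and the 2.4 threshold are assumptions taken from my understanding of the reference implementation, and `decode_fn` is a hypothetical stand-in for whatever function runs the model at a given temperature.

```python
import zlib


def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode_fn,
                         temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         max_ratio=2.4):
    """Retry decoding at rising temperatures until the text looks non-repetitive."""
    text = ""
    for temperature in temperatures:
        text = decode_fn(temperature)
        if compression_ratio(text) <= max_ratio:
            break  # accept the first result that is not suspiciously compressible
    return text


def fake_decode(temperature):
    """Hypothetical decoder: loops at temperature 0, behaves otherwise."""
    if temperature == 0.0:
        return "thank you " * 40
    return "Mr. Quilter is the apostle of the middle classes."


print(decode_with_fallback(fake_decode))
```

If every temperature produces repetitive text, the sketch simply returns the last attempt, which matches the point above that these techniques reduce repetition without eliminating it.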

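Broader Implications

Whisper is expected to be useful for improving accessibility tools. Although the models do not support real-time transcription out of the box, their speed and size allow developers to build near-real-time applications on top of them. Disparities in performance may therefore have real economic implications.

There are also dual-use risks. On the one hand, the models could enable surveillance technology by automatically transcribing large volumes of audio. On the other, they may have some ability to identify specific speakers, which raises safety and fairness concerns. In practice, the cost of transcription is usually not the main limiting factor in scaling up surveillance projects.

BibTeX entry and citation info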

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}