
Whisper Small.en

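Developed by OpenAI

Whisper is a pre-trained automatic speech recognition (ASR) model. Trained on 680,000 hours of labelled data, it generalises well to new datasets and domains.

Speech recognition, English, License: Apache-2.0 #English-speech-recognition #Zero-shot-learning #Long-audio-chunking

Downloads: 20.50k

Released: 9/26/2022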

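Model Overview

A Transformer-based encoder-decoder model built specifically for English speech recognition. It adapts to many scenarios without fine-tuning.

Model Features

Large-scale pre-training

Trained on 680,000 hours of labelled speech data covering a wide range of speech scenarios

Zero-shot generalisation

Adapts to many datasets and domains without fine-tuning

Robust speech recognition

Robust to accents, background noise, and technical language

Model Capabilities

English speech recognition

Long-form audio transcription (via chunking)

Voice activity detection

Use Cases

Accessibility tools

Near-real-time caption generation

Provides speech-to-text for users with hearing impairments

Speech analysis

Meeting transcription

Automatically transcribes meeting recordings into text records

WER of 3.05% on the LibriSpeech test-clean set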

language:
- en
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: whisper-small.en
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value:
pipeline_tag: automatic-speech-recognition
license: apache-2.0

Whisper

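Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labelled data, Whisper models generalise to many datasets and domains without fine-tuning.

Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code is in the openai/whisper repository on GitHub.

Disclaimer: Parts of this model card were written by the Hugging Face team, and parts were copied from the original model card.

Model Details

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. It was trained on 680,000 hours of speech data labelled using large-scale weak supervision.

The models were trained on either English-only data or multilingual data. The English-only models were trained on speech recognition alone. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts text in the same language as the audio. For speech translation, it predicts text in a different language from the audio.

Whisper checkpoints come in five configurations of increasing model size. The four smallest are available in English-only and multilingual versions. The largest is multilingual only. All pre-trained checkpoints are available on the Hugging Face Hub, as summarised in the table below: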

Size	Parameters	English-only	Multilingual
tiny	39 M	✓	✓
base	74 M	✓	✓
small	244 M	✓	✓
medium	769 M	✓	✓
large	1550 M	x	✓
large-v2	1550 M	x	✓
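
The parameter counts in the table give a rough idea of how much memory each checkpoint's weights need. The sketch below is a back-of-the-envelope estimate only: it assumes 4 bytes per parameter for float32 and 2 bytes for float16, counts weights alone (no activations, decoding cache, or framework overhead), and the helper name `weight_size_mb` is hypothetical.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}


def weight_size_mb(params_millions: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights alone, in MB (1 MB = 10**6 bytes).

    Assumption: every parameter is stored at the same precision.
    """
    # (millions of params * 1e6) * bytes / 1e6 bytes-per-MB simplifies to this product.
    return params_millions * bytes_per_param


for name, millions in PARAMS_MILLIONS.items():
    fp32 = weight_size_mb(millions, 4)
    fp16 = weight_size_mb(millions, 2)
    print(f"{name:>6}: ~{fp32:.0f} MB (float32), ~{fp16:.0f} MB (float16)")
```

Actual memory use at inference time is higher than these figures, so treat them as a lower bound when choosing a checkpoint for a given device.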

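Usage

This checkpoint is an English-only model, so it can be used for English speech recognition. For multilingual recognition or speech translation, use a multilingual checkpoint.

Transcribing audio requires a WhisperProcessor, which handles:

Pre-processing the audio input (converting it to a log-Mel spectrogram)
Post-processing the model output (converting tokens to text)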
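
To make the pre-processing step more concrete, the sketch below works out the shape of the log-Mel input the model receives. The constants are assumptions taken from the default Whisper feature-extractor configuration (16 kHz audio, a 30-second window, a hop of 160 samples, 80 Mel bins) and are not stated in this card. The helper names are hypothetical.

```python
# Assumed defaults of the Whisper feature extractor; not stated in this card.
SAMPLING_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30       # audio is padded or truncated to this length
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # Mel frequency bins


def padded_num_samples() -> int:
    """Number of samples after padding/truncating to the 30-second window."""
    return SAMPLING_RATE * CHUNK_SECONDS


def input_features_shape(batch_size: int = 1) -> tuple[int, int, int]:
    """Expected (batch, mel_bins, frames) shape of `input_features`."""
    n_frames = padded_num_samples() // HOP_LENGTH
    return (batch_size, N_MELS, n_frames)


print(padded_num_samples())    # 480000
print(input_features_shape())  # (1, 80, 3000)
```

Under these assumptions every input has the same shape regardless of the clip's real duration, which is why shorter clips are padded and longer ones need chunking.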

Transcription

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

>>> # Load a sample from the dummy dataset
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']

Setting skip_special_tokens=True removes the special context tokens from the transcribed text.
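
As an illustration of the effect on the decoded string, the sketch below strips anything of the form `<|...|>` with a regular expression. This is not how the tokenizer implements `skip_special_tokens` (it filters token IDs before decoding). The pattern and the helper name `strip_special_tokens` are assumptions made for the example.

```python
import re

# Assumption: special tokens appear in decoded text as <|name|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|> markers from an already-decoded transcription."""
    return SPECIAL_TOKEN.sub("", text)


raw = (
    "<|startoftranscript|><|notimestamps|> Mr. Quilter is the apostle of the "
    "middle classes, and we are glad to welcome his gospel.<|endoftext|>"
)
print(repr(strip_special_tokens(raw)))
```

The leading space in the result is kept on purpose, matching the decoded output shown above.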

Evaluation

The following code shows how to evaluate Whisper small.en on LibriSpeech test-clean:

>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en").to("cuda")

>>> def map_to_pred(batch):
>>>     audio = batch["audio"]
>>>     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
>>>     batch["reference"] = processor.tokenizer._normalize(batch['text'])
>>> 
>>>     with torch.no_grad():
>>>         predicted_ids = model.generate(input_features.to("cuda"))[0]
>>>     transcription = processor.decode(predicted_ids)
>>>     batch["prediction"] = processor.tokenizer._normalize(transcription)
>>>     return batch

>>> result = librispeech_test_clean.map(map_to_pred)

>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
3.053161596922323
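
For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate: the word-level edit distance (substitutions, deletions, insertions) summed over all utterances and divided by the total number of reference words. It is a plain-Python illustration, not the code behind `load("wer")`, and the function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references: list[str], predictions: list[str]) -> float:
    """Total word errors divided by total reference words, over all pairs."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
print(100 * word_error_rate(refs, hyps))  # 25.0
```

Because the score depends on exact word matches, both references and predictions are normalised before scoring in the evaluation code above.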

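Long-Form Transcription

Whisper natively handles audio of up to 30 seconds, but a chunking algorithm lets it transcribe audio of arbitrary length. To enable chunking, build a Transformers pipeline with chunk_length_s=30. The pipeline also supports batched inference and timestamp prediction: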

>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
>>>   "automatic-speech-recognition",
>>>   model="openai/whisper-small.en",
>>>   chunk_length_s=30,
>>>   device=device,
>>> )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # Return timestamps along with the text
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
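
The `chunks` list above maps naturally onto subtitle formats. As a minimal sketch, the code below turns a list shaped like that output into SubRip (SRT) text. It assumes every chunk has numeric start and end times (an end time of `None` is not handled), and the helper names are hypothetical.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: list[dict]) -> str:
    """Convert [{'text': ..., 'timestamp': (start, end)}, ...] to SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]  # assumed to be two numbers
        blocks.append(
            f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues


chunks = [{
    "text": " Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.",
    "timestamp": (0.0, 5.44),
}]
print(chunks_to_srt(chunks))
```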

For details of the chunking algorithm, see the blog post ASR Chunking.
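
The general idea of chunking can be sketched as splitting the audio into fixed-length windows that overlap their neighbours, so that words cut at a boundary appear whole in at least one window. The code below only computes the window boundaries. It is a simplified illustration, not the pipeline's implementation: the 5-second overlap on each side is an assumed value, the function name `chunk_spans` is hypothetical, and merging the overlapping transcripts is not shown.

```python
def chunk_spans(
    total_s: float,
    chunk_length_s: float = 30.0,
    stride_s: float = 5.0,  # assumed overlap kept on each side of a chunk
) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping chunks."""
    if not 0 <= 2 * stride_s < chunk_length_s:
        raise ValueError("stride_s must be non-negative and under half the chunk length")
    step = chunk_length_s - 2 * stride_s  # how far each new chunk advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        spans.append((start, end))
        if end >= total_s:  # the last chunk reaches the end of the audio
            break
        start += step
    return spans


print(chunk_spans(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

With these values, consecutive windows share 10 seconds of audio. A larger overlap gives the merge step more context at the cost of more computation.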

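Fine-Tuning

The pre-trained model already generalises well, but fine-tuning can further improve its performance on specific languages and tasks. The blog post Fine-Tune Whisper with 🤗 Transformers is a step-by-step guide to fine-tuning with as little as 5 hours of labelled data.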

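Evaluated Use

The primary intended users are AI researchers studying the model's robustness, generalisation, and limitations. The model is also useful to developers who need English speech recognition. Note that once the model is released, its use cannot be restricted to the intended scenarios.

The model is primarily evaluated on ASR and speech translation to English, and performs well in roughly 10 languages. It may gain additional capabilities, such as speaker classification, through fine-tuning, but these have not been rigorously evaluated. Evaluate the model in your specific domain before deploying it.

Special caution: do not transcribe recordings of individuals without their consent, and do not use the model for subjective classification. Use in high-risk domains, such as decision-making contexts, is not recommended. The model is designed only for speech transcription and translation and is not suited to inferring human attributes.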

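Training Data

The model was trained on 680,000 hours of audio and corresponding transcripts collected from the internet, of which:

65% (438,000 hours) is English audio with matching English transcripts
18% (126,000 hours) is non-English audio with English transcripts
17% (117,000 hours) is non-English audio with transcripts in the same language, covering 98 languages

As discussed in the paper, transcription performance in a given language correlates directly with the amount of training data in that language.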

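Performance and Limitations

Compared with many existing ASR systems, the model is more robust to accents, background noise, and technical language, and its zero-shot translation ability approaches the state of the art. Because it was trained with weak supervision, however, its output may include text that was never spoken in the audio (hallucination). This is presumed to happen because the model draws on its general knowledge of language while predicting the content of the audio.

Performance is uneven across languages, with lower accuracy on low-resource languages. Within a single language, performance also varies across accents and dialects, which can show up as differences between speakers of different genders, races, and ages. The full evaluation is in the paper.

The sequence-to-sequence architecture makes the model prone to generating repetitive text. Beam search mitigates this but does not eliminate it, and the problem may be worse in low-resource languages.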

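Broader Implications

The model is expected to be useful for improving accessibility tools. It does not natively support real-time transcription, but its speed and size make near-real-time applications feasible. Disparities in performance may have real economic implications.

The technology is also dual-use: wider availability may lower the cost of surveillance and make large-scale audio monitoring easier to deploy. The model may also have some ability to recognise specific speakers, which raises safety concerns. In practice, however, transcription cost is not the main limiting factor in scaling up surveillance projects.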

BibTeX Entry and Citation Info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}