Whisper-base Open-Source Speech Model: Free, Accurate Speech Recognition and Translation

Whisper Base

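Developed by OpenAI

Whisper is a pretrained model for automatic speech recognition (ASR) and speech translation. It was trained on 680,000 hours of labeled data and generalizes well.

Task: speech recognition | Languages: multilingual | License: Apache-2.0 | Tags: #MultilingualSpeechRecognition #ZeroShotTranslation #LargeScaleWeakSupervision

Downloads: 491.35k

Release date: 9/26/2022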

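Model Overview

Whisper is a Transformer-based encoder-decoder model. It supports speech recognition and translation across many languages and adapts to different datasets and domains without fine-tuning.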

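Model Features

Large-scale pretraining

Trained on 680,000 hours of labeled speech data, which gives it strong generalization.

Multilingual support

Supports speech recognition and translation tasks in 99 languages.

Zero-shot learning

Adapts to different datasets and domains without fine-tuning.

Multi-task

Supports two task modes: speech recognition and speech translation.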

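Model Capabilities

English speech recognition

Multilingual speech recognition

Cross-lingual speech translation

Audio transcription

Speech-to-text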

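Use Cases

Speech transcription

Meeting notes

Automatically transcribe meeting recordings into written records.

WER of 5.01 on the LibriSpeech test-clean set.

Podcast transcription

Convert podcast content into searchable text.

Speech translation

Real-time translation

Translate speech in a supported language into English text in real time.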

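🚀 Whisper Speech Recognition Model

Whisper is a pretrained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, it generalizes well to many datasets and domains without fine-tuning.

🚀 Quick Start

Whisper can be used for speech recognition and speech translation. To transcribe an audio sample, use the model together with a WhisperProcessor, which pre-processes the audio input and post-processes the model output.

✨ Key Features

Multilingual support: covers many languages, including English, Chinese, German, and Spanish.
Strong generalization: trained on 680,000 hours of labeled data, it performs well across many datasets and domains without fine-tuning.
Flexible tasks: performs both speech recognition and speech translation.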

📦 Installation

The source documentation does not provide installation steps, so this section is skipped.

💻 Usage Examples

Basic Usage

The following example uses the Whisper model for English speech recognition:

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
>>> model.config.forced_decoder_ids = None

>>> # Load a dummy dataset and read an audio sample
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
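
To make the difference between the two decoded outputs above concrete, the sketch below removes `<|...|>` markers from a decoded string with a regular expression. It is only an approximation for illustration: the helper name `strip_special_tokens` is hypothetical, it is not how the tokenizer implements `skip_special_tokens=True`, and unlike the tokenizer output it also trims the leading space.

```python
import re

# Matches Whisper-style special tokens such as <|en|> or <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text: str) -> str:
    """Remove <|...|> markers and surrounding whitespace (hypothetical helper)."""
    return SPECIAL_TOKEN.sub("", text).strip()


raw = (
    "<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> "
    "Mr. Quilter is the apostle of the middle classes and we are glad "
    "to welcome his gospel.<|endoftext|>"
)
print(strip_special_tokens(raw))
```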

Advanced Usage

French-to-English speech translation

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

>>> # Load a streaming dataset and read the first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']

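📚 Detailed Documentation

Model Information

Property	Details
Model type	Transformer-based encoder-decoder model, also known as a sequence-to-sequence model
Training data	The model was trained on 680,000 hours of audio and corresponding transcripts collected from the internet. About 65% (438,000 hours) is English audio with matched English transcripts, roughly 18% (126,000 hours) is non-English audio with English transcripts, and the remaining 17% (117,000 hours) is non-English audio with transcripts in the same language. The non-English data covers 98 languages.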
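
As a quick consistency check on the training-data figures above, the short calculation below recomputes the shares from the quoted hour counts. The category labels are paraphrases chosen for this sketch. The three counts sum to 681,000 hours, which the text rounds to 680,000, and the recomputed shares match the quoted percentages after rounding.

```python
# Hours per data category, in thousands, as quoted in the table above.
hours_k = {
    "English audio, English transcripts": 438,
    "non-English audio, English transcripts": 126,
    "non-English audio, same-language transcripts": 117,
}

total_k = sum(hours_k.values())  # 681, which the text rounds to 680,000 hours
shares = {name: round(100 * h / total_k, 1) for name, h in hours_k.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```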

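Context Tokens

The model is told which task to perform (transcription or translation) by passing the appropriate "context tokens". A typical sequence of context tokens looks like this:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>

This tells the model to decode in English, perform speech recognition, and not predict timestamps. These tokens can be forced or left unforced. Forcing them controls the output language and task, while leaving them unforced lets the model predict the language and task itself.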
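
The structure of the context token sequence can be sketched in plain Python. The helper `build_context_tokens` below is hypothetical and only assembles the string form shown above. It is not part of Transformers, and in practice the processor maps these tokens to IDs (for example via `get_decoder_prompt_ids`, as in the translation example).

```python
def build_context_tokens(language: str = "en", task: str = "transcribe",
                         timestamps: bool = False) -> str:
    """Assemble a Whisper-style context token string (hypothetical helper)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    # The <|notimestamps|> marker is appended only when timestamps are disabled.
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return " ".join(tokens)


print(build_context_tokens())
print(build_context_tokens(language="fr", task="translate"))
```

The first call reproduces the English transcription sequence shown above. The second swaps in a French language token and the translate task, which corresponds to the French-to-English example.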

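Long-Form Transcription

Whisper is inherently designed to work on audio samples of up to 30 seconds. With a chunking algorithm, however, the Transformers pipeline method can transcribe audio of arbitrary length. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline.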
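
To give a feel for what chunking involves, here is a minimal sketch that splits a long clip into overlapping 30-second windows. It is a simplified illustration, not the Transformers implementation: the function name `chunk_bounds`, the 5-second overlap, and the 16 kHz sample rate are assumptions made for this example, and a real pipeline must also merge the transcripts of overlapping windows.

```python
SAMPLE_RATE = 16_000  # Hz, assumed here to match the resampling in the examples


def chunk_bounds(num_samples: int, chunk_length_s: float = 30.0,
                 overlap_s: float = 5.0, sample_rate: int = SAMPLE_RATE):
    """Return (start, end) sample indices of overlapping windows (illustrative)."""
    chunk = int(chunk_length_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


# A 70-second clip yields three windows of at most 30 seconds each.
for start, end in chunk_bounds(70 * SAMPLE_RATE):
    print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```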

Evaluation

The following code shows how to evaluate Whisper Base on LibriSpeech test-clean:

>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").to("cuda")

>>> def map_to_pred(batch):
...     audio = batch["audio"]
...     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
...     batch["reference"] = processor.tokenizer._normalize(batch['text'])
...
...     with torch.no_grad():
...         predicted_ids = model.generate(input_features.to("cuda"))[0]
...     transcription = processor.decode(predicted_ids)
...     batch["prediction"] = processor.tokenizer._normalize(transcription)
...     return batch

>>> result = librispeech_test_clean.map(map_to_pred)

>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
5.082316555716899
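
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between prediction and reference, divided by the number of reference words. The function below is a minimal sketch for a single sentence pair. The name `word_error_rate` is chosen for this example and is not the `evaluate` library's implementation, which aggregates errors over the whole dataset.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of seven reference words.
print(100 * word_error_rate("we are glad to welcome his gospel",
                            "we are glad to welcome this gospel"))
```

Because insertions count as errors, WER is not capped at 100: a prediction with many extra words can produce more errors than the reference has words.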

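Fine-Tuning

The pretrained Whisper model generalizes well across datasets and domains. Fine-tuning can nevertheless improve its predictions for particular languages and tasks. The blog post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper) gives a step-by-step guide to fine-tuning Whisper with as little as 5 hours of labeled data.

Usage Recommendations

The model was trained and evaluated mainly on ASR and speech translation into English, and shows strong ASR results in about 10 languages. Evaluate it thoroughly before deploying it in a particular context or domain.
Do not use Whisper to transcribe recordings of individuals made without their consent, or for any kind of subjective classification. Using it in high-risk domains such as decision-making contexts is not recommended, because flaws in accuracy can lead to pronounced flaws in outcomes.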

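Model Results

Task	Dataset	Metric	Value
Automatic speech recognition	LibriSpeech (clean)	Test WER (word error rate)	5.008769117619326
Automatic speech recognition	LibriSpeech (other)	Test WER (word error rate)	12.84936273212057
Automatic speech recognition	Common Voice 11.0	Test WER (word error rate)	131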

Model Checkpoints

Size	Parameters	English-only	Multilingual
tiny	39 M	[✓](https://huggingface.co/openai/whisper-tiny.en)	[✓](https://huggingface.co/openai/whisper-tiny)
base	74 M	[✓](https://huggingface.co/openai/whisper-base.en)	[✓](https://huggingface.co/openai/whisper-base)
small	244 M	[✓](https://huggingface.co/openai/whisper-small.en)	[✓](https://huggingface.co/openai/whisper-small)
medium	769 M	[✓](https://huggingface.co/openai/whisper-medium.en)	[✓](https://huggingface.co/openai/whisper-medium)
large	1550 M	x	[✓](https://huggingface.co/openai/whisper-large)
large-v2	1550 M	x	[✓](https://huggingface.co/openai/whisper-large-v2)
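
The checkpoint links in the table follow a regular naming pattern, which the sketch below encodes. The helper `checkpoint_id` and the two lookup tables are hypothetical conveniences for this example. The resulting string is the kind of identifier passed to `from_pretrained` in the usage examples.

```python
# Sizes and parameter counts (in millions) from the table above.
CHECKPOINTS = {
    "tiny": 39, "base": 74, "small": 244, "medium": 769,
    "large": 1550, "large-v2": 1550,
}
# Sizes that the table lists without an English-only variant.
MULTILINGUAL_ONLY = {"large", "large-v2"}


def checkpoint_id(size: str, english_only: bool = False) -> str:
    """Build the Hub repository ID for a checkpoint (hypothetical helper)."""
    if size not in CHECKPOINTS:
        raise ValueError(f"unknown size: {size!r}")
    if english_only and size in MULTILINGUAL_ONLY:
        raise ValueError(f"{size} has no English-only checkpoint")
    return f"openai/whisper-{size}" + (".en" if english_only else "")


print(checkpoint_id("base"))
print(checkpoint_id("tiny", english_only=True))
```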

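🔧 Technical Details

Model Architecture

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model.

Training

The model was trained with large-scale weak supervision on 680,000 hours of audio and corresponding transcripts collected from the internet.

Limitations

Hallucination: because the model is trained with weak supervision on large-scale noisy data, its predictions may include text that was not actually spoken in the audio input.
Uneven performance across languages: accuracy is lower for low-resource and/or low-discoverability languages, and for languages with less training data.
Repetitive text: the sequence-to-sequence architecture makes the model prone to generating repetitive text. Beam search and temperature scheduling mitigate this to some extent but do not eliminate it.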
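
One way to act on the repetition problem is to detect looping output and then retry decoding with different settings, such as a higher temperature. The sketch below shows only the detection step, using how well the text compresses as a heuristic. The compression-ratio approach, the 2.4 threshold, and the helper names are assumptions for illustration and are not described in the source.

```python
import zlib

# Assumed threshold; tune it on your own data.
RATIO_THRESHOLD = 2.4


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = RATIO_THRESHOLD) -> bool:
    """Flag text that compresses unusually well (heuristic, hypothetical helper)."""
    return compression_ratio(text) > threshold


normal = ("Mr. Quilter is the apostle of the middle classes "
          "and we are glad to welcome his gospel.")
looped = "we are glad to welcome his gospel " * 20
print(looks_repetitive(normal), looks_repetitive(looped))
```

Highly repetitive text compresses far better than ordinary speech, so a high ratio is a cheap signal that a transcript may have entered a loop.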

📄 License

This model is released under the Apache-2.0 license.

BibTeX Citation

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}