Whisper-large open-source speech model: free deployment for automatic speech recognition and translation

Whisper Large

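Developed by OpenAI

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, it generalizes well across datasets and domains.

Category: speech recognition | Languages: multilingual | License: Apache-2.0 | Tags: #multilingual speech recognition #high-accuracy transcription #speech translation

Downloads: 175.34k

Released: 9/26/2022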

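Model overview

Whisper is a Transformer-based encoder-decoder model that supports multilingual speech recognition and translation and adapts to many datasets without fine-tuning.

Model features

Large-scale pre-training

Trained on 680,000 hours of labeled speech data, which gives it strong generalization ability

Multilingual support

Supports speech recognition and translation across 96 languages

Zero-shot learning

Adapts to many datasets and domains without fine-tuning

Multiple tasks

Supports both speech recognition (transcription in the language of the audio) and speech translation (transcription into a different language)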

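Model capabilities

English speech recognition

Multilingual speech recognition

Speech translation

Audio transcription

Automatic subtitle generation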

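Use cases

Speech transcription

Meeting minutes

Automatically transcribes meeting recordings into written records

Word error rate (WER) on the LibriSpeech test sets: 3.0 (clean) and 5.4 (other)

Podcast subtitles

Automatically generates subtitles for podcast content

Speech translation

Real-time translation

Translates speech in one language into text in another language in real time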

🚀 Whisper

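Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labeled data, it generalizes well to many datasets and domains without fine-tuning.

🚀 Quick start

Whisper is a Transformer-based encoder-decoder model, also known as a _sequence-to-sequence_ model. It was trained on 680,000 hours of speech data labeled using large-scale weak supervision.

✨ Key features

Multilingual support: covers many languages, including English, Chinese, German and Spanish.
Multiple tasks: handles both speech recognition and speech translation.
Multiple model sizes: checkpoints are available in five size configurations.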

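💻 Usage examples

Basic usage

To transcribe audio samples, the model must be used together with a WhisperProcessor. The processor also supplies the context tokens. Given a loaded processor and model, the following line sets them:

model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

This forces the model to predict in English for the speech recognition task.

Advanced usage

The following examples cover different scenarios:

English-to-English transcription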

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-large")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
>>> model.config.forced_decoder_ids = None

>>> # Load a dummy dataset and read an audio file
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']

French-to-French transcription

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-large")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

>>> # Load a streaming dataset and read the first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids)
['<|startoftranscript|> <|fr|> <|transcribe|> <|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Un vrai travail intéressant va enfin être mené sur ce sujet.']

French-to-English translation

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-large")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

>>> # Load a streaming dataset and read the first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']

Evaluation example

The following code evaluates Whisper Large on LibriSpeech test-clean:

>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-large")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to("cuda")

>>> def map_to_pred(batch):
...     audio = batch["audio"]
...     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
...     batch["reference"] = processor.tokenizer._normalize(batch['text'])
...
...     with torch.no_grad():
...         predicted_ids = model.generate(input_features.to("cuda"))[0]
...     transcription = processor.decode(predicted_ids)
...     batch["prediction"] = processor.tokenizer._normalize(transcription)
...     return batch

>>> result = librispeech_test_clean.map(map_to_pred)

>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
3.0003583080317572
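
To make the metric concrete, here is a minimal sketch of how a word error rate can be computed for one reference/prediction pair: the word-level edit distance divided by the number of reference words. The function name word_error_rate and the sample sentences are illustrative assumptions, not part of the evaluate library, which aggregates errors over the whole dataset.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not the `evaluate` implementation.
    """
    ref_words = reference.split()
    hyp_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j predicted words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Made-up sentences: one substitution ("the" -> "an") and one deleted "the".
reference = "mr quilter is the apostle of the middle classes"
prediction = "mr quilter is an apostle of middle classes"
print(f"{100 * word_error_rate(reference, prediction):.1f}")  # prints 22.2
```

Both strings are assumed to be normalized already (lowercased, punctuation removed), as the evaluation code above does before scoring.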

Long-form transcription example

>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...   "automatic-speech-recognition",
...   model="openai/whisper-large",
...   chunk_length_s=30,
...   device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # We can also return timestamps for the predictions
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
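
The timestamped chunks lend themselves to subtitle generation. The sketch below converts a list shaped like the output above into SubRip (SRT) text. The helper names format_srt_time and chunks_to_srt are hypothetical, and the code assumes every chunk carries numeric start and end times.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks) -> str:
    """Build SRT text from dicts with 'text' and 'timestamp' (start, end) keys.

    Hypothetical helper; assumes both timestamps are numbers, not None.
    """
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(entries)


# Same structure as the pipeline output shown above.
chunks = [
    {
        "text": " Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.",
        "timestamp": (0.0, 5.44),
    }
]
print(chunks_to_srt(chunks))
```

For the sample chunk this yields a single entry numbered 1 that spans 00:00:00,000 to 00:00:05,440.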

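📚 Detailed documentation

Model details

Whisper is a Transformer-based encoder-decoder model, also known as a _sequence-to-sequence_ model. It was trained on 680,000 hours of speech data labeled using large-scale weak supervision.

The models were trained on either English-only or multilingual data. The English-only models were trained for speech recognition; the multilingual models were trained for both speech recognition and speech translation. For speech recognition, the model predicts a transcription in the same language as the audio; for speech translation, it predicts a transcription in a different language from the audio.

Whisper checkpoints come in five model-size configurations. The smallest four are trained on either English-only or multilingual data, while the largest checkpoints are multilingual only. All ten pre-trained checkpoints are available on the Hugging Face Hub. The details are summarized in the following table:

Property	Details
Model type	Whisper is a Transformer-based encoder-decoder model. Checkpoints come in five size configurations (tiny, base, small, medium and large); large-v2 is an updated checkpoint of the large configuration. The smallest four sizes are available in English-only and multilingual versions; the largest is multilingual only.
Training data	The models were trained on 680,000 hours of audio and corresponding transcripts collected from the internet. Of this data, 65% (438,000 hours) is English audio with matching English transcripts, roughly 18% (126,000 hours) is non-English audio with English transcripts, and the final 17% (117,000 hours) is non-English audio with transcripts in the same language. The non-English data covers 98 different languages.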
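
The percentages in the training-data row are rounded. As a quick arithmetic check, the sketch below recomputes each share from the hour figures quoted above; the dictionary keys are descriptive labels chosen for this example.

```python
# Hour figures as quoted in the training-data description.
TOTAL_HOURS = 680_000
hours_by_category = {
    "English audio, English transcript": 438_000,
    "non-English audio, English transcript": 126_000,
    "non-English audio, same-language transcript": 117_000,
}

# Share of the quoted total, in percent.
shares = {
    name: 100 * hours / TOTAL_HOURS
    for name, hours in hours_by_category.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"sum of listed hours: {sum(hours_by_category.values()):,}")
```

This prints 64.4%, 18.5% and 17.2%, and the listed hours sum to 681,000, so the quoted percentages and the 680,000-hour total are best read as approximate figures.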

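Context tokens

Context tokens tell the model which task to perform (transcription or translation). A typical sequence of context tokens looks like this:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>

This tells the model to decode in English, perform speech recognition, and not predict timestamps. The tokens can be forced or unforced. If they are forced, the model is made to predict each of these tokens at its position, which gives control over the output language and task; if they are unforced, the model predicts the output language and task on its own.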
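
The structure of that sequence can be sketched in a few lines. The helper build_context_tokens below is hypothetical and only assembles the token strings for illustration; in practice the token IDs come from the processor's get_decoder_prompt_ids method used in the examples above.

```python
def build_context_tokens(language_code: str, task: str,
                         predict_timestamps: bool = False) -> list[str]:
    """Assemble the context token strings in the order shown above.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language_code}|>", f"<|{task}|>"]
    if not predict_timestamps:
        # This token asks the model not to predict timestamps.
        tokens.append("<|notimestamps|>")
    return tokens


print(" ".join(build_context_tokens("en", "transcribe")))
# <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
print(" ".join(build_context_tokens("fr", "translate")))
# <|startoftranscript|> <|fr|> <|translate|> <|notimestamps|>
```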

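🔧 Technical details

Model architecture

Whisper is a Transformer-based encoder-decoder model, also known as a _sequence-to-sequence_ model. It was trained with weak supervision on large-scale labeled speech data.

Training data

The models were trained on 680,000 hours of audio and corresponding transcripts collected from the internet. The amount of data varies by language, and performance in a given language correlates directly with the amount of training data in that language.

Limitations

Hallucination: because the models are trained in a weakly supervised manner on large-scale noisy data, predictions may include text that was not actually spoken in the audio input.
Uneven performance across languages: accuracy is lower on low-resource and/or low-discoverability languages, and on languages with less training data.
Accents and dialects: performance varies across accents and dialects of a given language, which may mean higher word error rates for speakers of different genders, races, ages or other demographic groups.
Repetitive text: the sequence-to-sequence architecture makes the models prone to generating repetitive text. Beam search and temperature scheduling mitigate this to some degree but do not eliminate it.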
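
Since repetition cannot be ruled out at decoding time, a downstream check can at least flag suspicious transcripts for review. The following is a minimal heuristic sketch, not something the model card prescribes: the function name and the default thresholds are assumptions chosen for illustration.

```python
def has_repeated_phrase(text: str, ngram_size: int = 3,
                        min_repeats: int = 3) -> bool:
    """Return True if a phrase of ngram_size words occurs min_repeats
    times back to back. Hypothetical heuristic with arbitrary defaults."""
    words = text.lower().split()
    window = ngram_size * min_repeats
    for start in range(len(words) - window + 1):
        phrase = words[start:start + ngram_size]
        # Compare each following block of ngram_size words with the phrase.
        if all(
            words[start + k * ngram_size:start + (k + 1) * ngram_size] == phrase
            for k in range(1, min_repeats)
        ):
            return True
    return False


looped = "thanks for watching thanks for watching thanks for watching"
clean = ("Mr. Quilter is the apostle of the middle classes "
         "and we are glad to welcome his gospel.")
print(has_repeated_phrase(looped))  # True
print(has_repeated_phrase(clean))   # False
```

A check like this only catches exact back-to-back repeats; legitimately repetitive speech would also trigger it, so flagged transcripts still need a human look.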

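📄 License

This model is released under the Apache-2.0 license.

Additional information

Supported languages

The following languages are supported:

en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su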

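Model index

Name: whisper-large
Results:
- Automatic speech recognition on the LibriSpeech (clean) test set: test WER of 3.0.
- Automatic speech recognition on the LibriSpeech (other) test set: test WER of 5.4.
- Automatic speech recognition on the Common Voice 11.0 test set (language: hi): test WER of 54.8.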

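Important notes and usage recommendations

⚠️ Important

The large-v2 model is recommended over the original large model: it was trained for more epochs with added regularization and performs better.

Exercise caution before using Whisper models to transcribe recordings of individuals made without their consent, or for any kind of subjective classification. Use in high-risk domains such as decision-making contexts is not recommended, because flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; using them for classification has not been evaluated and is not appropriate, particularly for inferring human attributes.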