Whisper-base open-source speech model: free, accurate speech recognition and translation

Whisper Base

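Developed by OpenAI

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680,000 hours of labelled data and generalises well to new datasets and domains.

Speech recognition | Multilingual | License: Apache-2.0 | #multilingual speech recognition #zero-shot translation #large-scale weak supervision

Downloads: 491.35k

Released: 9/26/2022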

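Model overview

Whisper is a Transformer-based encoder-decoder model that supports speech recognition and translation in many languages and can be applied to different datasets and domains without fine-tuning.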

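Model features

Large-scale pre-training

Trained on 680,000 hours of labelled speech data, which gives it strong generalisation

Multilingual support

Supports speech recognition and translation across 99 languages

Zero-shot learning

Applies to new datasets and domains without fine-tuning

Multi-task

Supports both speech recognition and speech translation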

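Model capabilities

English speech recognition

Multilingual speech recognition

Cross-lingual speech translation

Audio transcription

Speech-to-text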

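Use cases

Speech transcription

Meeting notes

Automatically transcribes meeting recordings into written records

WER of 5.01 on the LibriSpeech test-clean set

Podcast transcription

Converts podcast content into searchable text

Speech translation

Real-time translation

Translates speech in one language into text in another language in real time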

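🚀 Whisper speech recognition model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labelled data, it generalises well to many datasets and domains without fine-tuning.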

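🚀 Quick start

Whisper can be used for speech recognition and speech translation. To transcribe audio samples, the model is used together with a WhisperProcessor, which pre-processes the audio inputs and post-processes the model outputs.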

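✨ Key features

Multilingual support: covers many languages, including English, Chinese, German and Spanish.
Strong generalisation: trained on 680,000 hours of labelled data; performs well across many datasets and domains without fine-tuning.
Flexible tasks: performs both speech recognition and speech translation.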

💻 Usage examples

Basic usage

The following example uses Whisper for English speech recognition:

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
>>> model.config.forced_decoder_ids = None

>>> # Load the dummy dataset and read an audio file
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features)
>>> # Decode the token IDs to text, keeping the special tokens
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']

>>> # Decode again, this time dropping the special tokens
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']

Advanced usage

French-to-English speech translation

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # Load the model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
>>> # Force the source language (French) and the task (translate)
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

>>> # Load the streaming dataset and read the first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # Generate token IDs
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # Decode the token IDs to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']

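📚 Detailed documentation

Model information

Property	Details
Model type	Transformer-based encoder-decoder model, also known as a sequence-to-sequence model
Training data	The model was trained on 680,000 hours of audio and corresponding transcripts collected from the internet. About 65% of this data (438,000 hours) is English audio with matching English transcripts, roughly 18% (126,000 hours) is non-English audio with English transcripts, and the remaining 17% (117,000 hours) is non-English audio with transcripts in the same language. The non-English data covers 98 different languages.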
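
The rounded percentages in the training-data row can be checked against the hour counts it quotes. The following is a minimal sketch of that arithmetic; the subset names are labels chosen for this example and do not come from the model card.

```python
# Hour counts as quoted in the training-data description above.
TOTAL_HOURS = 680_000
HOURS_BY_SUBSET = {
    "english_asr": 438_000,          # English audio, English transcripts
    "speech_translation": 126_000,   # non-English audio, English transcripts
    "non_english_asr": 117_000,      # non-English audio, same-language transcripts
}


def percentage_shares(hours_by_subset, total_hours):
    """Return each subset's share of the total, in percent, to one decimal."""
    return {
        name: round(100 * hours / total_hours, 1)
        for name, hours in hours_by_subset.items()
    }


shares = percentage_shares(HOURS_BY_SUBSET, TOTAL_HOURS)
subset_sum = sum(HOURS_BY_SUBSET.values())

for name, share in shares.items():
    print(f"{name}: {share}%")
print(f"sum of subsets: {subset_sum} hours")
```

The shares come out at 64.4%, 18.5% and 17.2%, and the three subsets add up to 681,000 hours, so both the percentages and the hour counts in the description are rounded figures.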

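Context tokens

The model is told which task to perform (transcription or translation) through the appropriate "context tokens". A typical sequence of context tokens looks like this:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>

This tells the model to decode in English, perform speech recognition, and not predict timestamps. The tokens can be forced or left unforced: forcing them fixes the output language and task, while leaving them unforced lets the model predict the language and task itself.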
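
The structure of the context-token sequence can be sketched in a few lines of plain Python. `build_context_tokens` is a hypothetical helper written for this illustration; it only assembles the token strings and is not part of the Transformers API.

```python
from typing import List


def build_context_tokens(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Assemble the context tokens for a language code and a task.

    Hypothetical helper for illustration only: it mirrors the layout of the
    sequence shown above and does not call any Whisper code.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the model not to predict timestamps.
        tokens.append("<|notimestamps|>")
    return tokens


english_asr = " ".join(build_context_tokens("en", "transcribe"))
french_translation = " ".join(build_context_tokens("fr", "translate"))
print(english_asr)
print(french_translation)
```

In practice the token IDs are obtained from the processor, as in the translation example, with `processor.get_decoder_prompt_ids(language="french", task="translate")`.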

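Long-form transcription

Whisper is designed to work on audio samples of up to 30 seconds. With a chunking algorithm, however, it can transcribe audio of arbitrary length through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline.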
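
A minimal sketch of both ideas follows. `chunk_spans` is a hypothetical helper that shows how a long recording might be cut into overlapping 30-second windows; the 5-second overlap is an assumption for illustration, and the code is not the algorithm Transformers uses internally. `transcribe_long_audio` wraps the pipeline call described above; it is defined but not called here because it downloads the model and, for file paths, needs ffmpeg to decode the audio.

```python
from typing import List, Tuple


def chunk_spans(duration_s: float, chunk_s: float = 30.0,
                overlap_s: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, of overlapping windows.

    Simplified illustration of chunking; the overlap value is an assumption.
    """
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    spans = []
    start = 0.0
    step = chunk_s - overlap_s
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break  # the last window reached the end of the audio
        start += step
    return spans


def transcribe_long_audio(path: str) -> str:
    """Transcribe an audio file of any length with chunking enabled."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-base",
        chunk_length_s=30,  # enables chunked long-form transcription
    )
    return asr(path)["text"]


print(chunk_spans(70))
```

For a 70-second recording the helper yields three windows: 0 to 30, 25 to 55 and 50 to 70 seconds.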

Evaluation

The following code shows how to evaluate Whisper Base on LibriSpeech test-clean:

>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-base")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").to("cuda")

>>> def map_to_pred(batch):
...     audio = batch["audio"]
...     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
...     # Normalise the reference text the same way as the prediction
...     batch["reference"] = processor.tokenizer._normalize(batch['text'])
...
...     with torch.no_grad():
...         predicted_ids = model.generate(input_features.to("cuda"))[0]
...     transcription = processor.decode(predicted_ids)
...     batch["prediction"] = processor.tokenizer._normalize(transcription)
...     return batch

>>> result = librispeech_test_clean.map(map_to_pred)

>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
5.082316555716899
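
To make the metric concrete, here is a minimal pure-Python sketch of word error rate (WER) for a single reference and prediction pair: the word-level edit distance divided by the number of reference words, in percent. `word_error_rate` is a hypothetical helper for illustration, not the implementation behind `load("wer")`, and it scores one utterance at a time rather than a whole test set.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by reference length, in percent."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from prediction
            insertion = curr[j - 1] + 1  # extra word in prediction
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))
print(word_error_rate("hello", "hello there big world"))
```

Because inserted words count as errors while only reference words appear in the denominator, WER can exceed 100, which is how a value such as the Common Voice figure in the results table can arise.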

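Fine-tuning

The pre-trained Whisper model generalises well across datasets and domains. Its predictions for particular languages and tasks can nevertheless be improved further through fine-tuning. The blog post [Fine-Tune Whisper with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper) gives a step-by-step guide to fine-tuning Whisper with as little as 5 hours of labelled data.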

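Usage recommendations

The model was trained and evaluated mainly on ASR and on speech translation into English, and shows strong ASR results in about 10 languages. Evaluate it thoroughly before deploying it in a particular context or domain.
Do not use Whisper to transcribe recordings of individuals made without their consent, or for any kind of subjective classification. Use in high-risk domains such as decision-making contexts is not recommended, because flaws in accuracy can lead to pronounced flaws in outcomes.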

Model results

Task	Dataset	Metric	Value
Automatic speech recognition	LibriSpeech (clean)	Test WER (%)	5.008769117619326
Automatic speech recognition	LibriSpeech (other)	Test WER (%)	12.84936273212057
Automatic speech recognition	Common Voice 11.0	Test WER (%)	131

Model checkpoints

Size	Parameters	English-only	Multilingual
tiny	39 M	[✓](https://huggingface.co/openai/whisper-tiny.en)	[✓](https://huggingface.co/openai/whisper-tiny)
base	74 M	[✓](https://huggingface.co/openai/whisper-base.en)	[✓](https://huggingface.co/openai/whisper-base)
small	244 M	[✓](https://huggingface.co/openai/whisper-small.en)	[✓](https://huggingface.co/openai/whisper-small)
medium	769 M	[✓](https://huggingface.co/openai/whisper-medium.en)	[✓](https://huggingface.co/openai/whisper-medium)
large	1550 M	x	[✓](https://huggingface.co/openai/whisper-large)
large-v2	1550 M	x	[✓](https://huggingface.co/openai/whisper-large-v2)

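🔧 Technical details

Model architecture

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model.

Training

The model was trained with large-scale weak supervision on 680,000 hours of audio and corresponding transcripts collected from the internet.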

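Limitations

Hallucination: because the model is trained with weak supervision on large-scale noisy data, its predictions may include text that was not actually spoken in the audio input.
Uneven performance across languages: accuracy is lower on low-resource and/or low-discoverability languages, and on languages with less training data.
Repetitive text: the sequence-to-sequence architecture makes the model prone to generating repetitive text. Beam search and temperature scheduling mitigate this to some degree but do not eliminate it.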
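
The text above does not say how repetitive output is detected. One simple heuristic, assumed here purely for illustration, is to compare a transcript's size before and after compression: heavily repeated text compresses far better than ordinary prose. Both helper names and the 2.4 threshold are choices made for this sketch.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw size to zlib-compressed size; higher means more repetition."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag text whose compression ratio exceeds the (assumed) threshold."""
    return compression_ratio(text) > threshold


normal = ("Mr. Quilter is the apostle of the middle classes "
          "and we are glad to welcome his gospel.")
looped = "la " * 200

print(looks_repetitive(normal))
print(looks_repetitive(looped))
```

A decoding loop could use such a check to decide when to retry a segment with a different temperature, which is one way to read the "temperature scheduling" mentioned above.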

📄 License

This model is released under the Apache-2.0 license.

BibTeX citation

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}