Distil-large-v2: Open-Source Speech Recognition Model, 6x Faster and 49% Smaller with Comparable Accuracy

Distil Large V2

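Developed by distil-whisper

Distil-Whisper is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER of Whisper on out-of-distribution evaluation sets.

Speech recognition | English | License: MIT | #English speech recognition #Efficient inference #Long-form audio processing

Downloads: 42.65k

Release date: 10/24/2023

Model Overview

Distil-Whisper is a distilled version of the Whisper model, optimized for English speech recognition, that provides efficient automatic speech recognition.

Model Features

Efficient inference

6 times faster than the original Whisper model, which makes it suitable for real-time applications.

Smaller size

The model is 49% smaller, which reduces memory usage.

High accuracy

Performs within 1% WER of the original model on out-of-distribution evaluation sets.

Long-form transcription support

Supports a chunked algorithm for long-form audio that is 9 times faster than the sequential algorithm.

Model Capabilities

English speech recognition

Short-form audio transcription

Long-form audio transcription

Speculative decoding

Use Cases

Speech transcription

Meeting notes

Convert meeting recordings into written records.

Podcast transcription

Convert podcast content to text for search and archiving.

Assistive technology

Real-time captioning

Generate real-time captions for videos or live streams.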

🚀 Distil-Whisper: distil-large-v2

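Distil-Whisper is a knowledge-distilled speech recognition model. It optimizes Whisper for speed and model size, delivering faster inference and a smaller footprint while keeping recognition accuracy close to the original.

🚀 Quick Start

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub: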

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]

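✨ Key Features

Faster inference: Distil-Whisper is 6 times faster than the original Whisper model.
Smaller model: the model is 49% smaller than Whisper.
Comparable accuracy: word error rate (WER) is within 1% of Whisper on out-of-distribution evaluation sets.
Multiple scenarios: supports short-form transcription, long-form transcription, and speculative decoding.

📦 Installation

Using Distil-Whisper requires the following dependencies. Install them with: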

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]

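💻 Usage Examples

Basic Usage

Short-Form Transcription

The model can be used with the pipeline class to transcribe short-form audio files (< 30 seconds) as follows: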

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to the audio file when calling the pipeline:

- result = pipe(sample)
+ result = pipe("audio.mp3")

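Long-Form Transcription

Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm is 9 times faster than the sequential algorithm proposed by OpenAI in the Whisper paper. To enable chunking, pass the chunk_length_s parameter to the pipeline. For Distil-Whisper, a chunk length of 15 seconds is optimal. To enable batching, pass the batch_size argument: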

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
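
To illustrate the idea behind chunking, the following is a minimal sketch of how a long recording could be divided into overlapping 15-second windows that can then be transcribed in parallel. The chunk_boundaries helper and the 2.5-second stride are hypothetical choices made for this illustration; this is not the pipeline's internal implementation, which also takes care of merging the text of overlapping windows.

```python
def chunk_boundaries(total_s, chunk_length_s=15.0, stride_s=2.5):
    """Return (start, end) times, in seconds, of overlapping windows.

    Hypothetical helper for illustration only. Consecutive windows
    overlap by 2 * stride_s so that words cut at a window edge also
    appear, with context, in the neighbouring window.
    """
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must exceed twice the stride")
    step = chunk_length_s - 2 * stride_s  # distance between window starts
    boundaries = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        boundaries.append((start, end))
        if end >= total_s:
            break
        start += step
    return boundaries


# A 40-second recording is covered by four overlapping windows.
for start, end in chunk_boundaries(40.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Because the windows are independent of each other, they can be batched together, which is where the batch_size argument comes in.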

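Advanced Usage

Speculative Decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically guarantees exactly the same outputs as Whisper while being 2 times faster. This makes it a drop-in replacement for existing Whisper pipelines, since the outputs are guaranteed to be identical. In the following code snippet, we load the assistant Distil-Whisper model standalone, separately from the main Whisper pipeline, and then specify it as the "assistant model" for generation: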

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
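
To show why the output is guaranteed to match the main model, here is a toy sketch of greedy speculative decoding. The two "models" are hypothetical stand-in functions that map a token prefix to the next token; they are not Whisper or Distil-Whisper, and the lookahead value is an arbitrary choice. In a real system the main model checks all drafted tokens in a single forward pass, whereas this sketch calls it token by token for clarity.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference decoding: one main-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(main_next, draft_next, prompt, max_new_tokens, lookahead=4):
    """Toy greedy speculative decoding. Returns (tokens, verification rounds)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < target_len:
        rounds += 1
        # 1. The cheap draft model proposes a short continuation.
        draft = []
        for _ in range(min(lookahead, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The main model verifies the proposals in order. Only tokens the
        #    main model would have produced itself are ever appended.
        for proposed in draft:
            expected = main_next(tokens)
            tokens.append(expected)
            if expected != proposed:
                break  # first mismatch: discard the rest of the draft
    return tokens, rounds


# Hypothetical toy models: the draft agrees with the main model
# except immediately after token 4.
def main_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


reference = greedy_decode(main_next, [0], 8)
tokens, rounds = speculative_decode(main_next, draft_next, [0], 8)
print(tokens == reference, rounds)
```

The output is identical to plain greedy decoding, but it is produced in fewer verification rounds than generated tokens. The speed-up therefore depends on how often the draft model agrees with the main model.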

📚 Detailed Documentation

Additional Speed and Memory Improvements

Flash Attention

We recommend using Flash Attention 2 if your GPU supports it. To do so, first install Flash Attention:

pip install flash-attn --no-build-isolation

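Then pass use_flash_attention_2=True to from_pretrained (more recent Transformers releases deprecate this flag in favor of attn_implementation="flash_attention_2"):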

- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)

Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend using BetterTransformer. To do so, first install optimum:

pip install --upgrade optimum

Then convert your model to a "BetterTransformer" model before using it:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()

Running Distil-Whisper in `openai-whisper`

To use the model in the original Whisper format, first make sure the openai-whisper package is installed:

pip install --upgrade openai-whisper

The following code snippet shows how to transcribe a sample file from the LibriSpeech dataset loaded with 🤗 Datasets:

import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from whisper import load_model, transcribe

distil_large_v2 = hf_hub_download(repo_id="distil-whisper/distil-large-v2", filename="original-model.bin")
model = load_model(distil_large_v2)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]["array"]
sample = torch.from_numpy(sample).float()

pred_out = transcribe(model, audio=sample)
print(pred_out["text"])

To transcribe a local audio file, simply pass the path to the audio file as the audio argument to transcribe:

pred_out = transcribe(model, audio="audio.mp3")

Whisper.cpp

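Distil-Whisper can be run from the Whisper.cpp repository with the original sequential long-form transcription algorithm. In a provisional benchmark on a Mac M1, distil-large-v2 is 2 times faster than large-v2 while performing within 0.1% WER on long-form audio. Steps for getting started:

Clone the Whisper.cpp repository: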

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

Download the ggml weights for distil-large-v2 from the Hugging Face Hub:

python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='distil-whisper/distil-large-v2', filename='ggml-large-32-2.en.bin', local_dir='./models')"

If you do not have the huggingface_hub package installed, you can also download the weights with wget:

wget https://huggingface.co/distil-whisper/distil-large-v2/resolve/main/ggml-large-32-2.en.bin -P ./models

Run inference on the provided sample audio:

make -j && ./main -m models/ggml-large-32-2.en.bin -f samples/jfk.wav

Transformers.js

import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v2');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: " And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }

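See the Transformers.js documentation for more information. Note: because the model is large, we recommend running it server-side with Node.js rather than in the browser.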

Candle

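Through an integration with Hugging Face Candle 🕯️, Distil-Whisper is now available in the Rust library 🦀. Benefits include:

Optimized CPU backend with optional MKL support for x86 and Accelerate support for Macs.
CUDA backend for running efficiently on GPUs, with multi-GPU distribution via NCCL.
WASM support: run Distil-Whisper in a browser.

Steps for getting started:

Install candle-core by following the Candle installation instructions.
Clone the candle repository locally: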

git clone https://github.com/huggingface/candle.git

Enter the Whisper example directory:

cd candle/candle-examples/examples/whisper

Run an example:

cargo run --example whisper --release -- --model distil-large-v2

To specify your own audio file, add the --input flag:

cargo run --example whisper --release -- --model distil-large-v2 --input audio.wav

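Model Evaluation

The following code snippet shows how to evaluate the Distil-Whisper model on the LibriSpeech validation set in streaming mode, meaning no audio data has to be downloaded to your local device. First, install the required packages, including 🤗 Datasets to stream the audio data and 🤗 Evaluate to compute the WER: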

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] evaluate jiwer

Evaluation can then be run end-to-end with the following example:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer
from datasets import load_dataset
from evaluate import load
import torch
from tqdm import tqdm

# define our torch configuration
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

# load the model + processor
model =  AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True, low_cpu_mem_usage=True)
model = model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# load the dataset with streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# define the evaluation metric
wer_metric = load("wer")
normalizer = EnglishTextNormalizer(processor.tokenizer.english_spelling_normalizer)

def inference(batch):
    # 1. Pre-process the audio data to log-mel spectrogram inputs
    audio = [sample["array"] for sample in batch["audio"]]
    input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
    input_features = input_features.to(device, dtype=torch_dtype)
    
    # 2. Auto-regressively generate the predicted token ids
    pred_ids = model.generate(input_features, max_new_tokens=128, language="en", task="transcribe")
    
    # 3. Decode the token ids to the final transcription
    batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    batch["reference"] = batch["text"]
    return batch

dataset = dataset.map(function=inference, batched=True, batch_size=16)

all_transcriptions = []
all_references = []

# iterate over the dataset and run inference
for i, result in tqdm(enumerate(dataset), desc="Evaluating..."):
    all_transcriptions.append(result["transcription"])
    all_references.append(result["reference"])

# normalize predictions and references
all_transcriptions = [normalizer(transcription) for transcription in all_transcriptions]
all_references = [normalizer(reference) for reference in all_references]

# compute the WER metric
wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
print(wer)

Print output:

2.983685535968466
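
For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of a corpus-level WER: the word-level edit distance (substitutions, deletions, and insertions) summed over all utterances, divided by the total number of reference words. The function names are hypothetical, and this is only an illustration of one common convention; the evaluation script above relies on the wer metric from 🤗 Evaluate instead.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(
        edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return errors / total_words


references = ["the cat sat on the mat", "hello world"]
predictions = ["the cat sat on mat", "hello word"]

# One deletion plus one substitution over eight reference words.
print(100 * corpus_wer(references, predictions))  # 25.0
```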

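🔧 Technical Details

Model Architecture

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditioned on all previous tokens and the encoder hidden states. Consequently, the encoder runs forward only once, whereas the decoder runs as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of total inference time. To optimize for latency, the focus should therefore be on minimizing the inference time of the decoder.

To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder is copied entirely from the teacher to the student and frozen during training. The student's decoder consists of only two decoder layers, which are initialized from the first and last decoder layers of the teacher. All other decoder layers of the teacher are discarded. The model is then trained on a weighted sum of the KL divergence and pseudo-label loss terms.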
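
The two ingredients described above can be sketched in plain Python for a single decoding step: choosing which teacher decoder layers initialize the student, and combining the KL divergence term with the pseudo-label (cross-entropy) term. This is an illustrative sketch only. The function names are hypothetical, and the weights alpha_kl and alpha_pl are placeholder values chosen for the example rather than values confirmed by this document.

```python
import math


def student_layer_indices(num_teacher_layers):
    """Indices of the teacher decoder layers copied into a 2-layer student."""
    return [0, num_teacher_layers - 1]  # first and last layer


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) over the vocabulary."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def distillation_loss(student_logits, teacher_logits, pseudo_label_id,
                      alpha_kl=0.8, alpha_pl=1.0):
    """Weighted sum of the KL term and the pseudo-label cross-entropy term.

    alpha_kl and alpha_pl are placeholder weights for this sketch.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    kl_term = kl_divergence(teacher_probs, student_probs)
    pl_term = -math.log(student_probs[pseudo_label_id])
    return alpha_kl * kl_term + alpha_pl * pl_term


# Whisper large-v2 has 32 decoder layers; the student keeps two of them.
print(student_layer_indices(32))  # [0, 31]

teacher_logits = [2.0, 0.5, -1.0]  # toy 3-token vocabulary
matched = distillation_loss(teacher_logits, teacher_logits, pseudo_label_id=0)
mismatched = distillation_loss([0.0, 2.0, -1.0], teacher_logits, pseudo_label_id=0)
print(matched < mismatched)  # True
```

When the student reproduces the teacher's distribution exactly, the KL term vanishes and only the pseudo-label term remains.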

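Training Data

Distil-Whisper is trained on approximately 22,000 hours of audio data from 9 open-source, permissively licensed speech datasets on the Hugging Face Hub:

Dataset	Size / hours	Speakers	Domain	License
People's Speech	12,000	unknown	Internet Archive	CC-BY-SA-4.0
Common Voice 13	3,000	unknown	Narrated Wikipedia	CC0-1.0
GigaSpeech	2,500	unknown	Audiobook, podcast, YouTube	apache-2.0
Fisher	1,960	11,900	Telephone conversations	LDC
LibriSpeech	960	2,480	Audiobooks	CC-BY-4.0
VoxPopuli	540	1,310	European Parliament	CC0
TED-LIUM	450	2,030	TED talks	CC-BY-NC-ND 3.0
SwitchBoard	260	540	Telephone conversations	LDC
AMI	100	unknown	Meetings	CC-BY-4.0

Total	21,770	18,260+

The combined dataset spans 10 distinct domains and over 50,000 speakers. This diversity is key to making the distilled model robust to different audio distributions and noise.

The audio data is then pseudo-labelled using the Whisper large-v2 model: we use Whisper to generate predictions for all the audio in the training set and use these predictions as target labels during training. Using pseudo-labels ensures that transcriptions are formatted consistently across datasets and provides a sequence-level distillation signal during training.

WER Filter

Whisper's pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we train only on accurate pseudo-labels, we apply a simple WER heuristic during training. First, we normalize the Whisper pseudo-labels and the ground-truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds a specified threshold, we discard the training example; otherwise, we keep it for training.

Section 9.2 of the Distil-Whisper paper demonstrates the effectiveness of this filter for improving the downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to hallucinations to this filter.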
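
The filtering heuristic can be sketched as follows. This is a simplified illustration and not the training code: the normalize function is a crude stand-in for the Whisper English normalizer, the "text" and "pseudo_label" field names are hypothetical, and the 10% threshold is an assumed value chosen for the example.

```python
import re


def normalize(text):
    """Crude stand-in normalizer: lowercase, drop punctuation, split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(ref_words, hyp_words):
    """Word-level edit distance divided by the number of reference words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref_words)


def filter_by_wer(examples, threshold=0.10):
    """Keep examples whose pseudo-label WER does not exceed the threshold."""
    kept = []
    for example in examples:
        reference = normalize(example["text"])
        hypothesis = normalize(example["pseudo_label"])
        if not reference:
            continue  # WER is undefined for an empty reference
        if word_error_rate(reference, hypothesis) <= threshold:
            kept.append(example)
    return kept


examples = [
    # Differs only in casing and punctuation: WER 0 after normalization.
    {"text": "Hello, world!", "pseudo_label": "hello world"},
    # Repetition-style hallucination: WER 2.0, discarded.
    {"text": "thank you", "pseudo_label": "thank you thank you thank you"},
    # One substituted word out of four: WER 0.25, discarded.
    {"text": "see you at noon", "pseudo_label": "see you at new"},
]

kept = filter_by_wer(examples)
print([example["text"] for example in kept])  # ['Hello, world!']
```

Normalizing before scoring matters: without it, harmless differences in casing and punctuation would count as errors and discard otherwise accurate pseudo-labels.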

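Training Procedure

The model was trained for 80,000 optimization steps (or 8 epochs). The Tensorboard training logs are available at: https://huggingface.co/distil-whisper/distil-large-v2/tensorboard?params=scalars#frame

Evaluation Results

The distilled model performs within 1% WER of Whisper on out-of-distribution (OOD) short-form audio and outperforms Whisper by 0.1% on OOD long-form audio. This gain is attributed to a lower hallucination rate.

For a detailed per-dataset breakdown of the evaluation results, refer to Tables 16 and 17 of the Distil-Whisper paper.

Distil-Whisper is also evaluated on the ESB benchmark datasets as part of the OpenASR leaderboard, where it performs within 0.2% WER of Whisper.

📄 License

Distil-Whisper inherits the MIT license from OpenAI's Whisper model.

Citation

If you use this model, please consider citing the Distil-Whisper paper: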

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

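Acknowledgements

OpenAI, for the Whisper model and original codebase.
Hugging Face 🤗 Transformers, for the model integration.
Google's TPU Research Cloud (TRC) program, for Cloud TPU v4s.
@rsonavane, for releasing an early version of Distil-Whisper on the LibriSpeech dataset.

⚠️ Important Note

Distil-Whisper is currently available only for English speech recognition. We are working with the community to distill Whisper for other languages. If you are interested in distilling Whisper for your own language, check out the provided training code. We will update the Distil-Whisper repository with multilingual checkpoints when they are ready!

💡 Usage Recommendation

Following OpenAI's release of Whisper large-v3, an updated distil-large-v3 model has been published. The distil-large-v3 model surpasses the performance of distil-large-v2 with no architecture changes, and it has better support for sequential long-form generation. It is therefore recommended to use distil-large-v3 instead of distil-large-v2.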