kotoba-whisper-v2.0: Open-Source Japanese Speech Recognition Model, Free to Deploy, 6.3x Faster Inference

Kotoba Whisper V2.0

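Developed by kotoba-tech

Kotoba-Whisper is a distilled model for Japanese automatic speech recognition (ASR), developed by Asahi Ushio in collaboration with Kotoba Technologies. It is distilled from Whisper large-v3 and runs inference 6.3x faster.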

Speech Recognition

Transformers

Language: Japanese · License: Apache-2.0 · #JapaneseSpeechRecognition #EfficientDistilledModel #LowLatencyInference

Downloads: 8,108

Released: 9/17/2024

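Model Overview

A Japanese automatic speech recognition model that uses knowledge distillation to compress Whisper large-v3, substantially speeding up inference while keeping the error rate close to the teacher's.

Model Features

Efficient inference

Inference is 6.3x faster than the original Whisper large-v3.

Strong accuracy

On the ReazonSpeech held-out test set, both CER and WER are lower than the original large-v3 (CER 11.6 vs. 14.9); on CommonVoice 8 and JSUT Basic 5000 they are slightly higher.

Large-scale training

Trained on more than 7.2 million Japanese speech-text pairs.

Model Capabilities

Japanese speech-to-text

Chunked processing of long audio

Flash Attention 2 acceleration

Use Cases

Speech transcription

TV subtitle generation

Transcribes audio from Japanese TV programs into accurate subtitles.

CER 11.6 / WER 55.6 on the ReazonSpeech test set.

Voice assistants

Provides fast, accurate speech recognition for Japanese voice assistants.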

🚀 Kotoba-Whisper (v2.0)

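Kotoba-Whisper is a collection of distilled models for Japanese automatic speech recognition (ASR). Developed by Asahi Ushio in collaboration with Kotoba Technologies, it is 6.3x faster than OpenAI's Whisper large-v3 while keeping a low error rate, addressing both the efficiency and the accuracy of Japanese speech transcription.

🚀 Quick Start

Kotoba-Whisper is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first install the latest version of Transformers: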

pip install --upgrade pip
pip install --upgrade transformers accelerate

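✨ Key Features

Efficient: Kotoba-Whisper is 6.3x faster than Whisper large-v3 while keeping a low error rate.
Works across scenarios: transcribes both short audio (< 30 s) and long audio (> 30 s), with two long-form algorithms to choose from, sequential and chunked.
Tunable: additional speed and memory optimizations can further reduce inference time and VRAM requirements.

📦 Installation

To run Kotoba-Whisper, install the latest version of Transformers along with the other required dependencies: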

pip install --upgrade pip
pip install --upgrade transformers accelerate

To run the evaluation, also install the following packages:

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] evaluate jiwer

To use Flash Attention 2, install:

pip install flash-attn --no-build-isolation
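
Because support starts at Transformers 4.39, it can be worth confirming the installed version before loading the model. The sketch below is one way to do that with the standard library only. The helper name `meets_minimum` is hypothetical, and the simple parser assumes plain numeric versions such as 4.39.0 (suffixes like dev0 are ignored).

```python
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, minimum: str = "4.39") -> bool:
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    def parse(v: str) -> tuple:
        parts = []
        for piece in v.split("."):
            if not piece.isdigit():
                break  # stop at suffixes such as "dev0" or "0rc1"
            parts.append(int(piece))
        return tuple(parts)

    have, need = parse(installed), parse(minimum)
    # Pad the shorter tuple with zeros so "4.39" compares equal to "4.39.0"
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


try:
    installed = version("transformers")
    print(installed, "OK" if meets_minimum(installed) else "too old, please upgrade")
except PackageNotFoundError:
    print("transformers is not installed")
```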

💻 Usage Examples

Basic Usage

Short-Form Transcription

The model can be used with the pipeline class to transcribe short audio files (< 30 s):

import torch
from transformers import pipeline
from datasets import load_dataset

# Configuration
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# Load the model
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs
)

# Load a sample audio clip
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# Run inference
result = pipe(sample, generate_kwargs=generate_kwargs)
print(result["text"])

To transcribe a local audio file, pass the path to the file when calling the pipeline (make sure the audio is sampled at 16 kHz):

- result = pipe(sample, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", generate_kwargs=generate_kwargs)
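
Since the text asks for 16 kHz audio, a quick pre-flight check can catch mismatched files early. The following is a minimal sketch using Python's standard `wave` module, so it only applies to uncompressed PCM WAV files; compressed formats such as MP3 need a different tool. The helper names `wav_sample_rate` and `is_16khz` are hypothetical.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path: str) -> bool:
    """True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == 16000
```

For example, `is_16khz("audio.wav")` could be checked before handing the path to the pipeline, with resampling applied first when it returns False.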

For segment-level timestamps, pass return_timestamps=True and read the "chunks" field of the result:

result = pipe(sample, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result["chunks"])
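
For the subtitle use case, the timestamped segments can be turned into an SRT file. The sketch below assumes each entry of result["chunks"] is a dict with a "timestamp" pair (start, end) in seconds and a "text" string; the helper names and the sample data are hypothetical, and treating a missing end time as equal to the start time is a simplification.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list) -> str:
    """Convert pipeline-style chunks into SRT subtitle text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final segment may lack an end time
            end = start
        blocks.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Hypothetical chunks, shaped like result["chunks"]
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": "こんにちは"},
    {"timestamp": (2.5, 5.04), "text": "今日はいい天気ですね"},
]
print(chunks_to_srt(example_chunks))
```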

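Sequential Long-Form Transcription

Kotoba-Whisper is designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference over long audio files (> 30 s) and returns more accurate transcriptions than the chunked long-form algorithm. By default, a long audio file passed to the model is transcribed with the sequential algorithm:

import numpy as np
import torch
from transformers import pipeline
from datasets import load_dataset

# Configuration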
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# Load the model
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs
)

# Load sample audio (concatenate instances to create a long clip)
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate']}

# Run inference
result = pipe(sample, generate_kwargs=generate_kwargs)
print(result["text"])

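Advanced Usage

Chunked Long-Form Transcription

Use this algorithm when a single large audio file needs to be transcribed and the fastest possible inference is required. In such cases, the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation:

import numpy as np
import torch
from transformers import pipeline
from datasets import load_dataset

# Configuration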
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# Load the model
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16
)

# Load sample audio (concatenate instances to create a long clip)
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = {"array": np.concatenate([i["array"] for i in dataset[:20]["audio"]]), "sampling_rate": dataset[0]['audio']['sampling_rate']}

# Run inference
result = pipe(sample, chunk_length_s=15, generate_kwargs=generate_kwargs)
print(result["text"])
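
To build intuition for chunk_length_s, the sketch below lists the overlapping windows that a 15-second chunk length produces over a long recording. It is a simplified illustration rather than the library's code: the function name `chunk_bounds` is hypothetical, and the overlap of chunk_length_s / 6 on each side is an assumption based on the documented default of the pipeline's stride_length_s.

```python
def chunk_bounds(total_s: float, chunk_length_s: float = 15.0, stride_s=None) -> list:
    """Return (start, end) times in seconds of overlapping windows over the audio."""
    if stride_s is None:
        stride_s = chunk_length_s / 6  # assumed default overlap on each side
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # the last window reached the end of the audio
            break
        start += step
    return bounds


# A hypothetical 40-second recording
for start, end in chunk_bounds(40.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Because the windows do not depend on one another, they can be transcribed in batches (batch_size=16 above), which is where the speed-up over the sequential algorithm comes from.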

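Additional Speed and Memory Optimizations

Additional speed and memory optimizations can further reduce inference time and VRAM requirements. They mainly target the attention kernel, swapping it from the eager implementation to a more efficient Flash Attention version.

Flash Attention 2

Flash Attention 2 is recommended if your GPU supports it. First install Flash Attention:

pip install flash-attn --no-build-isolation

Then set attn_implementation="flash_attention_2" in model_kwargs, which the pipeline forwards to from_pretrained: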

- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
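
When the same script has to run on machines with and without the flash-attn package, the choice can be made at runtime. This is a minimal sketch with a hypothetical helper, `pick_attn_kwargs`; it only checks that the package is importable, which does not guarantee that the GPU itself supports Flash Attention 2.

```python
import importlib.util


def pick_attn_kwargs(cuda_available: bool) -> dict:
    """Choose model_kwargs for the attention implementation."""
    if not cuda_available:
        return {}  # on CPU, keep the library default
    if importlib.util.find_spec("flash_attn") is not None:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}  # fall back when flash-attn is absent
```

In the examples above, this would replace the hard-coded line with `model_kwargs = pick_attn_kwargs(torch.cuda.is_available())`.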

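📚 Documentation

Model Details

See https://huggingface.co/distil-whisper/distil-large-v3#model-details.

Training

For details of model training, see https://github.com/kotoba-tech/kotoba-whisper. The datasets used for distillation and all model variants are available at https://huggingface.co/japanese-asr.

Evaluation

The following code snippet shows how to evaluate the kotoba-whisper model on a Japanese ASR test set; as written, dataset_name selects the ReazonSpeech test set: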

import torch
from transformers import pipeline
from datasets import load_dataset
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# Model configuration
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}
normalizer = BasicTextNormalizer()

# Data configuration
dataset_name = "japanese-asr/ja_asr.reazonspeech_test"
audio_column = 'audio'
text_column = 'transcription'

# Load the model
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16
)

# Load the dataset (audio is expected at 16 kHz) and run inference
dataset = load_dataset(dataset_name, split="test")
transcriptions = pipe(dataset[audio_column], generate_kwargs=generate_kwargs)
# Normalize and strip spaces so that scoring is done per character
transcriptions = [normalizer(i['text']).replace(" ", "") for i in transcriptions]
references = [normalizer(i).replace(" ", "") for i in dataset[text_column]]

# Compute the CER metric
cer_metric = load("cer")
cer = 100 * cer_metric.compute(predictions=transcriptions, references=references)
print(cer)
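
To make the metric concrete, here is a small pure-Python sketch of the standard character error rate: the character-level edit distance between prediction and reference, divided by the number of reference characters. The function names are hypothetical, and pooling edits and characters over the whole corpus is an assumption of this sketch rather than a statement about the `evaluate` implementation.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer_percent(references: list, predictions: list) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return 100 * total_edits / total_chars


print(cer_percent(["こんにちは"], ["こんばんは"]))
```

Because insertions count as errors but do not lengthen the reference, a model that emits much more text than was spoken can score above 100, which is how a value such as 137.9 can appear in a CER table.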

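Hugging Face versions of the main Japanese ASR datasets are hosted under the japanese-asr organization (https://huggingface.co/japanese-asr). For example, to evaluate the model on JSUT Basic5000, change dataset_name: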

- dataset_name = "japanese-asr/ja_asr.reazonspeech_test"
+ dataset_name = "japanese-asr/ja_asr.jsut_basic5000"

🔧 Technical Details

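Kotoba-Whisper is a collection of distilled Whisper models for Japanese automatic speech recognition (ASR), developed by Asahi Ushio in collaboration with Kotoba Technologies. Following the original distil-whisper work (Robust Knowledge Distillation via Large-Scale Pseudo Labelling), it uses OpenAI's Whisper large-v3 as the teacher model. The student model consists of the teacher's full encoder and a two-layer decoder initialized from the first and last decoder layers of large-v3. Kotoba-Whisper is 6.3x faster than large-v3 while keeping an error rate comparable to that of large-v3.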
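
The decoder initialization described above can be pictured as a simple layer selection. The sketch below uses plain Python dicts as stand-ins for decoder layers; the helper name `init_student_decoder` is hypothetical, the count of 32 teacher decoder layers is an assumption made for illustration, and real training code would copy actual model weights instead.

```python
import copy


def init_student_decoder(teacher_layers: list, keep=(0, -1)) -> list:
    """Copy the selected teacher layers (first and last by default)."""
    return [copy.deepcopy(teacher_layers[i]) for i in keep]


# Stand-in for the teacher decoder: 32 placeholder layers (assumed count)
teacher_decoder = [{"layer": i, "weights": [0.0]} for i in range(32)]
student_decoder = init_student_decoder(teacher_decoder)

# The student starts from copies of the teacher's first and last layers
print([layer["layer"] for layer in student_decoder])
```

The encoder is reused in full, so the speed-up comes from shrinking the decoder to two layers; distillation then trains the student to reproduce the teacher's pseudo-labels.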

Evaluation Metrics

Model	CommonVoice 8 (Japanese test set), CER / WER (%)	JSUT Basic 5000, CER / WER (%)	ReazonSpeech (held-out test set), CER / WER (%)
kotoba-tech/kotoba-whisper-v2.0	9.2 / 58.8	8.4 / 63.7	11.6 / 55.6
kotoba-tech/kotoba-whisper-v1.0	9.4 / 59.2	8.5 / 64.3	12.2 / 56.4
openai/whisper-large-v3	8.5 / 55.1	7.1 / 59.2	14.9 / 60.2
openai/whisper-large-v2	9.7 / 59.3	8.2 / 63.2	28.1 / 74.1
openai/whisper-large	10 / 61.1	8.9 / 66.4	34.1 / 74.9
openai/whisper-medium	11.5 / 63.4	10 / 69.5	33.2 / 76
openai/whisper-base	28.6 / 87.2	24.9 / 93	70.4 / 91.8
openai/whisper-small	15.1 / 74.2	14.2 / 81.9	41.5 / 83
openai/whisper-tiny	53.7 / 93.8	36.5 / 97.6	137.9 / 94.9
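
To compare the student with its teacher at a glance, the CER figures can be expressed as a relative change. The snippet below is a small worked calculation using only the kotoba-whisper-v2.0 and whisper-large-v3 CER values from the table; the helper name `relative_change` is hypothetical.

```python
# CER values (%) copied from the table above
cer = {
    "CommonVoice 8": {"kotoba-whisper-v2.0": 9.2, "whisper-large-v3": 8.5},
    "JSUT Basic 5000": {"kotoba-whisper-v2.0": 8.4, "whisper-large-v3": 7.1},
    "ReazonSpeech": {"kotoba-whisper-v2.0": 11.6, "whisper-large-v3": 14.9},
}


def relative_change(student: float, teacher: float) -> float:
    """Percent change of the student's CER relative to the teacher's."""
    return 100 * (student - teacher) / teacher


for dataset, scores in cer.items():
    change = relative_change(scores["kotoba-whisper-v2.0"], scores["whisper-large-v3"])
    print(f"{dataset}: {change:+.1f}%")
```

A negative value means the distilled model has the lower CER. On these numbers it is lower on ReazonSpeech and somewhat higher on CommonVoice 8 and JSUT Basic 5000.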