Kotoba-Whisper-v1.1: Open-Source Japanese Speech Recognition Model with Free Punctuation and Timestamp Processing

Kotoba Whisper V1.1

Developed by kotoba-tech

Kotoba-Whisper-v1.1 is a Whisper-based Japanese automatic speech recognition model that adds punctuation and timestamp post-processing.

Speech recognition

Transformers

Language: Japanese | License: Apache-2.0 | Tags: #Japanese speech recognition #Automatic punctuation #Low-latency inference

Downloads: 476

Released: 4/29/2024

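Model overview

A Japanese automatic speech recognition (ASR) model built on the Whisper architecture. It is tuned for Japanese transcription and integrates punctuation insertion and timestamp refinement.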

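Model features

Punctuation

Integrates the punctuators library to add punctuation to transcripts automatically.

Timestamps

Uses the stable-ts library to improve timestamp accuracy.

Japanese optimization

Optimized specifically for Japanese speech recognition.

Efficient inference

Faster than openai/whisper-large-v3 in the latency comparison on this page, although the post-processing makes it slower than kotoba-whisper-v1.0.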

Model capabilities

Japanese speech recognition

Automatic punctuation

Timestamp generation

Long-form audio processing

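Use cases

Speech transcription

Meeting transcription

Convert recordings of Japanese meetings into punctuated text.

Lower raw CER than the OpenAI Whisper models on the ReazonSpeech held-out test set; on CommonVoice 8 and JSUT Basic 5000 the larger OpenAI Whisper models score lower.

Podcast transcription

Transcribe Japanese podcasts into timestamped text.

Supports long-form audio.

Speech analysis

Speech content analysis

Use the transcripts to analyze keywords and topics in Japanese speech.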

🚀 Kotoba-Whisper-v1.1

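Kotoba-Whisper-v1.1 is a Japanese automatic speech recognition (ASR) model. It builds on kotoba-tech/kotoba-whisper-v1.0 and integrates an additional post-processing stack into the pipeline, for example adding punctuation with punctuators. The model was developed through a collaboration between Asahi Ushio and Kotoba Technologies.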

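✨ Key features

Based on kotoba-tech/kotoba-whisper-v1.0, with an additional post-processing stack integrated.
Adds punctuation to the predicted transcript with punctuators.
Supported in the Hugging Face 🤗 Transformers library from version 4.39 onward.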

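📦 Installation

Kotoba-Whisper-v1.1 is supported in Hugging Face 🤗 Transformers 4.39 and later. To run the model, first install the latest version of Transformers together with the post-processing dependencies: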

pip install --upgrade pip
pip install --upgrade transformers accelerate torchaudio
pip install stable-ts==2.16.0
pip install punctuators==0.0.5

To use Flash-Attention 2, install Flash Attention first:

pip install flash-attn --no-build-isolation
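
After installing, it can help to confirm that the environment meets the Transformers 4.39 minimum stated above. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `installed_version`) are illustrative and not part of any of the packages involved.

```python
from importlib import metadata

MIN_TRANSFORMERS = (4, 39)  # minimum version stated in this guide


def parse_version(version: str) -> tuple:
    """Turn '4.39.0' or '4.40.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """Return True when `version` is at least `minimum`."""
    return parse_version(version) >= minimum


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed; packages that are missing print as None.
for name in ("transformers", "stable-ts", "punctuators"):
    found = installed_version(name)
    print(name, found)
    if name == "transformers" and found is not None:
        print("  meets 4.39 minimum:", meets_minimum(found))
```

The comparison only looks at leading numeric components, so pre-release suffixes are ignored; a stricter check could use the `packaging` library instead.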

💻 Usage

Basic usage

The model can transcribe audio files through the pipeline API:

import torch
from transformers import pipeline
from datasets import load_dataset

# Configuration: half precision and SDPA attention when a GPU is available
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "ja", "task": "transcribe"}

# Load the model; punctuator=True enables punctuation post-processing
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True
)

# Load a sample audio clip
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# Run inference
result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
print(result)
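
The printed result is a dictionary. The sketch below shows one way to turn it into readable timestamped lines, assuming it follows the usual Transformers ASR pipeline layout when `return_timestamps=True`: a `text` string plus a `chunks` list whose items carry a `timestamp` pair of start and end seconds and a `text` field. The helper names and the sample dictionary are illustrative; the sample is hand-written, not real model output.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm; a missing timestamp becomes a placeholder."""
    if seconds is None:
        return "??:??:??.???"
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_chunks(result):
    """Return one 'start --> end  text' line per chunk in a pipeline result."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}  {text}")
    return lines


# Hand-written stand-in for a pipeline result (assumed layout, not real output).
example = {
    "text": "こんにちは。今日はいい天気です。",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": "こんにちは。"},
        {"timestamp": (1.5, 3.72), "text": "今日はいい天気です。"},
    ],
}

for line in format_chunks(example):
    print(line)
```

With a real run, `format_chunks(result)` can be called directly on the value returned by `pipe(...)`. The `None` branch is there because an end timestamp may be missing for a final chunk.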

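Advanced usage

Transcribing a local audio file

To transcribe a local audio file, pass the file path when calling the pipeline:

- result = pipe(sample, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)
+ result = pipe("audio.mp3", chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)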
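
When several local files need transcribing, the same call can be wrapped in a small loop. This is a sketch under the assumption that each call returns a dictionary with a `text` key, as in the examples above; `transcribe_files` and `fake_pipe` are hypothetical names, and `fake_pipe` only stands in for the real pipeline so the snippet runs without downloading a model.

```python
def transcribe_files(transcribe, paths):
    """Call `transcribe(path)` for each path and return {path: transcript text}."""
    results = {}
    for path in paths:
        output = transcribe(path)
        results[str(path)] = output["text"]
    return results


def fake_pipe(path):
    """Stand-in for the real pipeline; returns a placeholder transcript."""
    return {"text": f"<transcript of {path}>"}


print(transcribe_files(fake_pipe, ["a.mp3", "b.wav"]))
```

With the real pipeline, the first argument could be a lambda such as `lambda p: pipe(p, chunk_length_s=15, return_timestamps=True, generate_kwargs=generate_kwargs)`.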

Disabling the punctuator

To disable the punctuator, change the pipeline argument as follows:

-     punctuator=True
+     punctuator=False

Transcription with a prompt

Kotoba-whisper can generate a transcript conditioned on a prompt:

import re
import torch
from transformers import pipeline
from datasets import load_dataset

# Configuration
model_id = "kotoba-tech/kotoba-whisper-v1.1"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
generate_kwargs = {"language": "japanese", "task": "transcribe"}

# Load the model
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    batch_size=16,
    trust_remote_code=True
)

# Load the sample audio
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")

# --- Without prompt ---
text = pipe(dataset[10]["audio"], chunk_length_s=15, generate_kwargs=generate_kwargs)['text']
print(text)
# 81歳、力強い走りに変わってきます。

# --- With prompt ---: change `81` to `91`.
prompt = "91歳"
generate_kwargs['prompt_ids'] = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt").to(device)
text = pipe(dataset[10]["audio"], generate_kwargs=generate_kwargs)['text']
# The ASR pipeline currently prepends the prompt to the transcript, so strip it.
# re.escape keeps the match literal if the prompt contains regex metacharacters.
text = re.sub(rf"\A\s*{re.escape(prompt)}\s*", "", text)
print(text)
# あっぶったでもスルガさん、91歳、力強い走りに変わってきます。

Using Flash Attention 2

If your GPU supports it, Flash-Attention 2 is recommended. To enable it, set attn_implementation="flash_attention_2" in model_kwargs, which the pipeline passes on to from_pretrained:

- model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
+ model_kwargs = {"attn_implementation": "flash_attention_2"} if torch.cuda.is_available() else {}
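
The choice between the two attention settings can also be made at runtime. The sketch below is one possible approach, not part of the model card: `pick_attn_implementation`, `flash_attn_installed`, and `cuda_available` are hypothetical helpers. Note that finding the `flash_attn` package only shows that it is installed; it does not prove the GPU itself is supported.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool, has_flash_attn: bool) -> dict:
    """Build a `model_kwargs` dict like the one used in the examples above.

    Without CUDA the dict stays empty; with CUDA it prefers
    Flash-Attention 2 and falls back to SDPA.
    """
    if not has_cuda:
        return {}
    if has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    return {"attn_implementation": "sdpa"}


def flash_attn_installed() -> bool:
    """True when the `flash_attn` package can be located."""
    return importlib.util.find_spec("flash_attn") is not None


def cuda_available() -> bool:
    """True when torch is installed and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


model_kwargs = pick_attn_implementation(cuda_available(), flash_attn_installed())
print(model_kwargs)
```

The resulting `model_kwargs` can be passed to `pipeline(...)` exactly as in the basic example.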

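📚 Detailed documentation

Raw character error rate (CER)

The table below reports the raw CER, computed without removing punctuation. This differs from the usual CER, where punctuation is stripped before the metric is calculated. The evaluation script can be found here: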

Model	CommonVoice 8 (Japanese test set)	JSUT Basic 5000	ReazonSpeech (held-out test set)
kotoba-tech/kotoba-whisper-v2.0	17.6	15.4	17.4
kotoba-tech/kotoba-whisper-v2.1	17.7	15.4	17
kotoba-tech/kotoba-whisper-v1.0	17.8	15.2	17.8
kotoba-tech/kotoba-whisper-v1.1	17.9	15	17.8
openai/whisper-large-v3	15.3	13.4	20.5
openai/whisper-large-v2	15.9	10.6	34.6
openai/whisper-large	16.6	11.3	40.7
openai/whisper-medium	17.9	13.1	39.3
openai/whisper-base	34.5	26.4	76
openai/whisper-small	21.5	18.9	48.1
openai/whisper-tiny	58.8	38.3	153.3

For the normalized CER, the normalization step removes what v1.1 adds, so kotoba-tech/kotoba-whisper-v1.1 has the same CER as kotoba-tech/kotoba-whisper-v1.0.
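
To make the raw versus normalized distinction concrete, here is a small self-contained sketch. It assumes CER is the character-level edit distance divided by the reference length, and that normalization drops punctuation and whitespace; the actual evaluation script may normalize differently. The two hypotheses are made-up strings that differ only in punctuation.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER in percent: edit distance divided by the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def strip_punctuation(text: str) -> str:
    """Drop characters in a Unicode punctuation category (P*) and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


reference = "81歳、力強い走りに変わってきます。"
hyp_plain = "81歳力強い走りに変わってきます"          # same words, no punctuation
hyp_punct = "81歳、力強い走りに変わってきます。"      # same words, punctuated

for name, hyp in (("plain", hyp_plain), ("punctuated", hyp_punct)):
    raw = cer(reference, hyp)
    norm = cer(strip_punctuation(reference), strip_punctuation(hyp))
    print(f"{name}: raw CER = {raw:.1f}, normalized CER = {norm:.1f}")
```

Both hypotheses get the same normalized score even though their raw scores differ, which is the effect described in the sentence above. Because insertions count as errors, CER can exceed 100, as in the openai/whisper-tiny row.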

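Latency

Kotoba-whisper-v1.1 improves the punctuation and timestamps of the Kotoba-whisper-v1.0 output. Because the punctuator and stable-ts have to be applied to every chunk to obtain them, inference is slower than with the original kotoba-whisper-v1.0. The table below compares inference time for transcribing 50 minutes of Japanese speech; each value is the mean of five independent runs:

Model	return_timestamps	Time (mean)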
kotoba-tech/kotoba-whisper-v1.0	False	10.8
kotoba-tech/kotoba-whisper-v1.0	True	15.7
kotoba-tech/kotoba-whisper-v1.1 (punctuator + stable-ts)	True	17.9
kotoba-tech/kotoba-whisper-v1.1 (punctuator)	True	17.7
kotoba-tech/kotoba-whisper-v1.1 (stable-ts)	True	16.1
openai/whisper-large-v3	False	29.1
openai/whisper-large-v3	True	37.9

The full table is available here.
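
The averaging procedure and the relative cost of post-processing can be sketched as follows. `mean_runtime` is a hypothetical helper showing how a five-run mean might be measured; the numbers in `table` are copied from the `return_timestamps=True` rows above, and since the table does not state a unit, only ratios are computed.

```python
import time
from statistics import mean


def mean_runtime(fn, runs: int = 5) -> float:
    """Call `fn` `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Mean times from the latency table (rows with return_timestamps=True).
table = {
    "kotoba-whisper-v1.0": 15.7,
    "kotoba-whisper-v1.1 (punctuator + stable-ts)": 17.9,
    "kotoba-whisper-v1.1 (punctuator)": 17.7,
    "kotoba-whisper-v1.1 (stable-ts)": 16.1,
    "openai/whisper-large-v3": 37.9,
}

baseline = table["kotoba-whisper-v1.0"]
for name, value in table.items():
    print(f"{name}: {value / baseline:.2f}x the v1.0 time")
```

To time a real run, pass a zero-argument callable, for example `mean_runtime(lambda: pipe("audio.mp3", return_timestamps=True, generate_kwargs=generate_kwargs))`.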

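🔧 Technical details

Kotoba-Whisper-v1.1 is based on kotoba-tech/kotoba-whisper-v1.0 and extends it with an additional post-processing stack integrated into the pipeline. Specifically, it uses stable-ts to refine timestamps and punctuators to add punctuation to the predicted transcript. These libraries are merged into Kotoba-Whisper-v1.1 through the pipeline and are applied seamlessly to the transcripts predicted by kotoba-tech/kotoba-whisper-v1.0.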
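
The idea of a post-processing stack applied to each predicted chunk can be illustrated with a short sketch. This is not the actual Kotoba-Whisper pipeline code: `apply_postprocessing` and `toy_punctuator` are hypothetical names, and the toy punctuator is a trivial placeholder for a real punctuation model.

```python
from typing import Callable, Dict, List

TextStep = Callable[[str], str]


def apply_postprocessing(chunks: List[Dict], steps: List[TextStep]) -> List[Dict]:
    """Run every step, in order, over each chunk's text; other fields are copied unchanged."""
    processed = []
    for chunk in chunks:
        text = chunk["text"]
        for step in steps:
            text = step(text)
        processed.append({**chunk, "text": text})
    return processed


def toy_punctuator(text: str) -> str:
    """Placeholder for a real punctuation model: appends '。' when it is missing."""
    return text if text.endswith("。") else text + "。"


# Hand-written chunks standing in for raw ASR output.
raw_chunks = [
    {"timestamp": (0.0, 1.5), "text": " こんにちは"},
    {"timestamp": (1.5, 3.7), "text": "今日はいい天気です。"},
]

for chunk in apply_postprocessing(raw_chunks, [str.strip, toy_punctuator]):
    print(chunk)
```

Keeping the steps as plain text-to-text callables makes it easy to switch one off, which mirrors the `punctuator=True` / `punctuator=False` toggle shown earlier.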