kb-whisper-small open-source speech model: optimized for Swedish, outperforms OpenAI's original Whisper

Kb Whisper Small

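Developed by KBLab

A Whisper model released by the National Library of Sweden, optimized for Swedish and trained on more than 50,000 hours of Swedish speech. It outperforms OpenAI's original Whisper models on Swedish.

Speech recognition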

Transformers

Open-source license: Apache-2.0 #Swedish speech recognition #low word error rate #multi-format support

Downloads: 28.61k

Released: 2/14/2025

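Model overview

A Swedish automatic speech recognition (ASR) model built on the OpenAI Whisper architecture. It substantially lowers word error rate (WER) on Swedish and is available in several inference formats.

Model features

Optimized for Swedish

Trained specifically for Swedish. The best KB-Whisper model lowers WER by an average of 47% compared with OpenAI's whisper-large-v3.

Multi-format support

Checkpoints are provided for Hugging Face, GGML, ONNX, and ctranslate2.

Two-stage training

Training runs in two stages: stage 1 filters the data with low thresholds, and stage 2 applies strict quality filters.

Selectable transcription style

Three transcription styles are available: the concise subtitle version, the balanced standard version, and the more verbatim strict version.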

Model capabilities

Swedish speech recognition

Transcription with timestamps

Speech content analysis

Multi-format inference support

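Use cases

Speech transcription

Meeting minutes

Automatically transcribe recordings of Swedish-language meetings into text.

WER as low as 6.4% (CommonVoice dataset)

Media subtitle generation

Automatically generate subtitles for Swedish-language video.

A subtitle-tuned version is available (revision=subtitle)

Speech analysis

Speech content analysis

Analyze Swedish speech content and extract key information.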

🚀 KB-Whisper Small

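The National Library of Sweden has released a new suite of Whisper models trained on more than 50,000 hours of Swedish speech. In evaluations on FLEURS, CommonVoice, and NST, our best-performing model reduces word error rate (WER) by an average of 47% compared with OpenAI's whisper-large-v3. The smaller Whisper sizes also improve substantially on Swedish speech: kb-whisper-small outperforms openai/whisper-large-v3, a model six times its size.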

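✨ Key features

High accuracy: evaluated on several Swedish speech datasets, the best model lowers word error rate (WER) by an average of 47% compared with OpenAI's whisper-large-v3, and the smaller models improve substantially as well.
Multi-format support: checkpoints are provided for Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2.
Multiple transcription styles: besides the default style, a more concise Subtitle style and a more verbatim Strict style are available.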

📦 Installation guide

Whisper.cpp

To run the model with whisper.cpp, first clone the repository and build the library:

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

Then download the GGML checkpoint, either with the download button on the model page or with wget:

wget https://huggingface.co/KBLab/kb-whisper-small/resolve/main/ggml-model-q5_0.bin # quantized version
# wget https://huggingface.co/KBLab/kb-whisper-small/resolve/main/ggml-model.bin # non-quantized version

💻 Usage examples

Basic usage

Hugging Face

Inference example for KB-Whisper with Hugging Face Transformers:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True to get timestamped output
res = pipe("audio.mp3",
           chunk_length_s=30,
           generate_kwargs=generate_kwargs)
print(res)
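
The comment above mentions `return_timestamps=True`. As a minimal sketch, assuming the pipeline then returns a dict with a `chunks` list whose items carry a `timestamp` tuple and a `text` string, the hypothetical helper `format_chunks` below prints one line per chunk. The sample `res` is made-up illustration data, not model output.

```python
def format_chunks(result):
    """Turn a timestamped ASR pipeline result into printable lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end of the last chunk can be None when the audio is cut off mid-segment.
        end_text = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_text}] {chunk['text'].strip()}")
    return lines


# Made-up example in the assumed shape of pipe(..., return_timestamps=True)
res = {
    "text": " Hej och välkommen. Nu börjar vi.",
    "chunks": [
        {"timestamp": (0.0, 2.4), "text": " Hej och välkommen."},
        {"timestamp": (2.4, None), "text": " Nu börjar vi."},
    ],
}
for line in format_chunks(res):
    print(line)
```

With a real pipeline, the same helper would be called on the value returned by `pipe(..., return_timestamps=True)`.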

Advanced usage

Faster-whisper

Faster-whisper reimplements Whisper with ctranslate2 for fast, efficient inference:

#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-small"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV with ffmpeg first).
# condition_on_previous_text=False can reduce hallucinations when no prompt is used.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
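
Since subtitle generation is one of the listed use cases, the loop above can be adapted to write SRT instead of printing. The sketch below is an illustration only: `srt_timestamp` and `segments_to_srt` are hypothetical helper names, and the `Segment` namedtuple merely stands in for faster-whisper segments, which expose `.start`, `.end`, and `.text`.

```python
from collections import namedtuple


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text from objects exposing .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # Each block already ends with a newline, so joining adds the blank separator line.
    return "\n".join(blocks)


# Stand-in for faster-whisper segments; made-up values for illustration only
Segment = namedtuple("Segment", ["start", "end", "text"])
demo = [
    Segment(0.0, 2.5, " Hej och välkommen."),
    Segment(2.5, 3725.042, " Tack för idag."),
]
print(segments_to_srt(demo))
```

Because `model.transcribe` returns a lazy generator, passing `segments` to `segments_to_srt` would also be what triggers the actual transcription.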

WhisperX

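WhisperX offers a convenient way to get accurate word-level timestamps. The library combines Whisper's text output with accurate timestamps from Wav2vec2. The example below uses KB-Whisper together with KBLab/wav2vec2-large-voxrex-swedish: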

import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if the GPU runs out of memory
compute_type = "float16"  # change to "int8" if GPU memory is low (may reduce accuracy)

# 1. Transcribe with KB-Whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-small", device, compute_type=compute_type, download_root="cache"  # cache directory
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# Delete the model if GPU resources are low
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align the Whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache directory
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word-level timestamps after alignment
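
To pull the word-level timestamps out of the aligned result, a small helper can flatten the segments. This is a sketch under an assumption: each aligned segment is taken to carry a `words` list of dicts with `word`, `start`, and `end` keys, and tokens the aligner could not place are taken to lack the time keys. The function name `word_timestamps` and the sample data are hypothetical.

```python
def word_timestamps(aligned_result):
    """Collect (word, start, end) tuples from a WhisperX-style aligned result."""
    words = []
    for segment in aligned_result["segments"]:
        for item in segment.get("words", []):
            # Tokens without alignment (for example digits) may have no times; skip them.
            if "start" in item and "end" in item:
                words.append((item["word"], item["start"], item["end"]))
    return words


# Made-up data in the assumed shape of the whisperx.align() output
demo_result = {
    "segments": [
        {
            "text": "Klockan är 12.",
            "words": [
                {"word": "Klockan", "start": 0.10, "end": 0.55, "score": 0.91},
                {"word": "är", "start": 0.60, "end": 0.72, "score": 0.88},
                {"word": "12."},
            ],
        }
    ]
}
for word, start, end in word_timestamps(demo_result):
    print(f"{start:6.2f} {end:6.2f} {word}")
```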

onnx (optimum) and transformers.js usage

The onnx checkpoints can be used through Hugging Face's optimum library as follows:

import soundfile as sf
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-small"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

# sf.read returns (samples, sample_rate). Whisper expects 16 kHz mono audio;
# passing the file's real rate lets the feature extractor reject other rates.
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))
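
Several examples in this section read `audio.wav` and expect 16 kHz mono input. One way to prepare such a file is to call ffmpeg from Python, as sketched below. The function names are hypothetical, and ffmpeg is assumed to be installed and on the PATH; the flags used (`-ar`, `-ac`, `-c:a pcm_s16le`) are standard ffmpeg audio options.

```python
import subprocess


def ffmpeg_to_wav16k_cmd(src, dst):
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",       # overwrite dst if it already exists
        "-i", src,
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav16k(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav16k_cmd(src, dst), check=True)


# Show the command without running it
print(" ".join(ffmpeg_to_wav16k_cmd("audio.mp3", "audio.wav")))
```

Calling `convert_to_wav16k("audio.mp3", "audio.wav")` before the transcription snippets would produce the expected input file.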

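An example application that runs KB-Whisper locally in the browser with transformers.js is available at https://whisper.mesu.re/ (created by Pierre Mesure). A template for setting up such an application in JavaScript is available at https://github.com/xenova/whisper-web.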

📚 Detailed documentation

Model performance comparison

Word error rate (WER, lower is better)

Model size	Source	FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3
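
The 47% headline figure can be reproduced from the large-v3 rows of the table above. The sketch below assumes the figure is the unweighted mean of the per-dataset relative WER reductions; the source does not state the averaging method, but under that assumption the result comes out at roughly 47%. It also checks the claim that kb-whisper-small beats openai/whisper-large-v3 on each dataset.

```python
# WER values copied from the table above
kblab_large = {"FLEURS": 5.4, "CommonVoice": 4.1, "NST": 5.2}
openai_large = {"FLEURS": 7.8, "CommonVoice": 9.5, "NST": 11.3}
kblab_small = {"FLEURS": 7.3, "CommonVoice": 6.4, "NST": 6.6}


def relative_reduction(new, old):
    """Relative WER reduction in percent (positive means `new` is better)."""
    return 100.0 * (old - new) / old


reductions = {
    name: relative_reduction(kblab_large[name], openai_large[name])
    for name in openai_large
}
# Assumption: unweighted mean over the three datasets
average_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower WER")
print(f"average: {average_reduction:.1f}%")

# kb-whisper-small against openai/whisper-large-v3, dataset by dataset
small_beats_large = all(kblab_small[n] < openai_large[n] for n in openai_large)
print("small beats OpenAI large-v3 on every dataset:", small_beats_large)
```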

BLEU score (higher is better)

Model size	Source	FLEURS	CommonVoice	NST
tiny	KBLab	76.6	73.7	74.3
	OpenAI	26.9	21.1	24.0
base	KBLab	83.2	79.9	78.3
	OpenAI	41.1	32.5	36.9
small	KBLab	86.6	83.5	79.6
	OpenAI	64.0	56.5	58.2
medium	KBLab	87.6	85.0	80.2
	OpenAI	77.1	70.1	68.9
large-v3	KBLab	89.8	87.2	81.1
	OpenAI	84.9	79.1	75.1

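Training data

Our models were trained on more than 50,000 hours of Swedish audio paired with text transcripts. Training ran in two stages, each applying different quality filters and thresholds:

Stage 1: low thresholds (BLEU between 0 and 0.30, depending on the dataset).
Stage 2: stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of the first and last 10 characters <= 0.2).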
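
The two filters can be read as simple predicates over precomputed scores. The sketch below is illustrative only and not KBLab's pipeline: the `SegmentScores` container and function names are hypothetical, how the scores are computed is left out because the source does not describe it, stage 1 is assumed to keep segments at or above its BLEU threshold, and the CER condition is assumed to apply to the head and the tail separately.

```python
from dataclasses import dataclass


@dataclass
class SegmentScores:
    """Hypothetical container for precomputed quality scores of one segment."""
    bleu: float
    rouge_n: float   # weighted ROUGE-N
    cer_head: float  # CER over the first 10 characters
    cer_tail: float  # CER over the last 10 characters


def passes_stage1(scores, bleu_threshold=0.30):
    """Stage 1: low BLEU threshold (0 to 0.30 depending on the dataset)."""
    return scores.bleu >= bleu_threshold


def passes_stage2(scores):
    """Stage 2: every one of the stricter thresholds must hold."""
    return (
        scores.bleu >= 0.7
        and scores.rouge_n >= 0.7
        and scores.cer_head <= 0.2
        and scores.cer_tail <= 0.2
    )


# Made-up scores for illustration
good = SegmentScores(bleu=0.82, rouge_n=0.90, cer_head=0.0, cer_tail=0.1)
rough = SegmentScores(bleu=0.45, rouge_n=0.60, cer_head=0.3, cer_tail=0.0)
print(passes_stage1(good), passes_stage2(good))
print(passes_stage1(rough), passes_stage2(rough))
```

A segment like `rough` would be kept for stage 1 continued pretraining but dropped from the stage 2 finetuning set.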

Dataset	Stage 1 continued pretraining (h)	Stage 2 finetuning (h)
Subtitles	34,261	3,110
Parliament	21,949	5,119
ISOF	54	54
NST	250	250
Total	56,514	8,533

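When you load our models through Hugging Face, the stage 2 model is used by default. We have also uploaded and tagged the continued-pretraining checkpoints. You can load these other checkpoints by specifying revision in .from_pretrained(). The pretrained checkpoint, for example, is tagged pretrained-checkpoint. The default stage 2 model is tagged standard. We also provide a different stage 2 checkpoint with a more concise transcription style, tagged subtitle.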

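🔧 Technical details

Model update notes (2025-05-13)

When you load our models through Hugging Face, the stage 2 model is used by default. As of May 2025 there are two further stage 2 versions besides the default, Subtitle and Strict, which set the transcription style. Specify revision="subtitle" in .from_pretrained() for the more concise version, or revision="strict" for the more verbatim version. The example below passes this argument to .from_pretrained():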

import torch
from transformers import AutoModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-small"
# revision selects the transcription style: "standard" (default), "subtitle", or "strict"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache", revision="strict"
)
model.to(device)
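
When switching between styles often, it can help to validate the tag before any download starts. The sketch below is one possible wrapper, not part of the KBLab release: `KNOWN_REVISIONS` lists only the tags named in this document, and `from_pretrained_kwargs` and `load_model` are hypothetical helper names.

```python
# Revision tags named in this document
KNOWN_REVISIONS = ("standard", "subtitle", "strict", "pretrained-checkpoint")


def from_pretrained_kwargs(revision="standard", cache_dir="cache"):
    """Build keyword arguments for .from_pretrained() for a given revision tag."""
    if revision not in KNOWN_REVISIONS:
        raise ValueError(
            f"unknown revision {revision!r}; expected one of {KNOWN_REVISIONS}"
        )
    return {"revision": revision, "cache_dir": cache_dir, "use_safetensors": True}


def load_model(revision="standard", model_id="KBLab/kb-whisper-small"):
    """Load the checkpoint for one revision (downloads weights on first use)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, **from_pretrained_kwargs(revision)
    )


print(from_pretrained_kwargs("subtitle"))
```

For example, `load_model("subtitle")` would fetch the more concise style, while a mistyped tag fails immediately with a `ValueError`.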