kb-whisper-tiny open-source speech recognition model: free to deploy, substantially lower error rate for Swedish

Kb Whisper Tiny

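Developed by KBLab

A Whisper model released by the National Library of Sweden and optimized for Swedish speech recognition, with a markedly lower error rate than OpenAI's original models.

Speech recognition

Transformers

License: Apache-2.0 #SwedishSpeechRecognition #LowWER #MultiFormatInference

Downloads: 1,791

Released: 2/14/2025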

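Model overview

A Swedish speech recognition model built on the Whisper architecture, trained on more than 50,000 hours of Swedish data and offered in several inference formats and transcription styles.

Model features

High-performance Swedish recognition

The best-performing model in the family reduces word error rate (WER) by an average of 47% relative to OpenAI's whisper-large-v3

Multi-format support

Checkpoints for Hugging Face, whisper.cpp, ONNX, and ctranslate2

Multiple transcription styles

Three styles: subtitle (more concise), standard (default), and strict (more verbose)

Large-scale training data

Continued pretraining on 56,514 hours of Swedish data, followed by fine-tuning on 8,533 hours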

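Model capabilities

Swedish speech-to-text

Speech recognition with timestamps

Transcription in multiple styles

Use cases

Speech transcription

Swedish meeting minutes

Convert recordings of Swedish meetings into timestamped text

WER of 11.2% on the NST dataset

Media subtitle generation

Automatically generate subtitles for Swedish video content

Output tuned for subtitles via the subtitle style

Speech analysis

Speech data annotation

Assist in annotating Swedish speech datasets

BLEU of 76.6 on FLEURS for kb-whisper-tiny (the family's large-v3 model reaches 89.8)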

🚀 KB-Whisper Tiny

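The National Library of Sweden has released a new suite of Whisper models trained on more than 50,000 hours of Swedish speech. In evaluations on FLEURS, CommonVoice, and NST, our best-performing model reduces word error rate (WER) by an average of 47% compared with OpenAI's whisper-large-v3. The smaller Whisper sizes also improve substantially on Swedish speech: kb-whisper-small outperforms openai/whisper-large-v3, a model six times its size.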

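🚀 Quick start

Checkpoints are provided in several formats to suit different inference setups. Inference examples for each supported tool and library follow later in this document.

✨ Key features

High performance: on Swedish speech recognition, the models achieve a substantially lower word error rate (WER) than OpenAI's whisper-large-v3.
Multi-format support: checkpoints are available for Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2.
Multiple variants: different transcription styles are available, such as subtitle (more concise) and strict (more verbose).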

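📚 Detailed documentation

Model information

Property	Details
Library	transformers
Base model	openai/whisper-tiny
Language	Swedish (sv)
Task	Automatic speech recognition
License	apache-2.0
Training dataset	KBLab/rixvox-v2
Tags	ctranslate2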

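Update 2025-05-13

Loading the model through Hugging Face gives you the default Stage 2 version. As of May 2025, two further Stage 2 versions, Subtitle and Strict, are available alongside the default; each represents a different transcription style.

Pass revision="subtitle" to .from_pretrained() for the more concise transcription style.
Pass revision="strict" to .from_pretrained() for the more verbose transcription style.

The following example passes the argument to .from_pretrained():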

import torch
from transformers import AutoModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-tiny"
# revision selects the transcription style: "subtitle" or "strict"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache", revision="strict"
)
model.to(device)

Ordered from least to most verbose, the three versions are: Subtitle, Stage 2 (default), and Strict.
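The mapping from transcription style to revision can be wrapped in a small helper. The sketch below is illustrative and not part of any KBLab tooling: `revision_kwargs`, `load_model`, and `STYLES` are hypothetical names, and it assumes the default style is obtained by simply omitting the revision argument.

```python
from typing import Any, Dict

MODEL_ID = "KBLab/kb-whisper-tiny"

# Transcription styles, ordered from least to most verbose.
STYLES = ("subtitle", "standard", "strict")


def revision_kwargs(style: str = "standard") -> Dict[str, Any]:
    """Return the extra from_pretrained() keyword arguments for a style.

    Assumption: the default style needs no revision argument; the other
    two are selected with revision="subtitle" or revision="strict".
    """
    if style not in STYLES:
        raise ValueError(f"unknown style {style!r}; expected one of {STYLES}")
    return {} if style == "standard" else {"revision": style}


def load_model(style: str = "standard"):
    """Load kb-whisper-tiny in the requested style (downloads weights)."""
    # Imported lazily so revision_kwargs() is usable without transformers.
    from transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_ID, use_safetensors=True, cache_dir="cache", **revision_kwargs(style)
    )


print(revision_kwargs("strict"))  # {'revision': 'strict'}
```

Validating the style name up front turns a typo into an immediate error instead of a failed download.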

Usage

Checkpoints are provided in several formats: Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2 (used by faster-whisper and WhisperX).

Hugging Face

Inference example for KB-Whisper with Hugging Face:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-tiny"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True to get timestamped output
res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
print(res)
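When the pipeline is called with return_timestamps=True, the timestamped result can be turned into a subtitle file. The sketch below is one possible approach, not something the model card prescribes. It assumes the result contains a "chunks" list whose items hold a "timestamp" pair of (start, end) in seconds and a "text" string. `srt_time`, `chunks_to_srt`, and `fallback_duration` are hypothetical names, and the sample chunks are made up for illustration.

```python
from typing import Iterable, Mapping


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: Iterable[Mapping], fallback_duration: float = 2.0) -> str:
    """Render timestamped chunks as SRT text.

    Assumption: a chunk's end time may be None (for instance on the last
    chunk); in that case the cue lasts fallback_duration seconds.
    """
    cues = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start + fallback_duration
        text = chunk["text"].strip()
        cues.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(cues)


# Made-up chunks in the assumed shape of res["chunks"].
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hej och välkomna."},
    {"timestamp": (2.5, None), "text": " Nu börjar vi."},
]
print(chunks_to_srt(sample_chunks))
```

With a real result, `chunks_to_srt(res["chunks"])` would produce text that can be written to a .srt file.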

Faster-whisper

Faster-whisper provides fast and efficient inference by reimplementing Whisper on top of ctranslate2.

#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-tiny"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert it to 16 kHz mono WAV with ffmpeg first).
# condition_on_previous_text=False can reduce hallucinations when no prompt is used.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
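The example above expects a 16 kHz mono WAV file. A minimal sketch of that ffmpeg conversion driven from Python follows. It assumes ffmpeg is installed and on PATH; `ffmpeg_command` and `convert_to_wav` are hypothetical helper names, and the 16-bit PCM codec is a common choice rather than one stated in the source.

```python
import shlex
import subprocess
from typing import List


def ffmpeg_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that writes a 16 kHz mono WAV file."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", src,             # input file in any format ffmpeg can read
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to a single channel
        "-c:a", "pcm_s16le",   # 16-bit PCM (assumed; typical for WAV)
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_command(src, dst), check=True)


# Show the command without running it.
print(shlex.join(ffmpeg_command("audio.mp3", "audio.wav")))
```

Calling `convert_to_wav("audio.mp3", "audio.wav")` before transcription would produce the input file used above.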

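WhisperX

WhisperX offers a convenient way to obtain accurate word-level timestamps. The library combines Whisper's text output with the accurate timestamps produced by Wav2vec2. The example below uses KB-Whisper together with KBLab/wav2vec2-large-voxrex-swedish: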

import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU memory
compute_type = "float16"  # change to "int8" if low on GPU memory (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-tiny", device, compute_type=compute_type, download_root="cache"  # cache directory
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# Free the model if GPU resources are low
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache directory
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word-level timestamps after alignment
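To work with the word-level timestamps, the aligned segments can be flattened into a single word list. The sketch below assumes each aligned segment carries a "words" list whose entries have "word", "start", and "end" keys. It also assumes that tokens the aligner could not place (digits, for example) may lack timestamps. `collect_words` is a hypothetical helper and the sample data is made up.

```python
from typing import Dict, List


def collect_words(segments: List[Dict]) -> List[Dict]:
    """Flatten aligned segments into a list of timed words.

    Words without both a start and an end time are skipped.
    """
    words = []
    for segment in segments:
        for word in segment.get("words", []):
            if "start" in word and "end" in word:
                words.append(
                    {"word": word["word"], "start": word["start"], "end": word["end"]}
                )
    return words


# Made-up data in the assumed shape of result["segments"] after alignment.
sample_segments = [
    {
        "text": "Hej 2025",
        "words": [
            {"word": "Hej", "start": 0.1, "end": 0.4, "score": 0.9},
            {"word": "2025"},  # no timestamps: skipped
        ],
    }
]
for w in collect_words(sample_segments):
    print(f"{w['start']:.2f}-{w['end']:.2f} {w['word']}")
```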

Whisper.cpp / GGML

GGML checkpoints are provided for use with whisper.cpp and the MacWhisper app. To use the model with whisper.cpp, first clone the repository and build the library:

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

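To use the model, download one of the GGML checkpoints we have uploaded. You can use the download button for the file, or fetch it with wget:

wget https://huggingface.co/KBLab/kb-whisper-tiny/resolve/main/ggml-model-q5_0.bin # quantized version
# wget https://huggingface.co/KBLab/kb-whisper-tiny/resolve/main/ggml-model.bin # non-quantized version

Run inference by passing the model path after -m and the audio file path as the last positional argument: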

./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav

Using onnx (optimum) and transformers.js

The onnx checkpoints can be used through Hugging Face's optimum library as follows:

import soundfile as sf
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-tiny"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

# sf.read returns (samples, sample_rate); the file is expected to be 16 kHz mono
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))

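An example application that runs KB-Whisper locally in the browser with transformers.js is available at https://whisper.mesu.re/ (created by Pierre Mesure). A JavaScript template for setting up such an application is available at https://github.com/xenova/whisper-web.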

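Training data

The models were trained on more than 50,000 hours of Swedish audio with text transcriptions. Training proceeded in two stages, each characterized by different quality filters and corresponding thresholds.

Stage 1 used low thresholds (BLEU between 0 and 0.30, depending on the dataset).
Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, and CER <= 0.2 for the first and last 10 characters).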
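The Stage 2 thresholds can be read as a simple predicate over per-sample scores. The sketch below is an interpretation, not the authors' pipeline. It assumes the scores compare two transcripts of the same audio, that BLEU and ROUGE-N are precomputed on a 0 to 1 scale, and that both the head and the tail CER must satisfy the limit. `QualityScores`, `edge_cers`, and `passes_stage2` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def edge_cers(reference: str, hypothesis: str, n: int = 10) -> Tuple[float, float]:
    """CER over the first n and the last n characters."""
    return cer(reference[:n], hypothesis[:n]), cer(reference[-n:], hypothesis[-n:])


@dataclass
class QualityScores:
    bleu: float      # assumed precomputed, 0..1
    rouge_n: float   # weighted ROUGE-N, assumed precomputed, 0..1
    cer_head: float  # CER of the first 10 characters
    cer_tail: float  # CER of the last 10 characters


def passes_stage2(s: QualityScores) -> bool:
    """Apply the Stage 2 thresholds quoted above."""
    return s.bleu >= 0.7 and s.rouge_n >= 0.7 and s.cer_head <= 0.2 and s.cer_tail <= 0.2


head, tail = edge_cers("det var en gång en liten flicka", "det var en gang en liten flicka")
print(passes_stage2(QualityScores(bleu=0.82, rouge_n=0.90, cer_head=head, cer_tail=tail)))
```

In this made-up pair the only mismatch sits in the middle of the sentence, so both edge CERs are 0. Checking the edges separately appears aimed at text that starts or ends out of sync with the audio.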

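Dataset	Stage 1 - continued pretraining (h)	Stage 2 - fine-tuning (h)
Subtitles	34,261	3,110
Swedish Parliament	21,949	5,119
ISOF	54	54
NST	250	250
Total	56,514	8,533

Loading the model through Hugging Face gives you the default Stage 2 version. We have also uploaded and tagged the continued-pretraining checkpoints; you can load them by specifying revision in .from_pretrained(). For example, the pretrained checkpoint is tagged pretrained-checkpoint, and the default Stage 2 model is tagged standard. Two further Stage 2 checkpoints are available: subtitle, with a more concise transcription style, and strict, with a more verbose one.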

Evaluation

WER comparison with OpenAI models (%, lower is better)

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3
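The table can be summarized as relative WER reductions. The sketch below averages the per-dataset relative reduction for a pair of models. Whether the authors computed the quoted 47% average this way is an assumption, although the large-v3 rows do give a value close to it. The numbers are copied from the table; the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence

# WER (%) on FLEURS, CommonVoice, NST, copied from the table above.
WER = {
    "tiny": {"KBLab": (13.2, 12.9, 11.2), "OpenAI": (59.2, 67.8, 85.2)},
    "small": {"KBLab": (7.3, 6.4, 6.6), "OpenAI": (20.6, 26.4, 26.4)},
    "large-v3": {"KBLab": (5.4, 4.1, 5.2), "OpenAI": (7.8, 9.5, 11.3)},
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that has been removed."""
    return (baseline - improved) / baseline


def mean_relative_reduction(improved: Sequence[float], baseline: Sequence[float]) -> float:
    """Average the per-dataset relative reductions."""
    return mean(relative_reduction(b, i) for b, i in zip(baseline, improved))


large = mean_relative_reduction(WER["large-v3"]["KBLab"], WER["large-v3"]["OpenAI"])
tiny = mean_relative_reduction(WER["tiny"]["KBLab"], WER["tiny"]["OpenAI"])
print(f"large-v3 vs OpenAI large-v3: {large:.1%}")  # 47.2%
print(f"tiny vs OpenAI tiny: {tiny:.1%}")

# kb-whisper-small against OpenAI's large-v3, dataset by dataset.
small_beats_large = all(
    kb < oa for kb, oa in zip(WER["small"]["KBLab"], WER["large-v3"]["OpenAI"])
)
print(small_beats_large)  # True
```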

WER of the different KBLab Stage 2 versions (%, lower is better)

Model size		FLEURS	CommonVoice	NST
tiny	standard	13.2	12.9	11.2
	strict	14.1	13.4	11.0
	subtitle	13.3	12.9	11.4
base	standard	9.1	8.7	7.8
	strict	10.4	9.6	8.4
	subtitle	9.1	8.7	7.9
small	standard	7.3	6.4	6.6
	strict	8.2	7.0	6.7
	subtitle	7.3	6.4	6.6
medium	standard	6.6	5.4	5.8
	strict	6.8	5.4	6.0
large-v3	standard	5.4	4.1	5.2
	strict	5.3	4.0	5.1
	subtitle	5.3	4.1	5.3
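For readers less familiar with the metric, WER is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below is a plain implementation with a made-up Swedish example; it is not the evaluation code behind the table. It shows how a more condensed transcript is penalized against a verbatim reference, which might be one reason the styles score differently on some datasets.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the verbatim reference keeps filler words,
# while the condensed hypothesis drops two of them.
verbatim = "ja alltså jag tycker att det är bra"
condensed = "jag tycker att det är bra"
print(word_error_rate(verbatim, condensed))  # 0.25
```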

BLEU comparison with OpenAI models (higher is better)

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	76.6	73.7	74.3
	OpenAI	26.9	21.1	24.0
base	KBLab	83.2	79.9	78.3
	OpenAI	41.1	32.5	36.9
small	KBLab	86.6	83.5	79.6
	OpenAI	64.0	56.5	58.2
medium	KBLab	87.6	85.0	80.2
	OpenAI	77.1	70.1	68.9
large-v3	KBLab	89.8	87.2	81.1
	OpenAI	84.9	79.1	75.1

BLEU scores of the different KBLab Stage 2 versions (higher is better)

Model size		FLEURS	CommonVoice	NST
tiny	standard	76.6	73.7	74.3
	strict	75.3	72.9	74.6
	subtitle	76.6	73.7	74.1
base	standard	83.2	79.9	78.3
	strict	81.0	78.4	77.5
	subtitle	83.2	79.8	78.2
small	standard	86.6	83.5	79.6
	strict	84.9	82.4	79.3
	subtitle	86.6	83.5	79.6
medium	standard	87.6	85.0	80.2
	strict	87.3	84.9	80.1
large-v3	standard	89.8	87.2	81.1
	strict	90.0	87.4	81.2
	subtitle	89.8	87.3	81.0