kb-whisper-large: open-source Swedish speech recognition model, trained on 50,000 hours of data to reduce word error rate

Kb Whisper Large

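Developed by KBLab

A Swedish speech recognition model based on the Whisper architecture, released by the National Library of Sweden. It was trained on more than 50,000 hours of data and significantly reduces the word error rate.

Speech recognition

Transformers

License: Apache-2.0 #Swedish speech recognition #low word error rate #multi-format support

Downloads: 8,880

Released: 2/14/2025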

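Model overview

A speech recognition model optimized for Swedish, built on the OpenAI Whisper architecture, with strong results on several Swedish datasets.

Model features

Significantly lower word error rate

Compared with OpenAI's whisper-large-v3, reduces the word error rate (WER) on Swedish by 47% on average across FLEURS, CommonVoice, and NST

Multi-format support

Model checkpoints are provided in Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2 formats

Multiple transcription styles

Three transcription style variants are available: subtitle (condensed), standard (default), and strict (verbatim)

Large-scale training data

Trained on more than 50,000 hours of Swedish speech, in two stages with different quality filters

Model capabilities

Swedish speech recognition

Transcription with timestamps

Inference in multiple formats

Batched transcription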

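Use cases

Speech transcription

Meeting transcription

Convert recordings of Swedish-language meetings into written records

Highly accurate transcripts

Subtitle generation

Generate subtitles for Swedish video content

Subtitle files with timestamps

Speech analysis

Speech content analysis

Analyze Swedish speech content for downstream processing

Structured text data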

🚀 KB-Whisper Large

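The National Library of Sweden has released a new suite of Whisper models trained on more than 50,000 hours of Swedish speech. In evaluations on FLEURS, CommonVoice, and NST, our best-performing model reduces the word error rate (WER) by an average of 47% compared with OpenAI's whisper-large-v3. Smaller Whisper models also improve substantially on Swedish speech: kb-whisper-small outperforms openai/whisper-large-v3, a model six times its size.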

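🚀 Quick start

Checkpoints are provided in several formats: Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2 (used by faster-whisper and WhisperX). Usage examples for each follow.

💻 Usage examples

Basic usage

The following example runs inference with KB-Whisper through Hugging Face: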

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
print(res)
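
When `return_timestamps=True` is passed to the pipeline call, the result also carries a `chunks` list whose entries hold a `timestamp` pair (start and end, in seconds) and the chunk `text`. As a minimal sketch, the helper below turns such a list into SRT subtitle text. The function names `format_srt_time` and `to_srt` are hypothetical, and the fallback used when the last chunk has no end time is an assumption, not part of KB-Whisper.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(chunks, fallback_duration: float = 2.0) -> str:
    """Build SRT text from chunks shaped like {"timestamp": (start, end), "text": ...}."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk may lack an end time
            end = start + fallback_duration
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written sample in the pipeline's chunk format (not real model output)
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hej och välkommen."},
    {"timestamp": (2.5, None), "text": " Vi börjar nu."},
]
print(to_srt(sample_chunks))
```

With real output, the call would be `to_srt(res["chunks"])`.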

Advanced usage

The following examples cover faster-whisper, WhisperX, whisper.cpp / GGML, onnx (optimum), and transformers.js:

Faster-whisper

Faster-whisper reimplements Whisper on top of ctranslate2 for fast, efficient inference.

#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-large"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV first via ffmpeg).
# condition_on_previous_text=False can reduce hallucinations when no prompt is used.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
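
The faster-whisper example expects 16 kHz mono WAV input. A minimal sketch of that conversion follows, assuming `ffmpeg` is installed and on the PATH. The helper names and file names are placeholders, not part of faster-whisper.

```python
import subprocess


def ffmpeg_to_wav_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", src,
        "-ar", "16000",       # sample rate: 16 kHz
        "-ac", "1",           # channels: mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_command(src, dst), check=True)


# Print the command without running it
print(" ".join(ffmpeg_to_wav_command("audio.mp3", "audio.wav")))
```

Calling `convert_to_wav("audio.mp3", "audio.wav")` would then produce the file that the transcription examples read.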

WhisperX

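WhisperX offers a convenient way to obtain accurate word-level timestamps. The library combines Whisper's text output with the precise timestamps produced by Wav2vec2. The example below uses KB-Whisper together with KBLab/wav2vec2-large-voxrex-swedish: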

import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-large", device, compute_type=compute_type, download_root="cache"  # cache_dir
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# Free the model if low on GPU resources
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word level timestamps after alignment
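
After alignment, each entry in `result["segments"]` is expected to carry a `words` list with per-word `start` and `end` times. The sketch below flattens that structure into `(word, start, end)` tuples. It assumes the key names `words`, `word`, `start`, and `end`, and that words the aligner could not place simply lack timing keys; check this against the output of your installed WhisperX version. The function name `flatten_words` is hypothetical.

```python
def flatten_words(segments):
    """Collect (word, start, end) for every timed word across all segments."""
    words = []
    for segment in segments:
        for item in segment.get("words", []):
            # Skip words without timing information (assumed to be unaligned)
            if "start" not in item or "end" not in item:
                continue
            words.append((item["word"], item["start"], item["end"]))
    return words


# Hand-written sample in the assumed format (not real model output)
sample_segments = [
    {
        "text": "Klockan är 12 nu",
        "words": [
            {"word": "Klockan", "start": 0.10, "end": 0.52},
            {"word": "är", "start": 0.55, "end": 0.68},
            {"word": "12"},  # no timing: could not be aligned
            {"word": "nu", "start": 1.20, "end": 1.41},
        ],
    }
]
for word, start, end in flatten_words(sample_segments):
    print(f"{start:6.2f} -> {end:6.2f}  {word}")
```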

Whisper.cpp / GGML

We provide GGML checkpoints for use with whisper.cpp and the MacWhisper app. To use our models with whisper.cpp, first clone the repository and build the library:

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

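To use the model, download one of the GGML checkpoints we have uploaded. You can use the download button in the model repository, or fetch a checkpoint with wget: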

wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model.bin # Non-quantized version

Run inference by passing the model path after the -m flag and the path to the audio file as the last positional argument:

./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav

onnx (optimum) and transformers.js usage

The onnx checkpoints can be used through Hugging Face's optimum library as follows:

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-large"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

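import soundfile as sf

# sf.read returns (samples, sample_rate); the model expects 16 kHz mono audio,
# and the feature extractor raises an error if the sample rate does not match.
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))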

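An example application that runs KB-Whisper locally in the browser with transformers.js is available at https://whisper.mesu.re/ (created by Pierre Mesure). A template for setting up such an application in JavaScript is available at https://github.com/xenova/whisper-web.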

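📚 Detailed documentation

Training data

Our models were trained on more than 50,000 hours of Swedish audio with text transcripts. Training ran in two stages, each characterized by different quality filters and filter thresholds.

Stage 1 used low thresholds (BLEU between 0 and 0.30, depending on the dataset), whereas Stage 2 used stricter ones (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of the first and last 10 characters <= 0.2).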
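
The two-stage filtering can be sketched as a simple predicate over per-sample quality scores. This is an illustration only: the `SampleScores` fields, the function names, and the assumption that scores are precomputed on a 0 to 1 scale are all hypothetical, and this page does not describe how the scores were computed. Only the threshold values come from the text. Treating the Stage 1 threshold as a lower bound on BLEU, and checking the CER of the first and last 10 characters separately, are also assumptions.

```python
from dataclasses import dataclass


@dataclass
class SampleScores:
    """Hypothetical per-sample quality scores, all on a 0-1 scale."""
    bleu: float
    rouge_n: float   # weighted ROUGE-N
    cer_head: float  # CER over the first 10 characters
    cer_tail: float  # CER over the last 10 characters


def passes_stage1(scores: SampleScores, bleu_threshold: float) -> bool:
    """Stage 1: lenient BLEU threshold, 0 to 0.30 depending on the dataset."""
    if not 0.0 <= bleu_threshold <= 0.30:
        raise ValueError("stage 1 thresholds range from 0 to 0.30")
    return scores.bleu >= bleu_threshold


def passes_stage2(scores: SampleScores) -> bool:
    """Stage 2: stricter thresholds on BLEU, ROUGE-N, and edge CER."""
    return (
        scores.bleu >= 0.7
        and scores.rouge_n >= 0.7
        and scores.cer_head <= 0.2
        and scores.cer_tail <= 0.2
    )


# Made-up scores for illustration
noisy = SampleScores(bleu=0.35, rouge_n=0.50, cer_head=0.4, cer_tail=0.1)
clean = SampleScores(bleu=0.82, rouge_n=0.88, cer_head=0.0, cer_tail=0.1)
print(passes_stage1(noisy, 0.30), passes_stage2(noisy))
print(passes_stage1(clean, 0.30), passes_stage2(clean))
```

A sample like `noisy` would be kept for continued pretraining but dropped before finetuning, which matches the smaller Stage 2 hour counts.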

Dataset	Continued pretraining (hours) - Stage 1	Finetuning (hours) - Stage 2
Subtitles	34,261	3,110
Parliament	21,949	5,119
ISOF	54	54
NST	250	250
Total	56,514	8,533

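When our models are loaded through Hugging Face, Stage 2 is the default. We have also uploaded the continued-pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the revision in .from_pretrained(). The pretrained checkpoint, for example, is tagged pretrained-checkpoint. The default Stage 2 model is tagged standard. We also provide a different Stage 2 checkpoint with a more condensed transcription style, tagged subtitle.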
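
As a sketch of the revision mechanism: `from_pretrained()` in transformers accepts a `revision` argument naming a branch, tag, or commit. The tag names below are the ones mentioned in this section; the strict style listed under the model features presumably has its own tag, but its name is not given here, so it is left out. The wrapper function `load_kb_whisper` and the constant `KNOWN_REVISIONS` are hypothetical.

```python
KNOWN_REVISIONS = ("standard", "subtitle", "pretrained-checkpoint")


def load_kb_whisper(revision: str = "standard", model_id: str = "KBLab/kb-whisper-large"):
    """Load a tagged KB-Whisper checkpoint and its processor."""
    if revision not in KNOWN_REVISIONS:
        raise ValueError(f"unknown revision {revision!r}; expected one of {KNOWN_REVISIONS}")
    # Imported here so the tag check above works without transformers installed
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, revision=revision)
    processor = AutoProcessor.from_pretrained(model_id, revision=revision)
    return model, processor


# Example (downloads the model weights on first use):
# model, processor = load_kb_whisper("subtitle")
```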

Evaluation

Word error rate (WER, %; lower is better)

Model size	Developer	FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3
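
The 47% figure quoted in the introduction can be checked against the large-v3 rows of the WER table. The sketch below computes the relative WER reduction per dataset and its mean, and also checks the claim about kb-whisper-small. The variable and function names are illustrative.

```python
# WER values copied from the table above
wer = {
    "large-v3": {"KBLab": (5.4, 4.1, 5.2), "OpenAI": (7.8, 9.5, 11.3)},
    "small": {"KBLab": (7.3, 6.4, 6.6), "OpenAI": (20.6, 26.4, 26.4)},
}
datasets = ("FLEURS", "CommonVoice", "NST")


def relative_reductions(kb, openai):
    """Relative WER reduction per dataset: 1 - WER_KB / WER_OpenAI."""
    return [1 - k / o for k, o in zip(kb, openai)]


reductions = relative_reductions(wer["large-v3"]["KBLab"], wer["large-v3"]["OpenAI"])
for name, value in zip(datasets, reductions):
    print(f"{name}: {value:.1%}")
mean_reduction = sum(reductions) / len(reductions)
print(f"mean: {mean_reduction:.1%}")

# kb-whisper-small against openai/whisper-large-v3, dataset by dataset
small_beats_large = all(
    k < o for k, o in zip(wer["small"]["KBLab"], wer["large-v3"]["OpenAI"])
)
print("kb-whisper-small beats whisper-large-v3 on every dataset:", small_beats_large)
```

The mean comes out at roughly 47%, in line with the headline figure, and the small model has the lower WER on all three datasets.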

BLEU score (higher is better)

Model size	Developer	FLEURS	CommonVoice	NST
tiny	KBLab	76.6	73.7	74.3
	OpenAI	26.9	21.1	24.0
base	KBLab	83.2	79.9	78.3
	OpenAI	41.1	32.5	36.9
small	KBLab	86.6	83.5	79.6
	OpenAI	64.0	56.5	58.2
medium	KBLab	87.6	85.0	80.2
	OpenAI	77.1	70.1	68.9
large-v3	KBLab	89.8	87.2	81.1
	OpenAI	84.9	79.1	75.1