kb-whisper-large: an open-source Swedish speech recognition model trained on 50,000 hours of data to lower the word error rate

Kb Whisper Large

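Developed by KBLab

A Swedish speech recognition model based on the Whisper architecture, released by the National Library of Sweden. It was trained on more than 50,000 hours of data and substantially lowers the word error rate.

Speech recognition

Transformers

Open-source license: Apache-2.0 #Swedish speech recognition #Low word error rate #Multi-format support

Downloads: 8,880

Release date: 2/14/2025

Model overview

A speech recognition model optimized for Swedish, based on the OpenAI Whisper architecture, that performs well across several Swedish datasets.

Model features

Substantially lower word error rate

Reduces the word error rate (WER) on Swedish speech recognition by an average of 47% compared with OpenAI's original model

Multi-format support

Model checkpoints are provided in several formats, including Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2

Multiple transcription styles

Three transcription style versions are provided: subtitle (condensed), standard (default), and strict (verbatim)

Large-scale training data

Trained on more than 50,000 hours of Swedish speech data, in two stages that differ in data quality

Model capabilities

Swedish speech recognition

Speech transcription with timestamps

Inference in multiple formats

Batched speech transcription

Use cases

Speech transcription

Meeting minutes transcription

Converts Swedish meeting recordings into text records

High-accuracy transcribed text

Subtitle generation

Generates subtitles for Swedish video content

Subtitle files with timestamps

Speech analysis

Speech content analysis

Analyzes Swedish speech content for downstream processing

Structured text data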

library_name: transformers
base_model: openai/whisper-large-v3
language:
- sv
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- KBLab/rixvox-v2
tags:
- ctranslate2

KB-Whisper Large

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with kb-whisper-small outperforming openai/whisper-large-v3 (a model six times its size).

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3

Table: Word Error Rate (WER) comparison between KBLab's Whisper models and the corresponding OpenAI versions.
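The 47% figure can be checked against the large-v3 rows of the table. The sketch below assumes the headline number is the unweighted mean of the per-dataset relative WER reductions; the text does not state how the average was taken, so treat that as an assumption.

```python
# WER values (percent) copied from the large-v3 rows of the table above.
kblab_wer = {"FLEURS": 5.4, "CommonVoice": 4.1, "NST": 5.2}
openai_wer = {"FLEURS": 7.8, "CommonVoice": 9.5, "NST": 11.3}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error rate of `baseline`."""
    return (baseline - improved) / baseline


reductions = {
    name: relative_reduction(openai_wer[name], kblab_wer[name])
    for name in kblab_wer
}
# Assumption: the headline number is an unweighted mean over the three datasets.
mean_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
print(f"mean: {mean_reduction:.1%}")
```

Under that assumption the mean comes out at roughly 47%, with the smallest gain on FLEURS and the largest on CommonVoice.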

Usage

We provide checkpoints in different formats: Hugging Face, whisper.cpp (GGML), onnx, and ctranslate2 (used in faster-whisper and WhisperX).

2025-05-13 Update!

The default when loading our models through Hugging Face is Stage 2. As of May 2025 there exist two Stage 2 versions in addition to the default, namely Subtitle and Strict, which differ in transcription style. Specifying revision="subtitle" in .from_pretrained() loads the version with a more condensed style of transcribing; specifying revision="strict" loads the more verbatim version. Below is an example of how this argument is passed to .from_pretrained():

import torch
from transformers import AutoModelForSpeechSeq2Seq

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"
# revision selects the transcription style: "subtitle", "strict", or omit it for the default
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache", revision="strict"
)

The three model versions range in verbosity from Subtitle (least verbose), through Stage 2 (default), to Strict (most verbose).
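When the style is chosen at runtime, it can help to keep the mapping from style to revision in one place. The helper below is hypothetical (it is not part of transformers or of the KBLab repository); it only builds the extra keyword arguments for .from_pretrained() and leaves revision out entirely for the default style.

```python
# Hypothetical helper: maps a transcription style to the `revision`
# argument of .from_pretrained(). None means "use the default branch".
STYLE_TO_REVISION = {
    "subtitle": "subtitle",  # most condensed
    "standard": None,        # Stage 2 default
    "strict": "strict",      # most verbatim
}


def from_pretrained_kwargs(style: str = "standard") -> dict:
    """Return extra keyword arguments for .from_pretrained() for a given style."""
    if style not in STYLE_TO_REVISION:
        raise ValueError(
            f"unknown style {style!r}; expected one of {sorted(STYLE_TO_REVISION)}"
        )
    revision = STYLE_TO_REVISION[style]
    return {} if revision is None else {"revision": revision}


print(from_pretrained_kwargs("strict"))
```

The result can then be unpacked into the call, for example AutoModelForSpeechSeq2Seq.from_pretrained(model_id, **from_pretrained_kwargs("subtitle")).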

Hugging Face

Inference example for using KB-Whisper with Hugging Face:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Pass return_timestamps=True to pipe() for output with timestamps
res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
print(res)
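With return_timestamps=True, the pipeline result carries a "chunks" list in which each entry has a "timestamp" pair (start, end) in seconds and a "text" string. As a rough sketch of turning that into a subtitle file, the functions below (hypothetical names, not part of transformers) render such chunks as SRT. The example chunks are made up for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list[dict]) -> str:
    """Render chunks of the form {"timestamp": (start, end), "text": ...} as SRT."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # the final chunk can come back without an end time
            end = start
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Made-up chunks in the shape returned with return_timestamps=True.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hej och välkomna."},
    {"timestamp": (2.5, 5.0), "text": " Nu börjar vi."},
]
print(chunks_to_srt(example_chunks))
```

In practice the input would be res["chunks"] from the pipeline call above.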

Faster-whisper

Faster-whisper provides fast and efficient inference via a reimplementation of Whisper using ctranslate2.

#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-large"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV first via ffmpeg).
# condition_on_previous_text=False can reduce hallucinations when no prompt is used.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
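The comment in the example asks for a 16 kHz mono WAV file produced with ffmpeg. One way to do that conversion from Python is sketched below; the function names are hypothetical, and the sketch assumes ffmpeg is installed and on the PATH. The choice of 16-bit PCM is an assumption, since the text only specifies 16 kHz mono WAV.

```python
import subprocess


def ffmpeg_to_wav16k_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts `src` to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", src,            # input file (any format ffmpeg can decode)
        "-ar", "16000",       # sample rate: 16 kHz
        "-ac", "1",           # one channel (mono)
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM (assumed sample format)
        dst,
    ]


def convert_to_wav16k(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav16k_cmd(src, dst), check=True)


# Show the command without running it.
print(" ".join(ffmpeg_to_wav16k_cmd("audio.mp3", "audio.wav")))
```

Calling convert_to_wav16k("audio.mp3", "audio.wav") would then produce the audio.wav file used in the transcription example.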

WhisperX

WhisperX provides a convenient method of getting accurate word level timestamps. The library combines (force aligns) the text output of Whisper with the accurate timestamps of Wav2vec2. We provide an example below of how to use KB-Whisper together with KBLab/wav2vec2-large-voxrex-swedish.

import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with KB-Whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-large", device, compute_type=compute_type, download_root="cache"  # cache_dir
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# To free GPU memory before alignment, delete the model first:
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word level timestamps after alignment
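After alignment, each segment is expected to carry a "words" list with per-word "start" and "end" times. The sketch below flattens that into a single list of (word, start, end) tuples. The dictionary layout, and the idea that words the aligner could not place come back without "start"/"end" keys, are assumptions about WhisperX's output rather than something stated here, and the example data is made up.

```python
def flatten_words(segments: list[dict]) -> list[tuple[str, float, float]]:
    """Collect (word, start, end) for every word that received timestamps."""
    words = []
    for segment in segments:
        for word in segment.get("words", []):
            # Assumption: words the aligner could not place lack start/end keys.
            if "start" in word and "end" in word:
                words.append((word["word"], word["start"], word["end"]))
    return words


# Made-up data in the assumed shape of result["segments"] after alignment.
example_segments = [
    {
        "start": 0.0,
        "end": 1.2,
        "text": " Hej 2025",
        "words": [
            {"word": "Hej", "start": 0.1, "end": 0.4, "score": 0.93},
            {"word": "2025"},  # no timestamps assigned
        ],
    }
]
print(flatten_words(example_segments))
```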

Whisper.cpp / GGML

We provide GGML checkpoints used in the apps whisper.cpp and MacWhisper. To use our model with whisper.cpp first clone the repository and build the library:

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

To use the model you need to download one of the GGML checkpoints we have uploaded. You can either press the download buttons here, or download using wget:

wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model.bin # Non-quantized version

Run inference by specifying the model path after the argument -m, along with the path to the audio file as the last positional argument.

./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav

onnx (optimum) and transformers.js usage

You can use the onnx checkpoints via Hugging Face's optimum library in the following manner:

import soundfile as sf
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-large"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

# sf.read returns (samples, sample_rate); the audio is expected to be 16 kHz mono
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))

An example of an app that runs inference locally in the browser with transformers.js and KB-Whisper can be found at https://whisper.mesu.re/ (created by Pierre Mesure). A template for setting up such an app with JavaScript can be found at https://github.com/xenova/whisper-web.

Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in 2 stages, each characterized by the application of different quality filters and thresholds for said filters.

Stage 1 employed low threshold values (0 to 0.30 BLEU depending on dataset), whereas Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of first and last 10 characters <= 0.2).
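To make the Stage 2 thresholds concrete, here is a minimal sketch of such a filter. It is not KBLab's actual pipeline: the function names are hypothetical, BLEU and weighted ROUGE-N are taken as precomputed scores in [0, 1], and the sketch assumes that "reference" is the dataset transcript, "hypothesis" is an automatic transcription of the same audio, and that the CER limit must hold for both the first and the last 10 characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def passes_stage2(reference: str, hypothesis: str, bleu: float, rouge_n: float) -> bool:
    """Apply the Stage 2 thresholds quoted above to one sample."""
    head_ok = cer(reference[:10], hypothesis[:10]) <= 0.2
    tail_ok = cer(reference[-10:], hypothesis[-10:]) <= 0.2
    return bleu >= 0.7 and rouge_n >= 0.7 and head_ok and tail_ok


# Made-up sample: the first 10 characters differ in 3 places (CER 0.3),
# so the sample is rejected despite high BLEU and ROUGE-N scores.
print(passes_stage2("hej på dig allihopa", "nej nu dig allihopa", bleu=0.9, rouge_n=0.9))
```

Checking only the first and last 10 characters is a cheap way to catch transcripts whose boundaries do not line up with the audio segment, which overall BLEU or ROUGE scores can miss.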

Dataset	Continued pretraining (h) -- Stage 1	Finetuning (h) -- Stage 2
Subtitles	34,261	3,110
Riksdag	21,949	5,119
ISOF	54	54
NST	250	250
Total	56,514	8,533

The default when loading our models through Hugging Face is Stage 2. We have, however, also uploaded the continued pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the revision in .from_pretrained(). The tag for the pretrained checkpoints can, for example, be found here: pretrained-checkpoint. The default Stage 2 model is tagged standard. We also supply a different Stage 2 checkpoint, with a more condensed style of transcribing, under the name subtitle.

Evaluation

WER

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3

BLEU Score

Model size		FLEURS	CommonVoice	NST
tiny	KBLab	76.6	73.7	74.3
	OpenAI	26.9	21.1	24.0
base	KBLab	83.2	79.9	78.3
	OpenAI	41.1	32.5	36.9
small	KBLab	86.6	83.5	79.6
	OpenAI	64.0	56.5	58.2
medium	KBLab	87.6	85.0	80.2
	OpenAI	77.1	70.1	68.9
large-v3	KBLab	89.8	87.2	81.1
	OpenAI	84.9	79.1	75.1

Acknowledgements

We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through a EuroHPC AI and Data-Intensive Applications Access call.