kb-whisper-large, an open-source Swedish speech recognition model trained on more than 50,000 hours of data to reduce word error rate

KB-Whisper Large

Developed by KBLab

A Swedish speech recognition model released by the National Library of Sweden and based on the Whisper architecture. It was trained on more than 50,000 hours of data, which significantly reduces the word error rate.

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Swedish speech recognition #Low word error rate #Multi-format support

Downloads 8,880

Release Time: 2/14/2025

Model Overview

A speech recognition model optimized for Swedish and based on the OpenAI Whisper architecture. It performs strongly on multiple Swedish datasets.

Model Features

Significantly reduced word error rate

Compared with the original OpenAI model, it reduces the word error rate (WER) by an average of 47% in Swedish recognition.

Multi-format support

Provides model checkpoints in multiple formats, including Hugging Face, whisper.cpp (GGML), ONNX, and CTranslate2.

Multiple transcription styles

Provides three transcription style versions: subtitle version (concise), standard version (default), and strict version (verbatim).

Large-scale training data

Trained on over 50,000 hours of Swedish speech data, with training conducted in two quality stages.

Model Capabilities

Swedish speech recognition

Speech transcription with timestamps

Multi-format inference support

Batch speech transcription

Use Cases

Speech transcription

Meeting record transcription

Convert Swedish meeting recordings into text records.

High-accuracy transcribed text

Subtitle generation

Generate subtitles for Swedish video content.

Subtitle files with timestamps

Speech analysis

Speech content analysis

Analyze Swedish speech content for subsequent processing.

Structured text data

🚀 KB-Whisper Large

The National Library of Sweden has released a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our top-performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The performance of smaller Whisper model sizes on Swedish speech has also improved significantly, with kb-whisper-small outperforming openai/whisper-large-v3 (a model six times its size).

🚀 Quick Start

The following sections provide detailed information about the model, including its performance, usage examples, training data, and evaluation results.

✨ Features

Improved Performance: Reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3 in evaluations across multiple datasets.
Multiple Model Sizes: Smaller Whisper model sizes also show substantial performance improvements on Swedish speech.
Diverse Usage: Can be used with various libraries such as Hugging Face Transformers, faster-whisper, WhisperX, whisper.cpp, and ONNX (optimum).


💻 Usage Examples

Basic Usage

Hugging Face

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True to the call below for output with timestamps
res = pipe("audio.mp3",
           chunk_length_s=30,
           generate_kwargs=generate_kwargs)
print(res)
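
With return_timestamps=True, the pipeline result also carries a "chunks" list whose items hold a "timestamp" (start, end) pair and a "text" string. For the subtitle use case, a minimal sketch of turning such chunks into SRT text could look like the following. The helper names format_srt_time and chunks_to_srt and the example chunks are illustrative assumptions, not part of the model or of transformers.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Build SRT text from pipeline-style chunks (hypothetical helper)."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # The final chunk may come back without an end time;
            # fall back to the start time in that case.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written chunks in the shape returned when return_timestamps=True
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hej och välkommen."},
    {"timestamp": (2.5, 5.04), "text": " Det här är ett test."},
]
print(chunks_to_srt(example_chunks))
```

In practice the input would be res["chunks"] from the pipeline call above, run with return_timestamps=True.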

Advanced Usage

faster-whisper

#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-large"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV first via ffmpeg).
# condition_on_previous_text=False can reduce hallucinations when no prompt is used.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
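
The comment above asks for a 16 kHz mono WAV file. One possible way to produce it from Python is to call ffmpeg through subprocess, as sketched below. This assumes ffmpeg is installed and on the PATH; the function names and file names are illustrative.

```python
import subprocess


def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Return an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it exists
        "-i", src,       # input file (any format ffmpeg can decode)
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # downmix to one channel
        "-c:a", "pcm_s16le",
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


# Example (not executed here): convert_to_wav("audio.mp3", "audio.wav")
```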

WhisperX

import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with KB-Whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-large", device, compute_type=compute_type, download_root="cache"  # cache_dir
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# delete model if low on GPU resources
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word level timestamps after alignment
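
After alignment, each segment carries a "words" list with per-word "start" and "end" times. Words that the alignment model cannot place (numerals, for instance) may come back without timing keys, so a consumer should not assume they are present. The sketch below flattens the aligned words into tuples; the helper name and the example data are illustrative assumptions.

```python
def aligned_words(segments):
    """Collect (word, start, end) tuples, skipping words without timings."""
    words = []
    for segment in segments:
        for word in segment.get("words", []):
            if "start" in word and "end" in word:
                words.append((word["word"], word["start"], word["end"]))
    return words


# Hand-written data in the shape of WhisperX aligned segments
example_segments = [
    {
        "start": 0.0,
        "end": 1.4,
        "text": " Hej 2014 allihopa",
        "words": [
            {"word": "Hej", "start": 0.0, "end": 0.4, "score": 0.9},
            {"word": "2014"},  # no timing was assigned to this word
            {"word": "allihopa", "start": 0.8, "end": 1.4, "score": 0.8},
        ],
    }
]

for word, start, end in aligned_words(example_segments):
    print(f"{start:.2f}-{end:.2f} {word}")
```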

Whisper.cpp / GGML

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release

wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model.bin # Non-quantized version

# whisper-cli defaults to English; -l sv sets the language to Swedish explicitly
./build/bin/whisper-cli -m ggml-model-q5_0.bin -l sv ../audio.wav

onnx (optimum) and transformers.js usage

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-large"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

import soundfile as sf

# sf.read returns (samples, sample_rate); the model expects 16 kHz mono audio
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))

📚 Documentation

Model Information

Property	Details
Model Type	Automatic Speech Recognition
Training Data	Over 50,000 hours of Swedish audio with text transcriptions

Training Data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. Training ran in two stages, each applying different quality filters with different thresholds.

Stage 1 used low thresholds (0 to 0.30 BLEU, depending on the dataset), whereas Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, and CER of the first and last 10 characters <= 0.2).
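
To make the Stage 2 thresholds concrete, here is a rough sketch of such a filter. It is not the authors' pipeline: the text above does not say which two texts are compared, so the sketch assumes a reference transcript and a machine transcription, takes BLEU and ROUGE-N as precomputed inputs, and computes CER as character edit distance divided by reference length. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def passes_stage2(reference: str, hypothesis: str, bleu: float,
                  rouge_n: float, n_chars: int = 10) -> bool:
    """Apply the Stage 2 thresholds quoted above (assumed interpretation)."""
    head_cer = cer(reference[:n_chars], hypothesis[:n_chars])
    tail_cer = cer(reference[-n_chars:], hypothesis[-n_chars:])
    return bleu >= 0.7 and rouge_n >= 0.7 and head_cer <= 0.2 and tail_cer <= 0.2


reference = "det var en gång en liten flicka"
print(passes_stage2(reference, reference, bleu=0.8, rouge_n=0.8))
print(passes_stage2(reference, "och " + reference, bleu=0.8, rouge_n=0.8))
```

The head and tail checks catch cases where the transcript and the audio segment are misaligned at the boundaries, even when the overall BLEU and ROUGE-N scores are high. In the second call the extra leading word pushes the head CER above 0.2, so the pair is rejected.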

Dataset	Continued pretraining (h) -- Stage 1	Finetuning (h) -- Stage 2
Subtitles	34,261	3,110
Riksdag	21,949	5,119
ISOF	54	54
NST	250	250
Total	56,514	8,533

Loading our models through Hugging Face returns the Stage 2 checkpoint by default. We have also uploaded the continued pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the revision in .from_pretrained(). The tag for the pretrained checkpoints can, for example, be found here: pretrained-checkpoint. The default Stage 2 model is tagged standard. We also supply a different Stage 2 checkpoint, with a more condensed transcription style, under the tag subtitle.
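
As a sketch of the revision mechanism described above, the function below passes a tag name to from_pretrained() through its revision parameter. The wrapper name load_kb_whisper is an illustrative assumption; the tag names standard and subtitle are the ones named in the text. The transformers import sits inside the function so that defining it does not require the library or a download.

```python
def load_kb_whisper(revision: str = "standard",
                    model_id: str = "KBLab/kb-whisper-large"):
    """Load the model and processor at a tagged revision (hypothetical wrapper)."""
    # Imported lazily; calling this function downloads the checkpoint.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, revision=revision, cache_dir="cache"
    )
    processor = AutoProcessor.from_pretrained(
        model_id, revision=revision, cache_dir="cache"
    )
    return model, processor


# Example (not executed here): the condensed subtitle-style checkpoint
# model, processor = load_kb_whisper(revision="subtitle")
```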

Evaluation

WER (%, lower is better)

Model size	Source	FLEURS	CommonVoice	NST
tiny	KBLab	13.2	12.9	11.2
	OpenAI	59.2	67.8	85.2
base	KBLab	9.1	8.7	7.8
	OpenAI	39.6	52.1	53.4
small	KBLab	7.3	6.4	6.6
	OpenAI	20.6	26.4	26.4
medium	KBLab	6.6	5.4	5.8
	OpenAI	12.1	15.8	17.1
large-v3	KBLab	5.4	4.1	5.2
	OpenAI	7.8	9.5	11.3
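
The 47% average WER reduction quoted for the large model can be checked against the large-v3 rows of this table. The calculation below assumes the figure is the unweighted mean of the per-dataset relative reductions, which the text does not state explicitly.

```python
# WER values from the large-v3 rows of the table above
wer = {
    "FLEURS": {"KBLab": 5.4, "OpenAI": 7.8},
    "CommonVoice": {"KBLab": 4.1, "OpenAI": 9.5},
    "NST": {"KBLab": 5.2, "OpenAI": 11.3},
}

# Relative reduction per dataset, in percent of the OpenAI WER
reductions = {
    name: 100 * (scores["OpenAI"] - scores["KBLab"]) / scores["OpenAI"]
    for name, scores in wer.items()
}
average = sum(reductions.values()) / len(reductions)

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}%")
print(f"Average: {average:.1f}%")  # Average: 47.2%
```

Under that assumption the per-dataset reductions are about 30.8%, 56.8% and 54.0%, so the gain is smallest on FLEURS and largest on CommonVoice.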

BLEU Score (higher is better)

Model size	Source	FLEURS	CommonVoice	NST
tiny	KBLab	76.6	73.7	74.3
	OpenAI	26.9	21.1	24.0
base	KBLab	83.2	79.9	78.3
	OpenAI	41.1	32.5	36.9
small	KBLab	86.6	83.5	79.6
	OpenAI	64.0	56.5	58.2
medium	KBLab	87.6	85.0	80.2
	OpenAI	77.1	70.1	68.9
large-v3	KBLab	89.8	87.2	81.1
	OpenAI	84.9	79.1	75.1


📄 License

The model is licensed under the Apache-2.0 license.
