KB-Whisper Large
The National Library of Sweden has released a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our top-performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The smaller Whisper model sizes have also improved significantly on Swedish speech, with kb-whisper-small outperforming openai/whisper-large-v3 (a model six times its size).
Quick Start
The following sections provide detailed information about the model, including its performance, usage examples, training data, and evaluation results.
Features
- Improved Performance: Reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3 in evaluations across FLEURS, CommonVoice and NST.
- Multiple Model Sizes: The smaller model sizes (tiny, base, small, medium) also show substantial improvements on Swedish speech.
- Diverse Usage: Works with Hugging Face Transformers, faster-whisper, WhisperX, whisper.cpp and ONNX (optimum).
Usage Examples
Basic Usage
Hugging Face
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise fall back to CPU and float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe Swedish audio, processing it in 30-second chunks.
generate_kwargs = {"task": "transcribe", "language": "sv"}
res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
print(res)
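The faster-whisper and WhisperX examples in this README print per-segment timestamps. A rough equivalent for the Hugging Face pipeline is sketched below, assuming the pipeline is called with return_timestamps=True so that the result carries a "chunks" list. The helper name format_chunks and the sample dictionary are illustrative only and are not part of the original README.

```python
def format_chunks(result: dict) -> list:
    """Turn an ASR pipeline result with timestamps into printable lines.

    Assumes the shape produced with return_timestamps=True:
    {"text": ..., "chunks": [{"timestamp": (start, end), "text": ...}, ...]}
    """
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time of the final chunk can be missing (None).
        end_text = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s -> {end_text}] {chunk['text'].strip()}")
    return lines


# Hand-written stand-in for a real pipeline result (illustrative data only).
sample = {
    "text": " Hej världen. Hur mår du?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hej världen."},
        {"timestamp": (1.5, None), "text": " Hur mår du?"},
    ],
}

for line in format_chunks(sample):
    print(line)
```

With the pipe object from the example above, a real result could be obtained with pipe("audio.mp3", chunk_length_s=30, return_timestamps=True, generate_kwargs=generate_kwargs) and passed to format_chunks.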
Advanced Usage
faster-whisper
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-large"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",
)

# `segments` is a lazy generator: transcription runs as the loop below consumes it.
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
WhisperX
import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU memory
compute_type = "float16"

# 1. Transcribe with KB-Whisper (batched).
model = whisperx.load_model(
    "KBLab/kb-whisper-large", device, compute_type=compute_type, download_root="cache"
)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# 2. Align the Whisper output using a Swedish wav2vec2 model.
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)
print(result["segments"])  # after alignment
Whisper.cpp / GGML
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model.bin # Non-quantized version
./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav
ONNX (optimum) and transformers.js usage
import soundfile as sf
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "KBLab/kb-whisper-large"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

# sf.read returns (samples, sample_rate). The feature extractor expects 16 kHz audio
# and raises an error if a different sampling rate is passed.
audio, sample_rate = sf.read("audio.wav")
inputs = processor.feature_extractor(audio, sampling_rate=sample_rate, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))
Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | Automatic Speech Recognition |
| Training Data | Over 50,000 hours of Swedish audio with text transcriptions |
Training Data
Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and different thresholds for those filters.
Stage 1 employed low threshold values (0 to 0.30 BLEU depending on dataset), whereas Stage 2 used stricter thresholds (BLEU >= 0.7, weighted ROUGE-N >= 0.7, CER of first and last 10 characters <= 0.2).
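To make the Stage 2 thresholds concrete, here is a minimal sketch of how such a filter could be expressed. It is not the actual training pipeline. It assumes each sample comes with a reference text, a second text to compare against, and precomputed BLEU and ROUGE-N scores on a 0 to 1 scale, and it assumes the CER limit applies separately to the first and the last 10 characters. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def passes_stage2(reference: str, hypothesis: str, bleu: float, rouge_n: float,
                  n_chars: int = 10) -> bool:
    """Apply the Stage 2 thresholds quoted above (assumed interpretation)."""
    head_cer = cer(reference[:n_chars], hypothesis[:n_chars])
    tail_cer = cer(reference[-n_chars:], hypothesis[-n_chars:])
    return bleu >= 0.7 and rouge_n >= 0.7 and head_cer <= 0.2 and tail_cer <= 0.2


reference = "det här är ett test av filtret"
print(passes_stage2(reference, reference, bleu=0.9, rouge_n=0.8))
print(passes_stage2(reference, "xxx här är ett test av filtret", bleu=0.9, rouge_n=0.8))
```

Checking the first and last characters separately is a cheap way to catch transcripts that start or end at the wrong place in the audio even when the overall BLEU score is high.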
| Dataset | Stage 1: continued pretraining (h) | Stage 2: finetuning (h) |
|---|---|---|
| Subtitles | 34,261 | 3,110 |
| Riksdag | 21,949 | 5,119 |
| ISOF | 54 | 54 |
| NST | 250 | 250 |
| Total | 56,514 | 8,533 |
The default when loading our models through Hugging Face is Stage 2. We have, however, also uploaded the continued pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the revision in .from_pretrained(). For example, the continued pretraining checkpoints are available under the tag pretrained-checkpoint. The default Stage 2 model is tagged standard. We also supply a different Stage 2 checkpoint, with a more condensed style of transcribing, under the tag subtitle.
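As a sketch of what selecting a checkpoint by tag might look like, the helper below passes the revision argument through to from_pretrained(). The tag names are taken from the paragraph above. The helper name load_kb_whisper and its guard against unknown tags are conveniences of this example, and the repository may contain other tags that the guard would wrongly reject.

```python
# Tags mentioned in this README; assumed to exist in the model repository.
KNOWN_TAGS = ("standard", "subtitle", "pretrained-checkpoint")


def load_kb_whisper(revision: str = "standard",
                    model_id: str = "KBLab/kb-whisper-large"):
    """Load model and processor for a given tag (hypothetical helper)."""
    if revision not in KNOWN_TAGS:
        raise ValueError(f"Unknown tag {revision!r}; expected one of {KNOWN_TAGS}")

    # Imported lazily so the tag check above works without transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, revision=revision, cache_dir="cache"
    )
    processor = AutoProcessor.from_pretrained(
        model_id, revision=revision, cache_dir="cache"
    )
    return model, processor
```

For instance, load_kb_whisper("subtitle") would fetch the checkpoint with the more condensed transcription style, while calling it without arguments fetches the default Stage 2 model.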
Evaluation
WER
| Model size | | FLEURS | CommonVoice | NST |
|---|---|---|---|---|
| tiny | KBLab | 13.2 | 12.9 | 11.2 |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| base | KBLab | 9.1 | 8.7 | 7.8 |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| small | KBLab | 7.3 | 6.4 | 6.6 |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| medium | KBLab | 6.6 | 5.4 | 5.8 |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| large-v3 | KBLab | 5.4 | 4.1 | 5.2 |
| | OpenAI | 7.8 | 9.5 | 11.3 |
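The headline 47% figure can be reproduced from the WER numbers above. The sketch below assumes the figure is the unweighted mean of the per-dataset relative WER reductions for the large-v3 size, since the README does not state how the average was taken. Under that assumption the result is about 47.2%. The same data also confirms that kb-whisper-small has a lower WER than OpenAI's large-v3 on all three datasets.

```python
# WER values (%) copied from the table above.
kb_large = {"FLEURS": 5.4, "CommonVoice": 4.1, "NST": 5.2}
openai_large = {"FLEURS": 7.8, "CommonVoice": 9.5, "NST": 11.3}
kb_small = {"FLEURS": 7.3, "CommonVoice": 6.4, "NST": 6.6}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error rate relative to `baseline`."""
    return (baseline - improved) / baseline


reductions = {
    name: relative_reduction(openai_large[name], kb_large[name]) for name in kb_large
}
average_reduction = sum(reductions.values()) / len(reductions)

# kb-whisper-small versus OpenAI's large-v3, dataset by dataset.
small_beats_large = all(kb_small[name] < openai_large[name] for name in kb_small)

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower WER")
print(f"Average reduction: {average_reduction:.1%}")
print(f"kb-whisper-small beats OpenAI large-v3 everywhere: {small_beats_large}")
```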
BLEU Score
| Model size | | FLEURS | CommonVoice | NST |
|---|---|---|---|---|
| tiny | KBLab | 76.6 | 73.7 | 74.3 |
| | OpenAI | 26.9 | 21.1 | 24.0 |
| base | KBLab | 83.2 | 79.9 | 78.3 |
| | OpenAI | 41.1 | 32.5 | 36.9 |
| small | KBLab | 86.6 | 83.5 | 79.6 |
| | OpenAI | 64.0 | 56.5 | 58.2 |
| medium | KBLab | 87.6 | 85.0 | 80.2 |
| | OpenAI | 77.1 | 70.1 | 68.9 |
| large-v3 | KBLab | 89.8 | 87.2 | 81.1 |
| | OpenAI | 84.9 | 79.1 | 75.1 |
License
The model is licensed under the apache-2.0 license.