distil-whisper-german
This model is a German speech recognition model based on the distil-whisper technique. It has 756M parameters and a size of 1.51 GB in bfloat16 format. As a follow-up to Whisper large v3 german, we created a distilled version for faster inference with minimal quality loss.
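The stated size follows from the parameter count, since bfloat16 stores each weight in 2 bytes. A quick check, assuming decimal gigabytes (1 GB = 10^9 bytes) and ignoring any non-weight overhead in the checkpoint file:

```python
# Back-of-the-envelope check of the checkpoint size quoted above.
PARAMS = 756_000_000   # parameter count from the model card
BYTES_PER_PARAM = 2    # bfloat16 uses 16 bits per value


def checkpoint_size_gb(params: int, bytes_per_param: int) -> float:
    """Return the raw weight size in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


size_gb = checkpoint_size_gb(PARAMS, BYTES_PER_PARAM)
print(f"{size_gb:.2f} GB")  # prints "1.51 GB"
```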
Quick Start
The model is intended for German speech recognition tasks. It can serve as a local transcription service or be integrated into a larger speech recognition pipeline. With roughly half the parameters of the large model, it still offers good quality for most tasks. Combined with optimization toolkits such as TensorRT, its low latency makes it suitable for real-time applications.
Features
- Fast Inference: A distilled version for quicker results with minimal quality loss.
- Good Quality: Despite having fewer parameters, it maintains high-quality performance for most German speech recognition tasks.
- Low Latency: Suitable for real-time applications when optimized.
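Whether latency is low enough for a real-time application can be judged with the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean transcription finishes faster than the audio plays. The helper below is a minimal sketch; `real_time_factor`, `measure_rtf`, and the `transcribe` callable are hypothetical names for illustration and are not part of the model or of `transformers`.

```python
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """Time a single call to `transcribe` and convert it to a real-time factor.

    `transcribe` is a hypothetical zero-argument callable, for example
    `lambda: pipe(sample)` once a pipeline and an audio sample exist.
    """
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)
```

In practice, the first call often includes warm-up costs, so averaging over several runs tends to give a more representative figure.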
Installation
No dedicated installation steps are provided. The usage example below imports torch, transformers, and datasets, so these Python packages need to be available.
Usage Examples
Basic Usage
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise fall back to CPU and float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "primeline/distil-whisper-large-v3-german"

# Load the model weights and move them to the selected device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

# Build a speech recognition pipeline that splits long audio into 30-second chunks.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Note: this sample dataset contains English audio; substitute German audio for meaningful output.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
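Because the pipeline above is created with `return_timestamps=True`, the result is expected to carry a `chunks` list next to `text`, where each chunk holds a `(start, end)` timestamp tuple in seconds and the chunk text. The following sketch formats such a result for display. `format_timestamp` and `format_chunks` are hypothetical helpers, `example_result` is a hand-written stand-in rather than real model output, and the handling of `None` assumes that a timestamp can be left open when a segment is cut off.

```python
from typing import List, Optional


def format_timestamp(seconds: Optional[float]) -> str:
    """Render seconds as MM:SS.ss; an open timestamp (None) becomes '??:??'."""
    if seconds is None:
        return "??:??"
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunks(result: dict) -> List[str]:
    """Turn the `chunks` entries of a pipeline result into printable lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        span = f"{format_timestamp(start)} - {format_timestamp(end)}"
        lines.append(f"[{span}] {chunk['text'].strip()}")
    return lines


# Hand-written stand-in for a pipeline result (not real model output).
example_result = {
    "text": " Guten Morgen. Wie geht es Ihnen?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Guten Morgen."},
        {"timestamp": (1.5, None), "text": " Wie geht es Ihnen?"},
    ],
}

for line in format_chunks(example_result):
    print(line)
```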
Documentation
Dataset
The training dataset is a filtered subset of Common Voice, Multilingual LibriSpeech, and some internal data. The data was carefully filtered and double-checked for quality and correctness. The text was normalized, especially with respect to casing and punctuation.
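The exact normalization rules are not documented here. Purely as an illustration, a casing and punctuation normalizer for German transcripts might look like the sketch below. `normalize_transcript` is a hypothetical helper, and lowercasing plus punctuation removal is an assumption; the actual training pipeline may instead have preserved or restored casing and punctuation.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (illustrative only)."""
    # Compose characters so that umlauts are single code points.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # \w is Unicode-aware for str patterns, so ä, ö, ü, and ß are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the removed punctuation.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("Guten Morgen, Herr Müller!  Schönen Tag."))
# prints "guten morgen herr müller schönen tag"
```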
Model family
| Property | Details |
|---|---|
| Model Type | German Speech Recognition |
| Training Data | A filtered subset of the Common Voice dataset, Multilingual LibriSpeech, and some internal data |
| Model | Parameters | Link |
|---|---|---|
| Whisper large v3 german | 1.54B | link |
| Whisper large v3 turbo german | 809M | link |
| Distil-whisper large v3 german | 756M | link |
| tiny whisper | 37.8M | link |
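To put the parameter counts in proportion, the short sketch below expresses each model as a fraction of Whisper large v3 german. The numbers are copied from the table; `relative_size` is a hypothetical helper name.

```python
# Parameter counts as listed in the model family table.
MODEL_PARAMS = {
    "Whisper large v3 german": 1.54e9,
    "Whisper large v3 turbo german": 809e6,
    "Distil-whisper large v3 german": 756e6,
    "tiny whisper": 37.8e6,
}


def relative_size(name: str, baseline: str = "Whisper large v3 german") -> float:
    """Return the parameter count of `name` as a fraction of `baseline`."""
    return MODEL_PARAMS[name] / MODEL_PARAMS[baseline]


for name in MODEL_PARAMS:
    print(f"{name}: {relative_size(name):.1%} of the large model")
```

The distilled model comes out at roughly 49% of the large model, which matches the "half the parameters" description.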
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- total_train_batch_size: 512
- num_epochs: 5.0
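Only the total batch size is reported. In a typical multi-device setup it is the product of the per-device batch size, the number of devices, and the gradient accumulation steps. The split below is purely hypothetical, since the actual device count and accumulation steps are not stated; it only shows one combination that yields 512.

```python
def total_train_batch_size(per_device: int, num_devices: int, grad_accum_steps: int) -> int:
    """Effective batch size per optimizer step."""
    return per_device * num_devices * grad_accum_steps


# Hypothetical split, NOT taken from the model card.
hypothetical_setup = {"per_device": 32, "num_devices": 8, "grad_accum_steps": 2}

total = total_train_batch_size(**hypothetical_setup)
print(total)  # prints 512
```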
Framework versions
- Transformers 4.39.3
- PyTorch 2.3.0a0+ebedce2
- Datasets 2.18.0
- Tokenizers 0.15.2
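To compare a local environment against these versions, the standard-library `importlib.metadata` module can be queried as sketched below. The package names are the usual PyPI distribution names (an assumption for `torch`, whose listed version is a pre-release build string that a regular install is unlikely to match exactly); `installed_version` and `compare_versions` are hypothetical helpers.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions listed above, keyed by assumed PyPI distribution name.
TRAINING_VERSIONS = {
    "transformers": "4.39.3",
    "torch": "2.3.0a0+ebedce2",
    "datasets": "2.18.0",
    "tokenizers": "0.15.2",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package to a (training version, installed version) pair."""
    return {pkg: (want, installed_version(pkg)) for pkg, want in expected.items()}


for pkg, (want, have) in compare_versions(TRAINING_VERSIONS).items():
    status = "match" if have == want else "differs"
    print(f"{pkg}: trained with {want}, installed {have} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the listing only records what was used during training.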
Technical Details
The model is a distilled version of Whisper large v3 german that aims for faster inference with minimal quality loss. It was trained on a filtered, quality-checked dataset using the hyperparameters listed under Training hyperparameters.
License
This model is published under the Apache 2.0 license.
About us

Your partner for AI infrastructure in Germany. Experience the powerful AI infrastructure that drives your ambitions in Deep Learning, Machine Learning & High-Performance Computing. Optimized for AI training and inference.
Model author: Florian Zimmermeister
Important Note
This model is not a product of the primeLine Group. It represents research conducted by Florian Zimmermeister, with computing power sponsored by primeLine. The model is published under this account by primeLine, but it is not a commercial product of primeLine Solutions GmbH. Please be aware that while we have tested and developed this model to the best of our abilities, errors may still occur. Use of this model is at your own risk. We do not accept liability for any incorrect outputs generated by this model.