Fine-tuned whisper-medium model for ASR in German
This model is a fine-tuned version of openai/whisper-medium, trained on the German dataset from mozilla-foundation/common_voice_11_0. It performs Automatic Speech Recognition (ASR) in German. Speech input must be sampled at 16 kHz. The model also predicts casing and punctuation.

This model is bofenghuang/whisper-medium-cv11-german converted to the CTranslate2 format.
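Because the weights are in CTranslate2 format, they can be loaded by a CTranslate2-based runtime. The sketch below assumes the faster-whisper package, which this document does not mention, and a hypothetical local directory `model_dir` that holds the converted files. `transcribe_german` is an illustrative helper name, not part of any library.

```python
def transcribe_german(audio_path: str, model_dir: str) -> str:
    """Transcribe one audio file with a CTranslate2 Whisper model.

    Assumptions (not confirmed by the model card): faster-whisper is
    installed and `model_dir` points at the converted model files.
    """
    # Imported lazily so this module loads even without faster-whisper.
    from faster_whisper import WhisperModel

    # int8 on CPU keeps memory use low; pick another compute type if needed.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")

    # Force German transcription instead of relying on language detection.
    segments, _info = model.transcribe(audio_path, language="de", task="transcribe")

    # `segments` is a generator; joining it runs the actual decoding.
    return " ".join(segment.text.strip() for segment in segments)
```

A call such as `transcribe_german("sample.wav", "path/to/converted-model")` would return the recognized text as one string; both paths here are placeholders.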
Quick Start
This model is designed for Automatic Speech Recognition in German. Make sure your speech input is sampled at 16 kHz.
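One way to verify the 16 kHz requirement before transcription is to read the sample rate from the file header. The sketch below is a minimal check for PCM WAV files using only the Python standard library; the helper names `wav_sample_rate` and `needs_resampling` are illustrative, and other formats (such as the MP3 clips in Common Voice) would need a different reader.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per this model card


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns `True`, resample the audio to 16 kHz before passing it to the model.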
Features
- Fine-tuned: Based on openai/whisper-medium and fine-tuned on the German dataset of mozilla-foundation/common_voice_11_0.
- Casing and punctuation: The model predicts casing and punctuation in the recognized text.
Documentation
Performance
Below are the WERs of the pre-trained models on Common Voice 9.0. These results are reported in the original paper.
Below are the WERs of the fine-tuned models on Common Voice 11.0.
Model Index
- Name: Fine-tuned whisper-medium model for ASR in German
- Results:
  - Task:
    - Name: Automatic Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice 11.0
    - Type: mozilla-foundation/common_voice_11_0
    - Config: de
    - Split: test
    - Args: de
  - Metrics:
    - Name: WER (Greedy)
    - Type: wer
    - Value: 7.05
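For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal pure-Python illustration, not the evaluation script behind the reported value. It does no text normalization, so differences in casing or punctuation count as errors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# One of five reference words is missing from the hypothesis.
wer = word_error_rate("Das ist ein kleiner Test", "Das ist ein Test")
print(f"{wer:.2%}")  # prints 20.00%
```

Multiplying the returned fraction by 100 gives the percentage form in which WER is usually reported.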
Usage Examples
Basic Usage
Inference with 🤗 Pipeline
import torch
from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the ASR pipeline and force German transcription
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-german", device=device)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="de", task="transcribe")

# Stream one example from the Common Voice 11.0 German test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Cap the generated sequence length, then transcribe
pipe.model.config.max_length = 225 + 1
generated_sentences = pipe(waveform)["text"]
Inference with 🤗 low-level APIs
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the model and processor, and force German transcription
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-german", language="german", task="transcribe")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")

# Sample rate the model expects (16 kHz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Stream one example from the Common Voice 11.0 German test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample only when the input rate differs from the model's rate
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features and move them to the model's device
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate token IDs and decode them to text
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
License
This model is released under the Apache-2.0 license.