Fine-tuned whisper-small model for ASR in German
This model is a fine-tuned version of openai/whisper-small, trained on the German subset of the mozilla-foundation/common_voice_11_0 dataset. Speech input must be sampled at 16 kHz when using this model. The model also predicts casing and punctuation.
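Because the model expects 16 kHz input, it can help to check a recording's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, so it assumes uncompressed PCM WAV files. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or of any library.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz, the rate this model card asks for


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected: int = EXPECTED_SAMPLE_RATE) -> bool:
    """Return True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected
```

If `needs_resampling` returns `True`, the audio should be resampled to 16 kHz before it is passed to the model, for example with `torchaudio.transforms.Resample` as in the low-level API example.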

Quick Start
This fine-tuned model is designed for Automatic Speech Recognition (ASR) in German. It can be used through the transformers pipeline or through the lower-level model and processor APIs.
Features
- Fine-tuned on a German dataset: Trained on the German subset of mozilla-foundation/common_voice_11_0 for better German ASR performance.
- Predict casing and punctuation: Capable of predicting casing and punctuation in the recognized text.
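Since the model emits cased, punctuated text, downstream code that expects plain lowercase tokens (for example, keyword matching) may want to normalize the output first. The sketch below is one simple way to do that with the standard library. `normalize_transcript` is a hypothetical helper, and the rule of simply deleting punctuation is an assumption, not something the model card prescribes.

```python
import re


def normalize_transcript(text: str) -> str:
    """Lowercase the text, delete punctuation, and collapse whitespace."""
    text = text.lower()
    # \w is Unicode-aware in Python 3, so umlauts and ß are kept.
    # Punctuation is deleted outright, so "geht's" becomes "gehts".
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())
```

For example, `normalize_transcript("Hallo, Welt!")` yields `"hallo welt"`.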
Installation
No specific installation steps are provided in the original README. However, you need the Python libraries torch, datasets, and transformers to use this model; the low-level API example additionally imports torchaudio. You can install them using pip:
pip install torch torchaudio datasets transformers
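To confirm that the environment has what the usage examples import, a small check like the following can be run first. This is a sketch using only the standard library. `missing_packages` is a hypothetical helper, and the package list is taken from the imports in the examples in this README.

```python
from importlib.util import find_spec

# Top-level packages imported by the usage examples in this README
REQUIRED_PACKAGES = ("torch", "torchaudio", "datasets", "transformers")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_packages()` returns an empty list when everything is installed; any names it returns can be passed to `pip install`.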
Usage Examples
Basic Usage
Inference with 🤗 Pipeline
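import torch
from datasets import load_dataset
from transformers import pipeline

# Use the first GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load the fine-tuned model into an ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-small-cv11-german", device=device)
# Force German transcription (rather than translation)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="de", task="transcribe")
# Stream a single example from the Common Voice 11.0 German test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]
# Limit the length of the generated token sequence
pipe.model.config.max_length = 225 + 1
# Transcribe the audio
generated_sentences = pipe(waveform)["text"]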
Advanced Usage
Inference with 🤗 low-level APIs
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Use the first GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load the fine-tuned model and its processor (feature extractor + tokenizer)
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-small-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-small-cv11-german", language="german", task="transcribe")
# Force German transcription (rather than translation)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")
# Sample rate the model expects (16 kHz)
model_sample_rate = processor.feature_extractor.sampling_rate
# Stream a single example from the Common Voice 11.0 German test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]
# Resample only when the audio does not already match the model's rate
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)
# Convert the waveform into the input features the model consumes
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)
# Generate token IDs and decode them into text
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Documentation
Performance
Below are the WERs of the pre-trained models on Common Voice 9.0. These results are reported in the original paper.
Below are the WERs of the fine-tuned models on Common Voice 11.0.
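For readers unfamiliar with the metric, WER (word error rate) is conventionally the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below implements that standard definition in plain Python. It is illustrative only: this README does not state which tool or text normalization produced the reported numbers, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1        # reference word missing
            insertion = current_row[j - 1] + 1    # extra hypothesis word
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

Note that casing and punctuation count as differences here, so `"Test."` and `"test"` are treated as different words unless both strings are normalized first. WER can also exceed 1.0 when the hypothesis contains many inserted words.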
License
This model is licensed under the Apache 2.0 license.