---
license: apache-2.0
language: de
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Fine-tuned whisper-large-v2 model for ASR in German
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: de
      split: test
      args: de
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 5.76
---

# Fine-tuned whisper-large-v2 model for ASR in German

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), trained on the mozilla-foundation/common_voice_11_0 German (`de`) dataset. When using the model, make sure that your speech input is sampled at 16 kHz. The model also predicts casing and punctuation.
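If your recordings use a different sampling rate, they can be resampled to 16 kHz beforehand. A minimal sketch with torchaudio (the file name `audio.wav` is a placeholder for your own recording):

```python
import torchaudio

# Hypothetical input file; replace with your own recording
waveform, sample_rate = torchaudio.load("audio.wav")

# Resample to the 16 kHz rate expected by Whisper
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
```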
## Performance

Below are the WERs of the pre-trained Whisper models on Common Voice 9.0, as reported in the original paper.

Below are the WERs of the fine-tuned models on Common Voice 11.0.
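For reference, a WER figure like the one reported above can be computed with the 🤗 `evaluate` library. A minimal sketch (the reference and hypothesis strings here are made-up examples, not from the evaluation):

```python
from evaluate import load

wer_metric = load("wer")

# Toy example: a real evaluation iterates over the whole test split
references = ["guten morgen wie geht es dir"]
predictions = ["guten morgen wie geht es der"]

# Fraction of word-level substitutions, insertions, and deletions
wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")
```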
## Usage

### Inference with 🤗 Pipeline
```python
import torch
from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the ASR pipeline with the fine-tuned model
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-large-v2-cv11-german", device=device)

# Force German transcription
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="de", task="transcribe")

# Stream one example from the Common Voice 11.0 German test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Cap generation at 225 tokens (+1 for the decoder start token)
pipe.model.config.max_length = 225 + 1

generated_sentences = pipe(waveform)["text"]
```
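For audio longer than 30 seconds, the pipeline can also transcribe in chunks. A sketch, assuming a local file `long_audio.wav` that you supply:

```python
# Chunked long-form transcription: chunk_length_s splits the audio into
# 30-second windows, stride_length_s overlaps them to smooth the joins
long_result = pipe("long_audio.wav", chunk_length_s=30, stride_length_s=5)
print(long_result["text"])
```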
### Inference with 🤗 low-level APIs
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned model and its processor
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-large-v2-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-large-v2-cv11-german", language="german", task="transcribe")

# Force German transcription
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")

# Whisper expects 16 kHz input
model_sample_rate = processor.feature_extractor.sampling_rate

# Stream one example from the Common Voice 11.0 German test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample if the audio is not already at the model's sampling rate
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract log-Mel input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Greedy decoding, then strip special tokens
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
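The 5.76 WER above was obtained with greedy decoding; beam search often yields a small additional improvement at the cost of slower inference. A sketch (the value `num_beams=5` is an arbitrary choice, not from the original evaluation):

```python
# Beam search decoding: keeps several hypotheses at each step instead of
# only the single most likely token; num_beams=5 is an arbitrary choice
generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```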