whisper-medium-cv11-german-ct2 Open-Source Model - Accurate Automatic German Speech Recognition

Whisper Medium Cv11 German Ct2

Developed by mkenfenheuer

Automatic speech recognition model based on OpenAI's whisper-medium and fine-tuned on the Common Voice 11.0 German dataset

Speech Recognition

Transformers

Language: German
Open Source License: Apache-2.0
Tags: #German Speech Recognition, #High Precision (WER 7.05), #Punctuation Prediction

Downloads 21

Release Time : 1/13/2025

Model Overview

This model is designed for German automatic speech recognition. It predicts capitalization and punctuation, and it requires input audio sampled at 16 kHz.

Model Features

High Precision German Recognition

Achieves a WER (Word Error Rate) of 7.05% on the Common Voice 11.0 German test set

Punctuation Prediction

Automatically predicts capitalization and punctuation, improving the readability of transcribed text

Based on Whisper Architecture

Fine-tuned from OpenAI's whisper-medium model, inheriting its speech recognition capabilities

Model Capabilities

German Speech Recognition

Punctuation Prediction

Capitalization Recognition

Use Cases

Speech Transcription

German Meeting Minutes

Automatically transcribe German meeting recordings into punctuated text records

Highly accurate transcribed text

German Media Subtitle Generation

Automatically generate subtitles for German video content

Accurate subtitle text
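
For the subtitle use case, timestamped transcription segments have to be written out in a subtitle format. The following is a minimal sketch, assuming the segments are available as (start, end, text) tuples with times in seconds; the helper names `srt_timestamp` and `to_srt` are hypothetical and not part of the model or any library.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as the text of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_line}\n{text.strip()}\n")
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 1.5, " Guten Tag."), (1.5, 4.0, " Willkommen zur Sendung.")]
    print(to_srt(demo))
```

Long segments may need to be split further to stay readable on screen; this sketch does not attempt that.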

🚀 Fine-tuned whisper-medium model for ASR in German

This model is a fine-tuned version of openai/whisper-medium, trained on the German dataset from mozilla-foundation/common_voice_11_0. It can be used for Automatic Speech Recognition (ASR) in German. When using the model, ensure that your speech input is sampled at 16 kHz. Notably, this model can also predict casing and punctuation.

This model is a CTranslate2 conversion of bofenghuang/whisper-medium-cv11-german.
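
CTranslate2 conversions of Whisper are commonly loaded through a CTranslate2-based runtime rather than through Transformers. The following is a minimal sketch, assuming the `faster-whisper` package is installed and that `model_path` points to a local copy of the converted model; both are assumptions, since this document does not say which runtime the conversion was tested with. `join_segments` and `transcribe_german` are hypothetical helper names.

```python
from collections import namedtuple


def join_segments(segments):
    """Concatenate the text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments)


def transcribe_german(audio_path, model_path, device="cpu", compute_type="int8"):
    """Transcribe a German audio file with a CTranslate2 Whisper model.

    model_path is an assumption: a local directory holding the converted model.
    """
    # Imported lazily so join_segments stays usable without faster-whisper.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_path, device=device, compute_type=compute_type)
    # transcribe() returns a lazy iterable of segments plus run metadata.
    segments, _info = model.transcribe(
        audio_path, language="de", task="transcribe", beam_size=5
    )
    return join_segments(segments)


if __name__ == "__main__":
    # Stand-in segments, only to show how the output is assembled.
    Segment = namedtuple("Segment", ["start", "end", "text"])
    demo = [Segment(0.0, 1.5, " Guten Morgen."), Segment(1.5, 3.0, " Wie geht es Ihnen?")]
    print(join_segments(demo))
```

The Transformers examples later in this document load the original bofenghuang/whisper-medium-cv11-german checkpoint, not the converted one.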

🚀 Quick Start

This model is designed for Automatic Speech Recognition in German. Make sure your speech input is sampled at 16 kHz.
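
Before transcribing, it can help to confirm that a recording really is sampled at 16 kHz. The following is a minimal sketch using only the Python standard library, assuming the audio is stored as a PCM WAV file; `wav_sample_rate`, `needs_resampling`, and `write_silent_wav` are hypothetical helper names.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate this document says the model requires


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def needs_resampling(path, expected=EXPECTED_RATE):
    """True if the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected


def write_silent_wav(path, rate, seconds=1):
    """Write mono 16-bit silence; used here only to build a demo file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (rate * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo_48k.wav")
        write_silent_wav(demo_path, 48_000)
        print(wav_sample_rate(demo_path), needs_resampling(demo_path))
```

If the check reports a different rate, resample the audio before passing it to the model, for example with `torchaudio.transforms.Resample` as in the low-level example in this document.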

✨ Features

Fine-tuned: Based on openai/whisper-medium, fine-tuned on the German dataset of mozilla-foundation/common_voice_11_0.
Casing and punctuation prediction: The model predicts casing and punctuation in the recognized text.

📚 Documentation

Performance

Below are the WERs of the pre-trained models on Common Voice 9.0, as reported in the original paper.

Model	WER (%) on Common Voice 9.0
openai/whisper-small	13.0
openai/whisper-medium	8.5
openai/whisper-large-v2	6.4

Below are the WERs of the fine-tuned models on Common Voice 11.0.

Model	WER (%) on Common Voice 11.0
bofenghuang/whisper-small-cv11-german	11.35
bofenghuang/whisper-medium-cv11-german	7.05
bofenghuang/whisper-large-v2-cv11-german	5.76
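
For reference, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal single-sentence sketch; it is not the evaluation script behind the numbers above, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Return WER as a fraction: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is missing from the hypothesis.
    print(word_error_rate("das ist ein test", "das ist test"))
```

The function returns a fraction, while the tables report percentages. A corpus-level score is usually computed by summing errors and reference words over all utterances rather than averaging per-sentence values.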

Model Index

Name: Fine-tuned whisper-medium model for ASR in German
- Results:
  - Task:
    - Name: Automatic Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice 11.0
    - Type: mozilla-foundation/common_voice_11_0
    - Config: de
    - Split: test
    - Args: de
  - Metrics:
    - Name: WER (Greedy)
    - Type: wer
    - Value: 7.05

💻 Usage Examples

Basic Usage

Inference with 🤗 Pipeline

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-german", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="de", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# NB: decoding options
# Limit the maximum number of generated tokens to 225
pipe.model.config.max_length = 225 + 1
# Optional: enable sampling instead of greedy decoding
# pipe.model.config.do_sample = True
# Optional: use beam search with 5 beams
# pipe.model.config.num_beams = 5
# Optional: return scores and multiple candidate sequences
# pipe.model.config.return_dict_in_generate = True
# pipe.model.config.output_scores = True
# pipe.model.config.num_return_sequences = 5

# Run
generated_sentences = pipe(waveform)["text"]

Inference with 🤗 low-level APIs

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-german", language="german", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
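
The last comment above leaves normalisation open. Because the model outputs casing and punctuation, predictions are often normalised before they are compared with references that lack them. The following is a minimal sketch, assuming lowercasing and punctuation removal are what is wanted; it is not necessarily the normalisation used for the WER reported in this document, and `normalize_for_wer` is a hypothetical helper name.

```python
import re


def normalize_for_wer(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    # \w is Unicode-aware for str patterns, so umlauts and ß are kept.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_for_wer("Hallo, Welt! Die Straße ist nass."))
```

Note that this also removes apostrophes and hyphens, which joins the parts of hyphenated words; adjust the pattern if those should be kept.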

📄 License

This model is released under the Apache-2.0 license.
