Whisper-small-cv11-german: Open-Source German Speech Recognition Model with Casing and Punctuation Prediction

Whisper Small Cv11 German

Developed by bofenghuang

A speech recognition model based on openai/whisper-small and fine-tuned on the Common Voice 11.0 German dataset. It supports German speech-to-text with casing and punctuation prediction.

Speech Recognition

Transformers

German | Open Source License: Apache-2.0 | #German speech recognition #Punctuation prediction #Fine-tuning optimization

Downloads 67

Release Time: 12/18/2022

Model Overview

This is an automatic speech recognition (ASR) model optimized for German, fine-tuned based on the Whisper-small architecture, suitable for German speech transcription tasks.

Model Features

German Optimization

Fine-tuned on German speech data. It reports a WER of 11.35% on Common Voice 11.0, against 13.0% reported for the original Whisper-small on Common Voice 9.0; the test sets differ, so the comparison is indicative only.

Punctuation Prediction

Automatically predicts capitalization and punctuation, generating more standardized text output.

Efficient Inference

Smaller than the medium and large-v2 variants, so inference is faster, at the cost of a higher WER (11.35% versus 7.05% and 5.76% on Common Voice 11.0).

Model Capabilities

German speech recognition

Speech-to-text

Punctuation prediction

Case conversion

Use Cases

Speech Transcription

Meeting Minutes

Automatically transcribe German meeting recordings into text records

Word Error Rate: 11.35% (measured on Common Voice 11.0, not on meeting recordings)

Media Subtitle Generation

Automatically generate subtitles for German video content

Voice Assistants

German Voice Input

Provide speech recognition capabilities for German voice assistants

🚀 Fine-tuned whisper-small model for ASR in German

This model is a fine-tuned version of openai/whisper-small, trained on the mozilla-foundation/common_voice_11_0 German dataset. Make sure that your speech input is sampled at 16 kHz when using this model. The model also predicts casing and punctuation.
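The 16 kHz requirement can be checked before any audio reaches the model. The following is a minimal sketch that assumes the input is an uncompressed PCM WAV file readable by Python's standard `wave` module; the function names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model or of any library.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sample rate the model expects, in Hz


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio should be resampled to 16 kHz before transcription, for example with `torchaudio.transforms.Resample` as in the low-level usage example.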

🚀 Quick Start

This fine-tuned model is designed for Automatic Speech Recognition (ASR) in German. It can be used either through the 🤗 pipeline or through the low-level 🤗 APIs, as shown under Usage Examples.

✨ Features

Fine-tuned on a German dataset: Trained on the mozilla-foundation/common_voice_11_0 German dataset for better German ASR performance.
Casing and punctuation prediction: The recognized text is produced with capitalization and punctuation.

📦 Installation

The original README gives no installation steps. The usage examples need the Python libraries torch, torchaudio, datasets, and transformers, which can be installed with pip:

pip install torch torchaudio datasets transformers

💻 Usage Examples

Basic Usage

Inference with 🤗 Pipeline

import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-small-cv11-german", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="de", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# NB: decoding option
# limit the maximum number of generated tokens to 225
pipe.model.config.max_length = 225 + 1
# sampling
# pipe.model.config.do_sample = True
# beam search
# pipe.model.config.num_beams = 5
# return several hypotheses with their scores (combine with sampling or beam search)
# pipe.model.config.return_dict_in_generate = True
# pipe.model.config.output_scores = True
# pipe.model.config.num_return_sequences = 5

# Run
generated_sentences = pipe(waveform)["text"]

Advanced Usage

Inference with 🤗 low-level APIs

import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-small-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-small-cv11-german", language="german", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
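The low-level example ends with a reminder to normalise the predicted sentences if necessary. Because the model outputs casing and punctuation, one plausible normalisation is to lowercase the text and strip punctuation before comparing it with lowercase, unpunctuated references. The sketch below is an assumption about what such a step could look like, not the normalisation used by the model's author; `normalise_text` is a hypothetical helper.

```python
import re


def normalise_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # \w matches Unicode letters in Python 3, so umlauts and ß are kept
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())
```

Note that this simple rule also splits contractions such as "geht's" into two tokens, which may or may not be wanted depending on the reference transcripts.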

📚 Documentation

Performance

Below are the word error rates (WER, in %) of the pre-trained models on Common Voice 9.0, as reported in the original Whisper paper.

Model	Common Voice 9.0
openai/whisper-small	13.0
openai/whisper-medium	8.5
openai/whisper-large-v2	6.4

Below are the WERs (in %) of the fine-tuned models on Common Voice 11.0.

Model	Common Voice 11.0
bofenghuang/whisper-small-cv11-german	11.35
bofenghuang/whisper-medium-cv11-german	7.05
bofenghuang/whisper-large-v2-cv11-german	5.76
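For readers unfamiliar with the metric in these tables, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition; it is not the evaluation script behind the reported numbers, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

The function returns a fraction; multiplying by 100 gives a percentage comparable in form to the table entries. Scores depend on whether casing and punctuation are kept or removed before scoring, so the same transcripts can yield different values under different normalisation choices.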

📄 License

This model is licensed under the Apache 2.0 license.
