Wav2vec2 Large Xlsr German
An automatic speech recognition (ASR) model for German, fine-tuned from Facebook's wav2vec2-large-xlsr-53 on the Common Voice German dataset.
Downloads 253
Release Time : 3/2/2022
Model Overview
This model transcribes German speech to text and is suited to applications that need speech-to-text conversion.
Model Features
High-precision German recognition
Achieved a WER (Word Error Rate) of 12.77% on the Common Voice German test set.
Based on the XLSR architecture
Uses facebook/wav2vec2-large-xlsr-53, a wav2vec 2.0 model pretrained on speech in 53 languages, as the base model.
No need for a language model
Transcripts come from greedy CTC decoding of the model output, so no external language model is required.
Model Capabilities
German speech recognition
16kHz audio processing
Batch speech-to-text conversion
Use Cases
Speech transcription
German meeting records
Automatically convert German meeting recordings into text records.
Expect a WER near 12.77% on read speech similar to Common Voice; accuracy on conversational meeting audio is not reported and may differ.
Voice assistant
Provide speech recognition capabilities for German voice assistants.
Education
Language learning application
Help learners practice German pronunciation and listening.
🚀 Wav2Vec2-Large-XLSR-53-German
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on German using the Common Voice dataset. It is designed for automatic speech recognition tasks.
📋 Metadata
Property | Details |
---|---|
Datasets | common_voice |
Metrics | wer |
Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
License | apache-2.0 |
📊 Model Index
- Name: XLSR Wav2Vec2 Large 53 CV-de
- Results:
  - Task:
    - Name: Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice de
    - Type: common_voice
    - Args: de
  - Metrics:
    - Name: Test WER
    - Type: wer
    - Value: 12.77
🚀 Quick Start
When using this model, make sure that your speech input is sampled at 16kHz.
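One way to check this requirement before running inference is to read the sampling rate from the file header. The sketch below is illustrative only: it uses Python's standard `wave` module, so it assumes uncompressed WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical rather than part of the model's tooling.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file must be resampled before it is passed to the model."""
    return wav_sample_rate(path) != target_rate
```

Files for which `needs_resampling` returns True would have to be resampled first, for example with a `torchaudio.transforms.Resample` instance built from the file's actual rate.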
💻 Usage Examples
Basic Usage
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
test_dataset = load_dataset("common_voice", "de", split="test[:8]") # use a batch of 8 for demo purposes
processor = Wav2Vec2Processor.from_pretrained("maxidl/wav2vec2-large-xlsr-german")
model = Wav2Vec2ForCTC.from_pretrained("maxidl/wav2vec2-large-xlsr-german")
resampler = torchaudio.transforms.Resample(48_000, 16_000)
"""
Preprocessing the dataset by:
- loading audio files
- resampling to 16kHz
- converting to array
- prepare input tensor using the processor
"""
def speech_file_to_array_fn(batch):
    # Load the clip and resample it with the fixed 48 kHz -> 16 kHz resampler defined above.
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
# run the forward pass without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"])
"""
Example Result:
Prediction: [
'zieh durch bittet draußen die schuhe aus',
'es kommt zugvorgebauten fo',
'ihre vorterstrecken erschienen it modemagazinen wie der voge karpes basar mariclair',
'fürliepert eine auch für manachen ungewöhnlich lange drittelliste',
'er wurde zu ehren des reichskanzlers otto von bismarck errichtet',
'was solls ich bin bereit',
'das internet besteht aus vielen computern die miteinander verbunden sind',
'der uranus ist der siebinteplanet in unserem sonnensystem s'
]
Reference: [
'Zieht euch bitte draußen die Schuhe aus.',
'Es kommt zum Showdown in Gstaad.',
'Ihre Fotostrecken erschienen in Modemagazinen wie der Vogue, Harper’s Bazaar und Marie Claire.',
'Felipe hat eine auch für Monarchen ungewöhnlich lange Titelliste.',
'Er wurde zu Ehren des Reichskanzlers Otto von Bismarck errichtet.',
'Was solls, ich bin bereit.',
'Das Internet besteht aus vielen Computern, die miteinander verbunden sind.',
'Der Uranus ist der siebente Planet in unserem Sonnensystem.'
]
"""
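The `torch.argmax` plus `processor.batch_decode` step above is greedy CTC decoding. The sketch below illustrates the idea in plain Python, under stated assumptions: the vocabulary is a toy one invented for this example, with `<pad>` standing in for the CTC blank and `|` for the word boundary. The real symbol table ships with the processor and is much larger.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration; the real one ships with the processor.
BLANK = "<pad>"        # assumed CTC blank symbol
WORD_DELIMITER = "|"   # assumed word-boundary symbol
VOCAB = [BLANK, WORD_DELIMITER, "a", "b", "d", "s"]


def greedy_ctc_decode(frame_ids, vocab=VOCAB):
    """Collapse repeated frame labels, drop blanks, and map '|' to a space."""
    # Step 1: merge runs of identical ids (one label usually spans several frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks, which only serve to separate genuine repeats.
    symbols = [vocab[i] for i in collapsed if vocab[i] != BLANK]
    # Step 3: turn word delimiters into spaces.
    return "".join(symbols).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for one clip, as torch.argmax(logits, dim=-1) would yield.
frames = [4, 4, 2, 0, 5, 5, 1, 1, 2, 0, 3, 3]
print(greedy_ctc_decode(frames))  # prints "das ab"
```

Because each frame is decoded independently and no language model rescoring is applied, acoustically plausible misspellings such as those in the example result can survive into the transcript.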
🔧 Evaluation
The model can be evaluated as follows on the German test data of Common Voice:
import re
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
"""
Evaluation on the full test set:
- takes ~20mins (RTX 3090).
- requires ~170GB RAM to compute the WER. Below, we use a chunked implementation of WER to avoid large RAM consumption.
"""
test_dataset = load_dataset("common_voice", "de", split="test") # use "test[:1%]" for 1% sample
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("maxidl/wav2vec2-large-xlsr-german")
model = Wav2Vec2ForCTC.from_pretrained("maxidl/wav2vec2-large-xlsr-german")
model.to("cuda")
chars_to_ignore_regex = '[\\,\\?\\.\\!\\-\\;\\:\\\"\\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets.
# We need to normalize the reference text and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Running inference.
# We run the model on each batch and decode the greedy predictions to strings
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8) # batch_size=8 -> requires ~14.5GB GPU memory
# non-chunked version:
# print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
# WER: 12.900291
# Chunked version, see https://discuss.huggingface.co/t/spanish-asr-fine-tuning-wav2vec2/4586/5:
import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    # Accumulate hit/substitution/deletion/insertion counts chunk by chunk to limit RAM use.
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    return float(S + D + I) / float(H + S + D)

# Note: this call passes the predictions in the `targets` position and the references in the
# `predictions` position; the figure below was produced with the call as written.
print("Total (chunk_size=1000), WER: {:2f}".format(100 * chunked_wer(result["pred_strings"], result["sentence"], chunk_size=1000)))
# Total (chunk=1000), WER: 12.768981
Test Result: WER: 12.77 %
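To make the metric concrete, the sketch below normalizes one reference from the example result and scores it against its prediction. It is a minimal illustration rather than the evaluation code itself: `CHARS_TO_IGNORE` is intended to match the same characters as `chars_to_ignore_regex`, written as a raw string, and `word_edit_distance` is a hypothetical helper standing in for what jiwer computes. WER is the number of word substitutions, deletions and insertions divided by the number of reference words, which equals H + S + D in the formula above.

```python
import re

# Same character class as chars_to_ignore_regex in the script above, written as a raw string.
CHARS_TO_IGNORE = r'[,?.!\-;:"“]'


def normalize(sentence):
    """Strip the ignored punctuation and lowercase, as the evaluation script does."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    return word_edit_distance(reference, hypothesis) / len(reference.split())


reference = normalize("Der Uranus ist der siebente Planet in unserem Sonnensystem.")
hypothesis = "der uranus ist der siebinteplanet in unserem sonnensystem s"
print("WER: {:.2f} %".format(100 * wer(reference, hypothesis)))  # prints "WER: 33.33 %"
```

Here the merged word `siebinteplanet` costs one substitution and one deletion, and the trailing `s` costs one insertion, giving 3 errors over 9 reference words.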
🏋️♂️ Training
The Common Voice German train and validation splits were used for training.
The script used for training can be found here.
The model was trained for 50k steps, taking around 30 hours on a single A100.
The arguments used for training this model are:
python run_finetuning.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
--dataset_config_name="de" \
--output_dir=./wav2vec2-large-xlsr-german \
--preprocessing_num_workers="16" \
--overwrite_output_dir \
--num_train_epochs="20" \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="32" \
--learning_rate="1e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--save_steps="5000" \
--eval_steps="5000" \
--logging_steps="1000" \
--save_total_limit="3" \
--freeze_feature_extractor \
--activation_dropout="0.055" \
--attention_dropout="0.094" \
--feat_proj_dropout="0.04" \
--layerdrop="0.04" \
--mask_time_prob="0.08" \
--gradient_checkpointing="1" \
--fp16 \
--do_train \
--do_eval \
--dataloader_num_workers="16" \
--group_by_length
📄 License
This project is licensed under the Apache-2.0 license.