wav2vec2-xls-r-300m-cs-250 Open-source Speech Recognition Model

Wav2vec2 Xls R 300m Cs 250

Developed by comodoro

This is an automatic speech recognition model fine-tuned on Czech datasets based on facebook/wav2vec2-xls-r-300m, supporting 16kHz sampled audio input.

Speech Recognition

Transformers

OtherOpen Source License:Apache-2.0 #Czech speech recognition #Low word error rate #Multi-dataset training

Downloads 248.66k

Release Time : 3/2/2022

Model Overview

This model is designed for Czech automatic speech recognition, fine-tuned on datasets like Common Voice 8.0, and can be used directly or with a language model.

Model Features

Multi-dataset training

Trained on multiple Czech datasets including Common Voice 8.0, OVM, PSCR, and Vystadial2016

High performance

Achieves 7.3% word error rate and 2.1% character error rate on the Common Voice 8.0 test set

Direct usage

Capable of speech recognition without requiring a language model

Model Capabilities

Czech speech recognition

16kHz sampled audio processing

Direct inference without language model

Use Cases

Speech transcription

Speech to text

Convert Czech speech content into text

Word error rate 7.3%, character error rate 2.1%

Speech analysis

Speech content analysis

Analyze Czech speech content

🚀 Czech wav2vec2-xls-r-300m-cs-250

This model is a fine - tuned version of facebook/wav2vec2-xls-r-300m on the common_voice 8.0 dataset and other listed datasets, aiming at automatic speech recognition in Czech.

🚀 Quick Start

This model is a fine - tuned version of facebook/wav2vec2-xls-r-300m on the common_voice 8.0 dataset as well as other datasets listed below.

It achieves the following results on the evaluation set:

Loss: 0.1271
Wer: 0.1475
Cer: 0.0329

The eval.py script results using a LM are:

WER: 0.07274312090176113
CER: 0.021207369275558875

✨ Features

Fine - Tuned Model: Based on the facebook/wav2vec2-xls-r-300m base model, fine - tuned on multiple Czech datasets.
Good Performance: Achieved relatively low WER and CER on the evaluation set.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "cs", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")
model = Wav2Vec2ForCTC.from_pretrained("comodoro/wav2vec2-xls-r-300m-cs-250")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])

📚 Documentation

Evaluation

The model can be evaluated using the attached eval.py script:

python eval.py --model_id comodoro/wav2vec2-xls-r-300m-cs-250 --dataset mozilla-foundation/common-voice_8_0 --split test --config cs

Training and evaluation data

The Common Voice 8.0 train and validation datasets were used for training, as well as the following datasets:

Šmídl, Luboš and Pražák, Aleš, 2013, OVM – Otázky Václava Moravce, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11858/00-097C-0000-000D-EC98-3.
Pražák, Aleš and Šmídl, Luboš, 2012, Czech Parliament Meetings, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11858/00-097C-0000-0005-CF9C-4.
Plátek, Ondřej; Dušek, Ondřej and Jurčíček, Filip, 2016, Vystadial 2016 – Czech data, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1740.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 32
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e - 08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 800
num_epochs: 5
mixed_precision_training: Native AMP

Training results

Training Loss	Epoch	Step	Validation Loss	Wer	Cer
3.4203	0.16	800	3.3148	1.0	1.0
2.8151	0.32	1600	0.8508	0.8938	0.2345
0.9411	0.48	2400	0.3335	0.3723	0.0847
0.7408	0.64	3200	0.2573	0.2840	0.0642
0.6516	0.8	4000	0.2365	0.2581	0.0595
0.6242	0.96	4800	0.2039	0.2433	0.0541
0.5754	1.12	5600	0.1832	0.2156	0.0482
0.5626	1.28	6400	0.1827	0.2091	0.0463
0.5342	1.44	7200	0.1744	0.2033	0.0468
0.4965	1.6	8000	0.1705	0.1963	0.0444
0.5047	1.76	8800	0.1604	0.1889	0.0422
0.4814	1.92	9600	0.1604	0.1827	0.0411
0.4471	2.09	10400	0.1566	0.1822	0.0406
0.4509	2.25	11200	0.1619	0.1853	0.0432
0.4415	2.41	12000	0.1513	0.1764	0.0397
0.4313	2.57	12800	0.1515	0.1739	0.0392
0.4163	2.73	13600	0.1445	0.1695	0.0377
0.4142	2.89	14400	0.1478	0.1699	0.0385
0.4184	3.05	15200	0.1430	0.1669	0.0376
0.3886	3.21	16000	0.1433	0.1644	0.0374
0.3795	3.37	16800	0.1426	0.1648	0.0373
0.3859	3.53	17600	0.1357	0.1604	0.0361
0.3762	3.69	18400	0.1344	0.1558	0.0349
0.384	3.85	19200	0.1379	0.1576	0.0359
0.3762	4.01	20000	0.1344	0.1539	0.0346
0.3559	4.17	20800	0.1339	0.1525	0.0351
0.3683	4.33	21600	0.1315	0.1518	0.0342
0.3572	4.49	22400	0.1307	0.1507	0.0342
0.3494	4.65	23200	0.1294	0.1491	0.0335
0.3476	4.81	24000	0.1287	0.1491	0.0336
0.3475	4.97	24800	0.1271	0.1475	0.0329

Framework versions

Transformers 4.16.2
Pytorch 1.10.1+cu102
Datasets 1.18.3
Tokenizers 0.11.0

📄 License

The model is released under the apache - 2.0 license.

📦 Model Information

Property	Details
Model Type	Czech comodoro Wav2Vec2 XLSR 300M 250h data
Base Model	facebook/wav2vec2-xls-r-300m
Training Datasets	mozilla - foundation/common_voice_8_0, ovm, pscr, vystadial2016
Evaluation Results	Common Voice 8: Test WER = 7.3, Test CER = 2.1 Robust Speech Event - Dev Data: Test WER = 43.44 Robust Speech Event - Test Data: Test WER = 38.5

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご