# xls-r-300m-km
This is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Khmer automatic speech recognition, trained and evaluated on the OpenSLR dataset.
## Quick Start
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the OpenSLR dataset. It achieves the following results on the evaluation set, a self-split 10% "test" portion of OpenSLR (running ./eval.py):

Without language model:
- WER: 0.3216977389924633
- CER: 0.08653361193169537

With language model:
- WER: 0.257040856802856
- CER: 0.07025001801282513
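The ./eval.py script itself is not reproduced in this card. As a rough illustration of how scores like these can be computed, here is a minimal sketch using the jiwer library; the strings are hypothetical placeholders, not actual model outputs:

```python
import jiwer

# Hypothetical references and predictions; an eval script would gather these over the test split
references = ["example reference transcription", "another reference"]
predictions = ["example reference transcriptions", "another refrence"]

print("WER:", jiwer.wer(references, predictions))  # word error rate
print("CER:", jiwer.cer(references, predictions))  # character error rate
```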
## Features
- Fine-tuned model: based on facebook/wav2vec2-xls-r-300m, fine-tuned on the OpenSLR dataset.
- Good performance: achieves reasonably low WER and CER on the evaluation set despite very little training data.
## Installation
Install the following libraries on top of Hugging Face Transformers to enable language-model decoding:
```bash
pip install pyctcdecode
pip install https://github.com/kpu/kenlm/archive/master.zip
```
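Here, pyctcdecode provides the CTC beam-search decoder and KenLM provides the n-gram language model used for the "with language model" results above.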
## Usage Examples
### Basic Usage
Using Hugging Face's pipeline, this covers everything end-to-end, from raw audio input to text output.
```python
from transformers import pipeline

pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
```
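Here chunk_length_s breaks long recordings into 10-second chunks, while stride_length_s=(4, 2) keeps a 4-second left and 2-second right overlap between chunks so that words at chunk boundaries are not cut off.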
### Advanced Usage
A more hands-on approach: run the model yourself and decode its predictions.
```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")

# Load the audio and resample to the 16 kHz rate the model expects
speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy CTC decoding: take the most likely token at each frame, then collapse repeats/blanks
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
print(predicted_sentences)
```
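The LM-rescored results above come from beam-search decoding with a KenLM language model. Here is a minimal sketch of that decoding path, assuming the model repository ships the decoder files that Wav2Vec2ProcessorWithLM expects (see the Installation section for the required libraries):

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Loads the tokenizer, feature extractor, and (if present in the repo) the KenLM beam-search decoder
processor = Wav2Vec2ProcessorWithLM.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")

speech_array, _ = librosa.load("sound_file.wav", sr=16_000)
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Beam-search decode over the raw logits instead of taking a greedy argmax
transcription = processor.batch_decode(logits.numpy()).text
print(transcription)
```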
## Documentation
### Intended uses & limitations
The data used for this model is only around 4 hours of recordings.
- We split it 80/10/10, so the training set is about 3.2 hours, which is very small (a sketch of such a split is shown after this list).
- Yet its performance is not bad at all; quite interesting for such a small dataset, actually. You can try it out.
- Its limitations are:
  - Rare characters, e.g. แฌแแแแธ แชแกแนแ
  - Speech needs to be clear and articulate.
- More data covering more vocabulary and characters may help improve this system.
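For illustration only, an 80/10/10 split like the one described above can be produced with the datasets library. This is a hypothetical sketch; the exact OpenSLR subset and preprocessing used for training are not specified in this card:

```python
from datasets import load_dataset

# Hypothetical config: SLR42 is the Khmer subset of OpenSLR on the Hugging Face Hub
ds = load_dataset("openslr", "SLR42", split="train")

# Carve off 20%, then split that 20% evenly into validation and test (80/10/10 overall)
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, valid_ds, test_ds = split["train"], holdout["train"], holdout["test"]
```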
### Training procedure
#### Training hyperparameters
The following hyperparameters were used during training (a TrainingArguments sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 100
- mixed_precision_training: Native AMP
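For illustration, here is roughly how these hyperparameters map onto transformers.TrainingArguments; the actual training script is not part of this card, and output_dir is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-xls-r-300m-khmer",  # placeholder path
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=4,  # effective (total) train batch size: 8 * 4 = 32
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=100,
    fp16=True,  # native AMP mixed-precision training
)
```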
#### Training results
| Training Loss | Epoch | Step | Validation Loss | WER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 5.0795        | 5.47  | 400  | 4.4121          | 1.0    |
| 3.5658        | 10.95 | 800  | 3.5203          | 1.0    |
| 3.3689        | 16.43 | 1200 | 2.8984          | 0.9996 |
| 2.01          | 21.91 | 1600 | 1.0041          | 0.7288 |
| 1.6783        | 27.39 | 2000 | 0.6941          | 0.5989 |
| 1.527         | 32.87 | 2400 | 0.5599          | 0.5282 |
| 1.4278        | 38.35 | 2800 | 0.4827          | 0.4806 |
| 1.3458        | 43.83 | 3200 | 0.4429          | 0.4532 |
| 1.2893        | 49.31 | 3600 | 0.4156          | 0.4330 |
| 1.2441        | 54.79 | 4000 | 0.4020          | 0.4040 |
| 1.188         | 60.27 | 4400 | 0.3777          | 0.3866 |
| 1.1628        | 65.75 | 4800 | 0.3607          | 0.3858 |
| 1.1324        | 71.23 | 5200 | 0.3534          | 0.3604 |
| 1.0969        | 76.71 | 5600 | 0.3428          | 0.3624 |
| 1.0897        | 82.19 | 6000 | 0.3387          | 0.3567 |
| 1.0625        | 87.66 | 6400 | 0.3339          | 0.3499 |
| 1.0601        | 93.15 | 6800 | 0.3288          | 0.3446 |
| 1.0474        | 98.62 | 7200 | 0.3281          | 0.3462 |
#### Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.2.dev0
- Tokenizers 0.11.0
## License
This model is released under the Apache-2.0 license.