wav2vec2-large-xlsr-kyrgyz Open-source Speech Recognition Model

Developed by aismlv

A Kyrgyz speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Common Voice dataset. It reaches a word error rate (WER) of 34.08% on the Common Voice Kyrgyz test set.

Task: Speech Recognition. License: Apache-2.0. Tags: Kyrgyz speech recognition, low-resource language processing, XLSR fine-tuned model.

Release Time: 3/2/2022

Model Overview

This model is specialized for Kyrgyz speech recognition. It is based on the Wav2Vec2-XLSR architecture and converts Kyrgyz audio into text.

Model Features

High Accuracy Kyrgyz Recognition

A speech recognition model optimized specifically for the Kyrgyz language, achieving a 34.08% word error rate on the Common Voice test set.

Based on XLSR Architecture

Builds on a model pre-trained with large-scale cross-lingual speech representation learning (XLSR), which provides strong speech feature extraction.

16kHz Sampling Rate Support

Optimized for audio sampled at 16 kHz. Make sure your input audio matches this sampling rate before running the model.

Model Capabilities

Kyrgyz speech recognition

Audio to text

Automatic speech transcription

Use Cases

Speech Transcription

Kyrgyz Speech Transcription

Converts Kyrgyz speech into editable text.

Word error rate 34.08%

Voice Assistants

Kyrgyz Voice Command Recognition

Provides speech recognition for Kyrgyz voice assistants.

🚀 Wav2Vec2-Large-XLSR-53-Kyrgyz

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Kyrgyz using the Common Voice dataset, aiming to enhance speech recognition performance in the Kyrgyz language.

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Kyrgyz
Training Data	Common Voice `train` and `validation` datasets
License	apache-2.0

⚠️ Important Note

When using this model, make sure that your speech input is sampled at 16kHz.
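
One way to guard against a sampling-rate mismatch is to read the rate from the file header before running inference. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed WAV input. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model's API. A file at any other rate would still need to be resampled, for example with `torchaudio.transforms.Resample` as in the Quick Start.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # the rate this model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != expected_rate


if __name__ == "__main__":
    # Write one second of 48 kHz silence to a temporary file (demo data only).
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_48k.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(48_000)
        wav_file.writeframes(b"\x00\x00" * 48_000)
    print(wav_sample_rate(demo_path), needs_resampling(demo_path))
```
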

🚀 Quick Start

The model can be used directly (without a language model) as follows:

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ky", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("adilism/wav2vec2-large-xlsr-kyrgyz")
model = Wav2Vec2ForCTC.from_pretrained("adilism/wav2vec2-large-xlsr-kyrgyz")

resampler = torchaudio.transforms.Resample(48_000, 16_000)  # assumes 48 kHz source clips

# Preprocess the dataset:
# read each audio file into an array and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
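
The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse rule in plain Python. The toy vocabulary, the blank id of 0, and the frame values are assumptions made up for illustration; they are not the model's actual vocabulary, and the real processor also handles word delimiters and special tokens.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: id 0 plays the role of the CTC blank in this toy example
TOY_VOCAB = {0: "<blank>", 1: "с", 2: "а", 3: "л", 4: "м"}  # hypothetical vocabulary


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeated ids, then remove blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


def ids_to_text(token_ids, vocab=TOY_VOCAB):
    """Map collapsed ids back to characters using the toy vocabulary."""
    return "".join(vocab[token_id] for token_id in token_ids)


# Per-frame argmax ids for a short utterance (made-up values).
frames = [1, 1, 0, 2, 2, 2, 0, 3, 0, 2, 2, 4, 4]
print(ids_to_text(ctc_greedy_collapse(frames)))
```

A blank between two identical ids keeps both of them, which is how doubled letters survive the merge step.
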

📚 Documentation

Evaluation

The model can be evaluated as follows on the Kyrgyz test data of Common Voice:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "ky", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("adilism/wav2vec2-large-xlsr-kyrgyz")
model = Wav2Vec2ForCTC.from_pretrained("adilism/wav2vec2-large-xlsr-kyrgyz")
model.to("cuda")

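chars_to_ignore = [",", "?", ".", "!", "-", ";", ":", "—", "–", "”"]
# Note: once joined inside [...], "!-;" is parsed as a character range (U+0021 to U+003B),
# so digits and other ASCII punctuation in that range are stripped as well.
chars_to_ignore_regex = f'[{"".join(chars_to_ignore)}]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)  # assumes 48 kHz source clips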

# Preprocess the dataset:
# normalize the reference text, then read each audio file into an array and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and store the decoded predictions.
def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
		logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result (WER): 34.08%
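
For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between prediction and reference, divided by the number of reference words. The sketch below is a minimal pure-Python version, assuming whitespace tokenization and aggregation over the whole corpus. It is meant to illustrate the metric, not to reproduce the implementation behind `load_metric("wer")`, and the example sentences are made up.

```python
def word_edit_distance(reference_words, predicted_words):
    """Levenshtein distance over word lists, computed row by row."""
    previous_row = list(range(len(predicted_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current_row = [i]
        for j, pred_word in enumerate(predicted_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != pred_word)
            deletion = previous_row[j] + 1      # reference word has no match
            insertion = current_row[j - 1] + 1  # extra predicted word
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        reference_words = reference.split()
        total_errors += word_edit_distance(reference_words, prediction.split())
        total_words += len(reference_words)
    return total_errors / total_words


# Made-up sentences: one inserted word and one substituted word over five reference words.
references = ["the cat sat", "hello world"]
predictions = ["the cat sat down", "hello word"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))
```
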

Training

The model was trained on the Common Voice train and validation splits.
