Open-source model wav2vec2-large-xlsr-kazakh: automatic speech recognition for Kazakh

Wav2vec2 Large Xlsr Kazakh

Developed by aismlv

This is a Kazakh automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on the Kazakh speech corpus v1.1 with a test WER of 19.65%.

Category: Speech Recognition. Open Source License: Apache-2.0. Tags: Kazakh speech recognition, low word error rate, XLSR fine-tuning

Downloads 12.08k

Release Time : 3/2/2022

Model Overview

This model is designed for automatic speech recognition in Kazakh and expects audio input sampled at 16 kHz.

Model Features

High-accuracy Kazakh recognition

Achieves a word error rate (WER) of 19.65% on the Kazakh speech corpus v1.1

Based on XLSR-53 architecture

Fine-tuned from XLSR-53, a large-scale model for cross-lingual speech representation learning

No language model required

Can be used directly without additional language model support

Model Capabilities

Kazakh speech recognition

16kHz audio processing

Use Cases

Speech-to-text

Kazakh speech transcription

Convert Kazakh speech content into text

Word error rate 19.65%

Voice assistant

Kazakh voice command recognition

Used for command recognition in Kazakh voice assistant systems

🚀 Wav2Vec2-Large-XLSR-53-Kazakh

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 for Kazakh Automatic Speech Recognition (ASR) using the Kazakh Speech Corpus v1.1.

Key Information

Property	Details
Model Type	Audio, Automatic Speech Recognition
Training Data	Kazakh Speech Corpus v1.1
Base Model	facebook/wav2vec2-large-xlsr-53
Metrics	Word Error Rate (WER)

Model Index

Name: Wav2Vec2-XLSR-53 Kazakh by adilism
Results:
- Task:
  - Type: Automatic Speech Recognition
  - Name: Speech Recognition
- Dataset:
  - Name: Kazakh Speech Corpus v1.1
  - Type: kazakh_speech_corpus
  - Args: kk
- Metrics:
  - Type: WER
  - Value: 19.65
  - Name: Test WER

⚠️ Important Note

When using this model, make sure that your speech input is sampled at 16kHz.
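One way to check the requirement above before running inference is to read the sampling rate from the file header. The following is a minimal sketch that assumes the input is an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or its libraries.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != target_rate
```

Files for which `needs_resampling` returns True would need to be resampled to 16 kHz first, for example with torchaudio's Resample transform as in the usage examples.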

🚀 Quick Start

The model can be used directly (without a language model) as described below.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# get_test_dataset comes from a local utils helper module,
# not from any of the libraries imported above
from utils import get_test_dataset

test_dataset = get_test_dataset("ISSAI_KSC_335RS_v1.1")

# Use the full Hub id, matching the evaluation script
processor = Wav2Vec2Processor.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
model = Wav2Vec2ForCTC.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")


# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
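The argmax-then-decode step above is greedy CTC decoding, which is why no language model is needed: the most likely token is picked per audio frame, consecutive repeats are collapsed, and blank tokens are dropped. The sketch below illustrates that idea on a toy example. The vocabulary, the frame ids, and the function name `ctc_greedy_decode` are assumptions made for illustration; the real processor also handles word delimiters and special tokens.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated per-frame ids and drop blanks (greedy CTC decoding)."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
toy_vocab = {0: "<blank>", 1: "с", 2: "ә", 3: "л", 4: "е", 5: "м"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce for one utterance
frames = [1, 1, 0, 2, 3, 3, 0, 4, 4, 5]
decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)  # сәлем
```

A blank between two identical ids keeps both characters, so `[3, 0, 3]` decodes to a doubled letter rather than a single one.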

🔧 Evaluation

The model can be evaluated on the test set of Kazakh Speech Corpus v1.1. To evaluate, download the corpus archive, extract it, and pass the path of the extracted data to get_test_dataset as shown below:

import re

import torch
import torchaudio
# Note: load_metric was deprecated and later removed from `datasets`;
# on newer versions, evaluate.load("wer") from the `evaluate` package replaces it
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# get_test_dataset comes from a local utils helper module,
# not from any of the libraries imported above
from utils import get_test_dataset

test_dataset = get_test_dataset("ISSAI_KSC_335RS_v1.1")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
model = Wav2Vec2ForCTC.from_pretrained("adilism/wav2vec2-large-xlsr-kazakh")
model.to("cuda")

# Assumption: the original snippet uses chars_to_ignore_regex without defining it.
# This punctuation pattern follows the common XLSR fine-tuning template;
# adjust it to match the normalization used during training.
chars_to_ignore_regex = r'[,?.!\-;:"“]'


# Preprocessing the datasets.
# We need to normalize the transcripts and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    # Feed the audio arrays (the "speech" column), not the transcripts
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# wer.compute returns a fraction; multiply by 100 and print with two decimals
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 19.65%
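For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The following is a minimal pure-Python sketch for a single sentence pair; the function name `word_error_rate` is hypothetical, and this is an illustration of the metric rather than the implementation behind the reported result.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substitution (b -> x) and one deletion (d) over four reference words
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

A WER of 19.65% therefore means roughly one word-level edit for every five reference words. Corpus-level figures are typically computed by summing edits over all sentences before dividing, not by averaging per-sentence rates.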

🔧 Training

The Kazakh Speech Corpus v1.1 train dataset was used for training.

📄 License

This project is licensed under the Apache-2.0 license.
