wav2vec2-large-xlsr-53-sakha: An open-source Yakut (Sakha) speech recognition model with a 32.23% word error rate on the Common Voice test set

Wav2vec2 Large Xlsr 53 Sakha

Developed by anton-l

Yakut speech recognition model fine-tuned from the XLSR-53 large model, with a 32.23% word error rate on the Common Voice test set

Category: Speech Recognition, Other
Open Source License: Apache-2.0
Tags: #Yakut speech recognition #Low-resource language processing #XLSR-53 fine-tuning

Downloads 25

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for the Yakut language (also called Sakha). It was fine-tuned from Facebook's wav2vec2-large-xlsr-53 model on the Common Voice dataset.

Model Features

Low-resource language support

Fine-tuned specifically for Yakut, a low-resource language

No language model required

Can be used directly without additional language model support

16kHz sampling rate support

Expects audio input sampled at 16 kHz; audio at other rates must be resampled first

Model Capabilities

Yakut speech recognition

Speech-to-text

Automatic speech transcription

Use Cases

Speech transcription

Yakut speech transcription

Convert Yakut speech content into text

Word error rate of 32.23% on the Common Voice Sakha test set

Voice assistant

Yakut voice command recognition

Basic recognition function for Yakut voice assistants

🚀 Wav2Vec2-Large-XLSR-53-Sakha

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Sakha using the Common Voice dataset. It provides a solution for automatic speech recognition in the Sakha language.

📋 Model Information

Property	Details
Language	Sakha
Datasets	Common Voice
Metrics	WER (Word Error Rate)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	Apache-2.0
Model Name	Sakha XLSR Wav2Vec2 Large 53 by Anton Lozhkov
Results	Task: Speech Recognition (automatic-speech-recognition); Dataset: Common Voice sah; Metric: Test WER = 32.23%

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
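
As a quick pre-flight check, the sampling rate of an uncompressed WAV file can be read with Python's standard `wave` module. This is only a minimal sketch: the helper name `needs_resampling` is hypothetical and not part of the model card, and the `wave` module handles PCM WAV files only, so other formats need a loader such as `torchaudio.load`, which the usage examples use.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per this model card


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True if the WAV file's sampling rate differs from target_rate."""
    # Hypothetical helper for illustration; reads only the WAV header.
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the check returns True, the audio should be resampled before inference, for example with `torchaudio.transforms.Resample` as in the usage examples.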

✨ Features

Fine-tuned on the Sakha language using the Common Voice dataset.
Can be used directly for automatic speech recognition without a language model.

📦 Installation

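The original model card lists no installation steps. The usage examples below import torch, torchaudio, datasets, transformers, pandas, and tqdm, so these packages must be available in your environment.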

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "sah", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-sakha")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-sakha")

# The clips are assumed to be sampled at 48 kHz; resample to the 16 kHz the model expects
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
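
The example above takes the most likely token per audio frame with `torch.argmax` and leaves the rest to `processor.batch_decode`. Conceptually, greedy CTC decoding merges consecutive repeated ids and then drops the blank token. The sketch below illustrates that idea in plain Python; the toy vocabulary, the blank id, and the function names are assumptions made for illustration and do not reflect the model's real vocabulary or the library's internal code.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real model ships its own
# vocabulary, and the blank id used here is an assumption.
BLANK_ID = 0
VOCAB = {1: "a", 2: "b", 3: " "}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive duplicate ids, then drop CTC blank tokens."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


def ids_to_text(token_ids, vocab=VOCAB):
    """Map token ids to characters using the toy vocabulary."""
    return "".join(vocab[token_id] for token_id in token_ids)


# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [1, 1, 0, 1, 2, 2, 0, 0, 3, 1]
print(ids_to_text(ctc_greedy_collapse(frame_ids)))  # prints "aab a"
```

The blank between the two runs of `1` is what lets the output keep a doubled letter: without it, the repeated ids would merge into a single "a".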

Advanced Usage

The model can be evaluated as follows on the Sakha test data of Common Voice.

import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space 
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/sah.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

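# Note: recent releases of datasets no longer ship load_metric; there, evaluate.load("wer") is the usual replacement
wer = load_metric("wer")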

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-sakha")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-sakha")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/sah/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/sah/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))

Test Result: 32.23 %
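
For readers unfamiliar with the metric: word error rate is the number of word substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the number of reference words. A result of 32.23% therefore means roughly 32 word errors per 100 reference words. The following is a minimal sketch of that calculation, assuming whitespace tokenization and aggregation over the whole corpus; the function names are hypothetical, and the `wer` metric used in the evaluation script may differ in its details.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two word lists."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a predicted word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words, as a fraction."""
    errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        errors += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    return errors / total_words


# Made-up sentences: one substitution plus one deletion over six reference words.
print(corpus_wer(["a b c d", "e f"], ["a x c d", "e"]))
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figure reported above.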

📚 Documentation

The Common Voice train and validation datasets were used for training.

📄 License

This model is licensed under the Apache-2.0 license.
