wav2vec2-large-xlsr-53-tatar Open-Source Speech Recognition Model

Developed by anton-l

A Tatar speech recognition model, fine-tuned from Facebook's wav2vec2-large-xlsr-53 on the Common Voice Tatar dataset.

Speech Recognition · Open-source license: Apache-2.0 · Tags: Tatar speech recognition, low-resource language support, WER 26.76%

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Tatar, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint. It expects speech input sampled at 16 kHz.

Model Features

Dedicated Tatar Speech Recognition

A speech recognition model specifically optimized for Tatar, achieving a WER of 26.76% on the Common Voice Tatar test set

Based on XLSR Architecture

Builds on XLSR-53 cross-lingual speech representations, pretrained on unlabeled speech in 53 languages, and adapts them to Tatar through fine-tuning

No Language Model Required

Transcripts come straight from the model's CTC output (argmax over frames, then decoding), so no external language model is needed
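Decoding without a language model amounts to greedy CTC decoding: take the most likely label per frame, merge consecutive repeats, then drop the blank symbol. The following is a minimal sketch of that idea in plain Python, not the tokenizer's actual implementation. The four-symbol vocabulary and the blank id of 0 are assumptions made for illustration; the real vocabulary and ids come from the model's processor.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Merge consecutive repeated labels, then drop the CTC blank."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical vocabulary for illustration only; id 0 plays the role of the blank.
vocab = {0: "<pad>", 1: "с", 2: "ү", 3: "з"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one clip.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)

# A blank between two identical labels keeps both letters.
doubled = ctc_greedy_decode([1, 0, 1], vocab)
print(doubled)
```

The blank between identical labels is what lets CTC emit genuinely doubled letters, which is why repeats are merged before blanks are removed rather than after.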

Model Capabilities

Tatar speech recognition

Speech-to-text

16kHz audio processing

Use Cases

Speech Transcription

Tatar Speech Transcription

Convert Tatar speech content into text

Achieves a 26.76% word error rate on the Common Voice test set

Voice Assistants

Tatar Voice Command Recognition

Speech recognition module for Tatar voice assistants or voice control systems

🚀 Wav2Vec2-Large-XLSR-53-Tatar

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Tatar using the Common Voice dataset. It can be used for automatic speech recognition of the Tatar language.

📦 Model Information

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53-Tatar
Training Data	Common Voice (train and validation datasets)
Metrics	Word Error Rate (WER)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

📊 Model Index

Name: Tatar XLSR Wav2Vec2 Large 53 by Anton Lozhkov
Results:
- Task:
  - Name: Speech Recognition
  - Type: automatic-speech-recognition
- Dataset:
  - Name: Common Voice tt
  - Type: common_voice
  - Args: tt
- Metrics:
  - Name: Test WER
  - Type: wer
  - Value: 26.76
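The Test WER value above is a word error rate: the number of word-level substitutions, deletions, and insertions needed to turn the predictions into the references, divided by the number of reference words. Here is a minimal, dependency-free sketch of that calculation. It is an illustration only and may differ in details from the `wer` metric used in the evaluation script; the example sentences are made up.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Word-level Levenshtein distance, computed one row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                curr.append(min(
                    prev[j] + 1,                           # deletion
                    curr[j - 1] + 1,                       # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up sentences: one substituted word out of five reference words.
refs = ["бу минем китабым", "ул өйдә"]
hyps = ["бу синең китабым", "ул өйдә"]
score = word_error_rate(refs, hyps)
print("WER: {:.2f}".format(100 * score))
```

On this scale, a test WER of 26.76 means roughly one word edit for every four reference words.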

🚀 Quick Start

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Tatar using the Common Voice dataset. Make sure that your speech input is sampled at 16 kHz before passing it to the model.
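One way to guard against feeding audio at the wrong rate is to check the file's sampling rate first. The sketch below uses only Python's standard `wave` module and therefore assumes uncompressed WAV input; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, and the demo clip is synthetic silence. For other formats, the rate returned by `torchaudio.load` in the usage example serves the same purpose.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects


def wav_sample_rate(path):
    """Read the sampling rate from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's rate differs from the model's expected rate."""
    return wav_sample_rate(path) != target_rate


# Demo: write one second of 48 kHz mono silence, then inspect it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "clip_48k.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(48_000)
        wav_file.writeframes(b"\x00\x00" * 48_000)
    detected_rate = wav_sample_rate(demo_path)
    must_resample = needs_resampling(demo_path)

print(detected_rate, must_resample)
```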

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "tt", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-tatar")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-tatar")

# The clips are assumed to be 48 kHz; resample them to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

🔧 Evaluation

The model can be evaluated as follows on the Tatar test data of Common Voice.

import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space 
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tt.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-tatar")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-tatar")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/tt/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/tt/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # 'ё' is equivalent to 'е'
    sent = sent.replace('ё', 'е')
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

# Report WER as a percentage with two decimal places
print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))

Test Result: 26.76%
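The reported score depends on how the reference sentences are normalized before scoring. As a small illustration, the snippet below repeats the `clean_sentence` function from the evaluation script and applies it to two made-up strings, showing the lowercasing, the 'ё' to 'е' mapping, and the removal of punctuation and extra spaces.

```python
def clean_sentence(sent):
    sent = sent.lower()
    # 'ё' is treated as equivalent to 'е'
    sent = sent.replace('ё', 'е')
    # replace non-alphabetic characters with a space
    sent = "".join(ch if ch.isalpha() else " " for ch in sent)
    # collapse repeated spaces
    sent = " ".join(sent.split())
    return sent


# Made-up inputs, for illustration only.
samples = ["Ёлка, ёлка!", "Сәлам,  дөнья!"]
cleaned = [clean_sentence(s) for s in samples]
print(cleaned)
```

Tatar-specific letters such as 'ә' and 'ө' pass through unchanged because `str.isalpha` treats them as alphabetic.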

📚 Training

The Common Voice train and validation splits were used for training.

📄 License

This model is licensed under the Apache-2.0 license.
