wav2vec2-large-xlsr-53-tatar Open-source Speech Recognition Model - Free Support for 16kHz Tatar Language Speech Input

Wav2vec2 Large Xlsr 53 Tatar

Developed by crang

An automatic speech recognition model for Tatar, fine-tuned from facebook/wav2vec2-large-xlsr-53. It expects audio input sampled at 16 kHz.

Speech Recognition Other

Open Source License: Apache-2.0

#Tatar speech recognition #Low-resource language support #XLSR fine-tuning

Downloads 163

Release Time : 3/2/2022

Model Overview

This automatic speech recognition model targets the Tatar language. It is fine-tuned from the XLSR-53 checkpoint and is suited to Tatar speech-to-text tasks.

Model Features

Tatar language optimization

Specially fine-tuned for Tatar language to improve recognition accuracy

No language model required

Can be used directly without additional language model support

16kHz sampling rate support

Expects audio input sampled at 16 kHz

Model Capabilities

Tatar speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech transcription

Tatar speech transcription

Convert Tatar speech content into text

Word Error Rate (WER): 30.93% on the test data

Voice assistant

Tatar voice command recognition

Speech recognition module for Tatar voice assistants or voice control systems

🚀 Wav2Vec2-Large-XLSR-53-Tatar

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Tatar, trained on the Common Voice dataset. Make sure your speech input is sampled at 16 kHz when using this model.
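
The 16 kHz requirement can be checked before transcription. The following is a minimal sketch using only the Python standard library, assuming the input is an uncompressed PCM WAV file (compressed formats such as MP3 would need an audio library instead). The function names wav_sample_rate and needs_resampling are hypothetical and are not part of the model or of any library.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the model card


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

A file for which needs_resampling returns True should be resampled to 16 kHz before it is passed to the processor.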

✨ Features

Language: Tatar
Datasets: Common Voice
Metrics: Word Error Rate (WER)
Tags: audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week

📦 Installation

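The original model card does not list installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages need to be installed. The WER metric loaded in the evaluation script additionally depends on jiwer.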

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "tt", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("crang/wav2vec2-large-xlsr-53-tatar")
model = Wav2Vec2ForCTC.from_pretrained("crang/wav2vec2-large-xlsr-53-tatar")

# Resample from 48 kHz (the source rate this script assumes) to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into a 1-D array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for each frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
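
The argmax and batch_decode steps above amount to greedy CTC decoding, which is why no separate language model is needed. As a rough illustration of the idea (not the tokenizer's actual implementation), the sketch below collapses repeated frame predictions and drops the blank token. The vocabulary and frame ids are made up for the example; the real model's token ids differ.

```python
def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blank tokens."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the prediction changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary (hypothetical): id 0 is the blank, "|" marks a word boundary.
toy_vocab = {0: "<pad>", 1: "|", 2: "с", 3: "ә", 4: "л", 5: "а", 6: "м"}

# One predicted id per audio frame, as torch.argmax would produce.
frames = [2, 2, 0, 3, 4, 4, 0, 5, 5, 6, 0, 1]

text = greedy_ctc_decode(frames, toy_vocab).replace("|", " ").strip()
print(text)  # сәлам
```

A blank between two identical ids keeps both characters, which is how a doubled letter survives the collapse step.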

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "tt", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("crang/wav2vec2-large-xlsr-53-tatar")
model = Wav2Vec2ForCTC.from_pretrained("crang/wav2vec2-large-xlsr-53-tatar")
model.to("cuda")

# Punctuation stripped from the reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!\-\u2013\u2014;:"%\\]'
# Resample from 48 kHz (the source rate this script assumes) to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: normalize the reference text and read each audio file into a 1-D array.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions greedily.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
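
The evaluation script normalizes the reference sentences before scoring: it strips a fixed set of punctuation characters and lowercases the text. That step can be tried in isolation with the sketch below. The character class is written as a raw string and is intended to match the same characters as the script's pattern. The helper name normalize_sentence and the sample sentences are illustrative only.

```python
import re

# Comma, ?, ., !, hyphen, en dash, em dash, ;, :, double quote, %, backslash.
CHARS_TO_IGNORE = r'[,?.!\-\u2013\u2014;:"%\\]'


def normalize_sentence(sentence):
    """Remove the ignored punctuation and lowercase the result."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_sentence("Сәлам, дөнья!"))  # сәлам дөнья
```

Because the model's predictions contain no punctuation, leaving it in the references would count every punctuated word as an error.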

Test Result

The Word Error Rate (WER) on the test data is 30.93%.
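
To make the figure concrete: WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. A WER of 30.93% therefore corresponds to roughly 31 word edits per 100 reference words. The sketch below is a plain-Python illustration of that definition, pooling edits over all sentence pairs. It is not the code used by the metric in the evaluation script, and the function name word_error_rate is hypothetical.

```python
def word_error_rate(references, predictions):
    """Total word-level edit distance divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        hyp_words = prediction.split()
        # Levenshtein distance over words, keeping one table row at a time.
        row = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            diagonal, row[0] = row[0], i
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                diagonal, row[j] = row[j], min(
                    row[j] + 1,       # deletion of a reference word
                    row[j - 1] + 1,   # insertion of an extra predicted word
                    diagonal + cost,  # match or substitution
                )
        total_edits += row[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate(["a b c d"], ["a x c"]))  # 0.5
```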

📚 Documentation

The Common Voice train and validation datasets were used for training.

📄 License

This model is licensed under the Apache 2.0 license.

📋 Model Information

Property	Details
Model Name	Tatar XLSR Wav2Vec2 Large 53
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 on Tatar
Training Datasets	Common Voice (train and validation sets)
Evaluation Metric	Word Error Rate (WER)
Test WER	30.93%
