The open-source model of wav2vec2-large-xlsr-53-telugu - Empowering accurate Telugu speech recognition

Wav2vec2 Large Xlsr 53 Telugu

Developed by anuragshas

A Telugu speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the OpenSLR SLR66 dataset

Speech Recognition | Other | Open Source License: Apache-2.0 | #Telugu speech recognition #Low-resource language ASR #Wav2Vec2 fine-tuning

Downloads 44.24k

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Telugu, based on the Wav2Vec2 architecture, suitable for converting Telugu speech into text.

Model Features

Dedicated to Telugu

A speech recognition model specifically optimized for Telugu

Based on the XLSR pre-trained model

Utilizes the pre-trained knowledge of large-scale cross-lingual speech representation learning (XLSR)

No need for a language model

Can be used directly without additional language model support

Model Capabilities

Telugu speech recognition

16kHz audio processing

Use Cases

Speech-to-text

Telugu speech transcription

Convert Telugu speech content into text

Achieved a 44.98% WER on the OpenSLR test set

🚀 Wav2Vec2-Large-XLSR-53-Telugu

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Telugu using the OpenSLR SLR66 dataset, aiming to solve the problem of automatic speech recognition in Telugu.

📦 Model Information

Property	Details
Language	Telugu
Datasets	openslr
Metrics	wer
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

📊 Model Results

Task	Dataset	Metrics
Speech Recognition (automatic-speech-recognition)	OpenSLR te (openslr, args: te)	Test WER: 44.98

🚀 Quick Start

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Telugu using the OpenSLR SLR66 dataset. When using it, make sure that your speech input is sampled at 16 kHz.
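Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper name `needs_resampling` and the constant `TARGET_RATE` are illustrative and not part of the model's API, and the check only covers uncompressed WAV files.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True if the WAV file at wav_path is not sampled at target_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

A file for which this returns True should be resampled before it is passed to the processor, for example with `torchaudio.transforms.Resample` as in the usage examples.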

✨ Features

Fine-tuned on the Telugu language using the OpenSLR SLR66 dataset.
Can be used directly for automatic speech recognition without a language model.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
import pandas as pd
from datasets import Dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# The evaluation notebook contains the procedure to download the data.
df = pd.read_csv("/content/te/test.tsv", sep="\t")
df["path"] = "/content/te/clips/" + df["path"]
test_dataset = Dataset.from_pandas(df)
processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
# Resample from 48 kHz (the source rate hardcoded here) to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocess the dataset: read each audio file into a resampled array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Pad the first two examples to a common length and build the attention mask.
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Greedy decoding: take the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
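The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the best token per frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id of 0, and the `|` word delimiter are assumptions made for illustration. The real tokenizer works on Telugu characters and handles further details.

```python
def greedy_ctc_decode(token_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the remaining tokens."""
    decoded = []
    previous = None
    for token_id in token_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_token[token_id])
        previous = token_id
    # The word delimiter token marks word boundaries in the output.
    return "".join(decoded).replace(word_delimiter, " ").strip()


# Toy vocabulary, purely illustrative (hypothetical ids and tokens).
vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0, 0, 2]
print(greedy_ctc_decode(frame_ids, vocab))  # prints "aab ba"
```

Note that the blank between the first two `a` frames is what lets the same character appear twice in a row in the output.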

Advanced Usage

import re
import pandas as pd
import torch
import torchaudio
from datasets import Dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# The evaluation notebook contains the procedure to download the data.
df = pd.read_csv("/content/te/test.tsv", sep="\t")
df["path"] = "/content/te/clips/" + df["path"]
test_dataset = Dataset.from_pandas(df)
# Note: load_metric is deprecated in recent releases of datasets; evaluate.load("wer") from the evaluate package is its replacement.
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
# Inference below assumes a CUDA-capable GPU is available.
model.to("cuda")
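# Punctuation stripped from the reference transcripts before scoring (raw string avoids invalid-escape warnings).
chars_to_ignore_regex = r'[\,\?\.\!\-\_\;\:\"\“\%\‘\”\।\’\'\&]'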
resampler = torchaudio.transforms.Resample(48_000, 16_000)
def normalizer(text):
    # Use your custom normalizer
    text = text.replace("\\n","\n")
    text = ' '.join(text.split())
    text = re.sub(r'''([a-z]+)''','',text,flags=re.IGNORECASE)
    text = re.sub(r'''%'''," శాతం ", text)
    text = re.sub(r'''(/|-|_)'''," ", text)
    text = re.sub("ై","ై", text)
    text = text.strip()
    return text
def speech_file_to_array_fn(batch):
    batch["sentence"] = normalizer(batch["sentence"])
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()+ " "
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Run batched inference on the preprocessed dataset and store the decoded predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

⚠️ Important Note

The test result shows a WER (Word Error Rate) of 44.98%.
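For readers unfamiliar with the metric, WER is the word-level edit distance between reference and prediction divided by the number of reference words. The sketch below computes it for a single sentence pair in plain Python. The function name `word_error_rate` is illustrative, and this is not the metric implementation used in the evaluation code above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


print(word_error_rate("a b c d", "a b"))  # prints 0.5 (two deletions over four words)
```

Because insertions count as errors, WER can exceed 100% when the prediction is much longer than the reference.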

🔧 Technical Details

70% of the OpenSLR Telugu dataset was used for training.

Train Split of annotations is here
Test Split of annotations is here
Training Data Preparation notebook can be found here
Training notebook can be found here
Evaluation notebook is here
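A 70/30 split of the annotation rows could look like the sketch below. This is only an illustration under assumptions: the actual procedure is in the data preparation notebook, and the function name `split_rows`, the random shuffle, and the seed value are hypothetical choices made here.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = split_rows(range(10))
print(len(train_rows), len(test_rows))  # prints "7 3"
```

Fixing the seed matters for evaluation: the reported WER is only comparable across runs if the same rows are held out each time.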

📄 License

This model is licensed under the apache-2.0 license.
