Whisper-Telugu-Medium Open-Source Speech Recognition Model - Accurately Identify Telugu Speech Content

Whisper Telugu Medium

Developed by vasista22

Telugu speech recognition model fine-tuned from OpenAI Whisper-medium on multiple public Telugu ASR datasets

Category: Speech Recognition. Open Source License: Apache-2.0. Tags: #Telugu speech recognition #Low word error rate #Multi-dataset fine-tuning

Release Time : 12/20/2022

Model Overview

This model is an automatic speech recognition (ASR) model fine-tuned for Telugu; it transcribes Telugu speech into text.

Model Features

Multi-dataset training

Combines multiple public Telugu ASR datasets, including CSTD IIIT-H, ULCA, and Shrutilipi

High performance

Achieves a word error rate (WER) of 9.47% on the Fleurs test set

Efficient inference support

Provides two inference solutions: standard transformers and whisper-jax, supporting GPU acceleration

Model Capabilities

Telugu speech recognition

Long audio processing (supports chunk processing)

Multi-scenario speech transcription

Use Cases

Speech transcription

Meeting minutes

Convert Telugu meeting recordings into text records

Highly accurate transcribed text

Media subtitle generation

Automatically generate subtitles for Telugu video content

Synchronized and accurate text subtitles

Voice assistant

Telugu voice interaction

Build voice assistant applications supporting Telugu

Natural and smooth voice interaction experience

🚀 Whisper Telugu Medium

This model is a fine-tuned version of openai/whisper-medium on Telugu data from multiple publicly available ASR corpora. It was fine-tuned as part of the Whisper fine-tuning sprint and offers high-quality automatic speech recognition for the Telugu language.

NOTE: The code for training this model can be reused from the whisper-finetune repository.

🚀 Quick Start

💻 Usage Examples

Basic Usage

>>> import torch
>>> from transformers import pipeline

>>> # path to the audio file to be transcribed
>>> audio = "/path/to/audio.format"
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> transcribe = pipeline(task="automatic-speech-recognition", model="vasista22/whisper-telugu-medium", chunk_length_s=30, device=device)
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="te", task="transcribe")

>>> print('Transcription: ', transcribe(audio)["text"])
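
The `chunk_length_s=30` argument above makes the pipeline process long recordings in overlapping windows. The sketch below only illustrates how such windows can be laid out over a recording; it is not the library's implementation. The helper name `chunk_spans` is hypothetical, and the overlap of one sixth of the chunk length on each side is an assumption chosen for illustration.

```python
def chunk_spans(duration_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) times in seconds of overlapping chunks covering the audio.

    Hypothetical helper for illustration; the overlap default is an assumption.
    """
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed overlap on each side
    # Each new chunk starts this far after the previous one.
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")

    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start += step
    return spans


# A 70-second recording is covered by three overlapping 30-second windows.
for start, end in chunk_spans(70.0):
    print(f"{start:5.1f}s to {end:5.1f}s")
```

The overlap gives each window some context on both sides, so words that straddle a boundary appear whole in at least one chunk.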

Advanced Usage

>>> import jax.numpy as jnp
>>> from whisper_jax import FlaxWhisperForConditionalGeneration, FlaxWhisperPipline

>>> # path to the audio file to be transcribed
>>> audio = "/path/to/audio.format"

>>> transcribe = FlaxWhisperPipline("vasista22/whisper-telugu-medium", batch_size=16)
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="te", task="transcribe")

>>> print('Transcription: ', transcribe(audio)["text"])

To evaluate this model on an entire dataset, you can use the evaluation code in the whisper-finetune repository. The same repository also has scripts for faster inference using whisper-jax.

For faster inference of Whisper models, you can use the whisper-jax library. Follow its installation steps before running the advanced code snippet above.

📦 Installation

To use the model, install transformers and torch, plus whisper-jax if you want faster inference. You can install them via pip:

pip install transformers torch

If you want to use whisper-jax for faster inference:

pip install git+https://github.com/sanchit-gandhi/whisper-jax.git
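
After installing, a quick check can confirm that the modules imported by the usage examples are discoverable. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the module names are taken from the import statements in the snippets above.

```python
import importlib.util


def missing_packages(names=("torch", "transformers", "whisper_jax")):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Only `whisper_jax` is optional here; the basic usage example needs just `torch` and `transformers`.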

📚 Documentation

Training and evaluation data

Training Data

CSTD IIIT-H ASR Corpus
[ULCA ASR Corpus](https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus#telugu-labelled-total-duration-is-102593-hours)
Shrutilipi ASR Corpus
[Microsoft Speech Corpus (Indian Languages)](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e)
Google/Fleurs Train+Dev set
Babel ASR Corpus

Evaluation Data

[Microsoft Speech Corpus (Indian Languages) Test Set](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e)
Google/Fleurs Test Set
OpenSLR
Babel Test Set

Training hyperparameters

Property	Details
learning_rate	1e-05
train_batch_size	24
eval_batch_size	48
seed	22
optimizer	adamw_bnb_8bit
lr_scheduler_type	linear
lr_scheduler_warmup_steps	15000
training_steps	35808 (terminated upon convergence; initially set to 89520 steps)
mixed_precision_training	True
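
The scheduler settings in the table (linear type, 15000 warmup steps, peak learning rate 1e-05) can be sketched as a plain function. This is an illustrative reimplementation of linear warmup followed by linear decay, not the training code itself. It assumes the decay was planned over the initially configured 89520 steps, which the table does not state explicitly; the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=15000, total_steps=89520):
    """Learning rate at `step` for linear warmup followed by linear decay to zero.

    Assumption: decay is computed against the initially planned total_steps.
    """
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 7500, 15000, 35808, 89520):
    print(step, linear_schedule_lr(step))
```

Under that assumption, stopping at step 35808 would mean training ended with the learning rate still at roughly 72% of its peak.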

Acknowledgement

This work was done at Speech Lab, IIT Madras. The compute resources were funded by the "Bhashini: National Language Translation Mission" project of the Ministry of Electronics and Information Technology (MeitY), Government of India.

📄 License

This model is released under the Apache-2.0 license.

🔍 Model Index

Property	Details
Model Name	Whisper Telugu Medium - Vasista Sai Lodagala
Task Type	Automatic Speech Recognition
Dataset	google/fleurs (te_in split, test set)
Metric (WER)	9.47
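
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal standalone version that splits on whitespace; the function name `word_error_rate` is hypothetical, and the evaluation scripts behind the reported figure may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

On this scale, the 9.47 in the table corresponds to a returned value of about 0.0947, averaged over the test set.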
