Whisper-tamil-medium Open-source Model - A Practical Tool for Free Tamil Speech Recognition

Whisper Tamil Medium

Developed by vasista22

A Whisper-medium model fine-tuned on multiple public Tamil ASR corpora, supporting Tamil speech recognition

Speech Recognition

Transformers

License: Apache-2.0

Tags: Tamil speech recognition, Multi-corpus fine-tuning, Low-resource optimization

Downloads 1,731

Release date: 12/21/2022

Model Overview

This model is a version of openai/whisper-medium fine-tuned for Tamil. It is specifically designed for Tamil speech recognition tasks and is part of the Whisper fine-tuning sprint project.

Model Features

Multi-corpus fine-tuning

Trained on six public Tamil ASR corpora

Efficient inference support

Can be run with whisper-jax for fast batched inference

Complete evaluation scheme

Evaluation code for multiple test sets is available

Model Capabilities

Tamil speech recognition

Long audio processing (support for chunking)

Batch inference

Use Cases

Speech transcription

Tamil meeting records

Convert Tamil meeting recordings into text records

Educational content transcription

Transcribe Tamil teaching audio content

🚀 Whisper Tamil Medium

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on Tamil data from multiple public ASR corpora to improve automatic speech recognition in Tamil.

🚀 Quick Start

The model was fine-tuned as part of the Whisper fine-tuning sprint.

NOTE: The code used to train this model is available for reuse in the whisper-finetune repository.

✨ Features

Fine-tuned on multiple publicly available Tamil ASR corpora.
Training and evaluation code is open source and reusable.
Supports faster inference with whisper-jax.

📦 Installation

The original README does not give explicit installation steps. The required libraries and dependencies can be installed by following the requirements listed in the whisper-finetune repository.

💻 Usage Examples

Basic Usage

To transcribe a single audio file with this model, use the following snippet:

>>> import torch
>>> from transformers import pipeline

>>> # path to the audio file to be transcribed
>>> audio = "/path/to/audio.format"
>>> # use the first GPU if one is available, otherwise fall back to the CPU
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> # chunk_length_s=30 lets the pipeline split long recordings into 30-second chunks
>>> transcribe = pipeline(task="automatic-speech-recognition", model="vasista22/whisper-tamil-medium", chunk_length_s=30, device=device)
>>> # force Tamil ("ta") transcription rather than language detection or translation
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="ta", task="transcribe")

>>> print('Transcription: ', transcribe(audio)["text"])
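The `chunk_length_s=30` argument in the snippet above is what allows recordings longer than 30 seconds to be transcribed. As a rough illustration of the idea, the sketch below computes overlapping sample windows over an audio signal. It is a simplified stand-in, not the pipeline's actual code: the function name `chunk_windows` is hypothetical, the 16 kHz sampling rate is the rate Whisper models expect, and the 5-second stride assumes the pipeline's documented default of `chunk_length_s / 6` on each side.

```python
def chunk_windows(num_samples, sampling_rate=16000,
                  chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping windows.

    Hypothetical helper for illustration only; the real pipeline also
    merges the overlapping transcriptions, which is not shown here.
    """
    chunk_len = int(chunk_length_s * sampling_rate)
    stride = int(stride_length_s * sampling_rate)
    # Each new window starts before the previous one ends, so that
    # neighbouring windows share `stride` samples on both sides.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("stride is too large for the chosen chunk length")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end >= num_samples:
            break
        start += step
    return windows


# A 70-second recording at 16 kHz is covered by three overlapping windows.
for start, end in chunk_windows(70 * 16000):
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

The overlap gives the model context on both sides of every cut, which is why chunked transcription usually does not lose words at chunk boundaries.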

Advanced Usage

For faster inference with Whisper models, the whisper-jax library can be used. Follow the installation steps in the whisper-jax repository before running the following snippet:

>>> # note: "FlaxWhisperPipline" (sic) is the class name as spelled in whisper-jax
>>> from whisper_jax import FlaxWhisperPipline

>>> # path to the audio file to be transcribed
>>> audio = "/path/to/audio.format"

>>> # batch_size=16 lets whisper-jax transcribe several audio chunks in parallel
>>> transcribe = FlaxWhisperPipline("vasista22/whisper-tamil-medium", batch_size=16)
>>> # force Tamil ("ta") transcription rather than language detection or translation
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="ta", task="transcribe")

>>> print('Transcription: ', transcribe(audio)["text"])

📚 Documentation

To evaluate this model on an entire dataset, use the evaluation code in the whisper-finetune repository. The same repository also provides scripts for faster inference with whisper-jax.
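ASR systems are commonly scored by word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a minimal, dependency-free version of that metric for a single utterance. The function name `word_error_rate` is hypothetical, and the repository's evaluation scripts may use a different implementation or apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("நான் பள்ளிக்கு செல்கிறேன்", "நான் பள்ளி செல்கிறேன்"))
```

For a whole test set, WER is usually computed by summing the edit counts and reference lengths over all utterances rather than averaging per-utterance scores.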

🔧 Technical Details

Training and evaluation data

Property	Details
Training Data	IISc-MILE Tamil ASR Corpus, [ULCA ASR Corpus](https://github.com/Open-Speech-EkStep/ULCA-asr-dataset-corpus#tamil-labelled--total-duration-is-116024-hours), Shrutilipi ASR Corpus, [Microsoft Speech Corpus (Indian Languages)](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e), Google/Fleurs Train+Dev set, Babel ASR Corpus
Evaluation Data	[Microsoft Speech Corpus (Indian Languages) Test Set](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e), Google/Fleurs Test Set, IISc-MILE Test Set, Babel Test Set

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 24
eval_batch_size: 48
seed: 22
optimizer: adamw_bnb_8bit
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 17500
training_steps: 33892 (Initially set to 84730 steps)
mixed_precision_training: True
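To make the interaction between `learning_rate`, `lr_scheduler_warmup_steps` and `training_steps` concrete, the sketch below evaluates a linear warmup followed by a linear decay to zero, which is how a `linear` scheduler is typically defined. The function name `linear_schedule_lr` is hypothetical. It also assumes that the schedule was defined over the initially configured 84730 steps and that training simply stopped at step 33892; the source does not confirm this.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=17500,
                       total_steps=84730):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    total_steps=84730 is an assumption (the initially configured value).
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 8750, 17500, 33892):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate would still have been at roughly three quarters of its peak when training ended at step 33892.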

📄 License

This project is licensed under the Apache-2.0 license.

Acknowledgement

This work was done at Speech Lab, IIT Madras. The compute resources for this work were funded by the "Bhashini: National Language translation Mission" project of the Ministry of Electronics and Information Technology (MeitY), Government of India.
