Indic-seamless: Open-Source Speech-to-Text Translation Model for 13 Indian Languages

Indic Seamless

Developed by ai4bharat

A speech-to-text translation model for Indian languages, fine-tuned from SeamlessM4T-v2. It supports 13 Indian languages and outperforms both the base model and competing systems.

Release Time: 3/4/2025

Model Overview

This model specializes in speech-to-text translation (STT) for Indian languages. It was fine-tuned on the BhasaAnuvaad dataset and sets a new state of the art on the Fleurs dataset.

Model Features

Multilingual Support

Supports 13 Indian languages, covering major Indian language families.

High Performance

Sets new records on the Fleurs dataset and significantly outperforms other systems on the BhasaAnuvaad test set.

Strict Data Filtering

Before training, the dataset was filtered with thresholds on alignment score (0.8) and mining score (0.6).

Model Capabilities

Speech-to-text translation

Multilingual speech recognition

Batch audio processing

Use Cases

Speech Transcription

Single Audio Transcription

Transcribe a single audio file into text in a specified Indian language

Higher accuracy than the base model and competing systems

Batch Processing

Dataset Batch Transcription

Batch transcription processing for speech datasets like Fleurs

Supports batch processing with high efficiency

🚀 IndicSeamless for Speech-to-Text Translation

IndicSeamless is a high-performance speech-to-text translation model for Indian languages.

📚 Documentation

Model Overview

This repository hosts the IndicSeamless model: SeamlessM4T-v2 fine-tuned on the BhasaAnuvaad dataset for speech-to-text translation (STT) across Indian languages. Before training, the dataset was filtered using the following thresholds:

Alignment Score: 0.8
Mining Score: 0.6
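
The threshold filtering described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the field names alignment_score and mining_score, the inclusive comparison, and the rule that a record is checked only against the scores it actually carries are all assumptions made for the example.

```python
# Thresholds taken from the model card.
THRESHOLDS = {
    "alignment_score": 0.8,  # hypothetical field name
    "mining_score": 0.6,     # hypothetical field name
}


def passes_filter(record: dict) -> bool:
    """Return True if every score present on the record meets its threshold.

    Assumption: a record may carry only one of the two scores, and a missing
    score is not treated as a failure.
    """
    for key, minimum in THRESHOLDS.items():
        score = record.get(key)
        if score is not None and score < minimum:
            return False
    return True


# Made-up records, used only to exercise the filter.
samples = [
    {"id": "a", "alignment_score": 0.91},
    {"id": "b", "alignment_score": 0.75},
    {"id": "c", "mining_score": 0.65},
    {"id": "d", "mining_score": 0.40},
]

kept = [r for r in samples if passes_filter(r)]
print([r["id"] for r in kept])  # ['a', 'c']
```

Whether the original filtering used a strict or inclusive comparison is not stated on this page, so the boundary behaviour here is a guess.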

Performance Highlights

The model outperforms the base SeamlessM4T-v2 model and all competing STT systems, including cascaded approaches.
It achieves a new SOTA on Fleurs and significantly surpasses all other systems on the BhasaAnuvaad test set, which includes a diverse range of data from new domains.

Model Information

Property	Details
Library Name	transformers
Datasets	ai4bharat/NPTEL, ai4bharat/IndicVoices-ST, ai4bharat/WordProject, ai4bharat/Spoken-Tutorial, ai4bharat/Mann-ki-Baat, ai4bharat/Vanipedia, ai4bharat/UGCE-Resources
Pipeline Tag	automatic-speech-recognition
Languages	en, as, bn, gu, hi, ta, te, ur, kn, ml, mr, sd, ne

📦 Installation

Ensure you have the required dependencies installed:

pip install torch torchaudio transformers datasets

💻 Usage Examples

Basic Usage

Loading the Model

import torchaudio
from transformers import SeamlessM4Tv2ForSpeechToText
from transformers import SeamlessM4TTokenizer, SeamlessM4TFeatureExtractor

model = SeamlessM4Tv2ForSpeechToText.from_pretrained("ai4bharat/indic-seamless").to("cuda")
processor = SeamlessM4TFeatureExtractor.from_pretrained("ai4bharat/indic-seamless")
tokenizer = SeamlessM4TTokenizer.from_pretrained("ai4bharat/indic-seamless")

Single Audio Inference

audio, orig_freq = torchaudio.load("../10002398547238927970.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")

text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
print(tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True))
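
The generate call above passes tgt_lang="hin", a three-letter code, whereas the Model Information table lists two-letter codes. A small lookup can bridge the two. Only "hin" is confirmed by this page. The other three-letter codes below are assumptions taken from the SeamlessM4T language list (note npi for Nepali and snd for Sindhi) and should be checked against the model's tokenizer before use. ISO_TO_TGT_LANG and to_tgt_lang are hypothetical helper names, not part of transformers.

```python
# Two-letter codes from the Model Information table mapped to the
# three-letter codes assumed to be accepted as tgt_lang.
# Only "hi" -> "hin" is confirmed by the usage examples on this page.
ISO_TO_TGT_LANG = {
    "en": "eng", "as": "asm", "bn": "ben", "gu": "guj", "hi": "hin",
    "ta": "tam", "te": "tel", "ur": "urd", "kn": "kan", "ml": "mal",
    "mr": "mar", "sd": "snd", "ne": "npi",
}


def to_tgt_lang(code: str) -> str:
    """Translate a two-letter language code into a tgt_lang value."""
    key = code.strip().lower()
    if key not in ISO_TO_TGT_LANG:
        raise ValueError(f"Unsupported language code: {code!r}")
    return ISO_TO_TGT_LANG[key]


print(to_tgt_lang("hi"))  # hin
```

The result could then be passed as model.generate(**audio_inputs, tgt_lang=to_tgt_lang("hi")).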

Advanced Usage

Inference on Fleurs Dataset

from datasets import load_dataset

dataset = load_dataset("google/fleurs", "hi_in", split="test")

def process_audio(example):
    audio = example["audio"]["array"]
    audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")
    text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
    return {"predicted_text": tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True)}

dataset = dataset.map(process_audio)
dataset = dataset.remove_columns(["audio"])
dataset.to_csv("fleurs_hi_predictions.csv")

Batch Translation using Fleurs

from datasets import load_dataset

def process_batch(batch, tgt_lang="hin"):
    # Collect the raw waveform of every example in the batch
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    # padding=True pads each clip to the longest one so the batch forms a single tensor
    audio_inputs = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding=True).to("cuda")
    text_outs = model.generate(**audio_inputs, tgt_lang=tgt_lang)
    batch["predicted_text"] = [tokenizer.decode(text_out.cpu().numpy().squeeze(), clean_up_tokenization_spaces=True, skip_special_tokens=True) for text_out in text_outs]
    return batch

def batch_translate(language_code="hi_in", tgt_lang="hin"):
    # language_code selects the Fleurs config (the audio); tgt_lang selects the output text language
    dataset = load_dataset("google/fleurs", language_code, split="test")
    # fn_kwargs forwards tgt_lang to process_batch so the argument is actually used
    dataset = dataset.map(process_batch, batched=True, batch_size=8, fn_kwargs={"tgt_lang": tgt_lang})
    return dataset["predicted_text"]

# Example usage
fleurs_config = "hi_in"
translations = batch_translate(fleurs_config, tgt_lang="hin")
print(translations)
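
Because the batch example pads every clip to the longest one in its batch, grouping clips of similar length can reduce wasted padding. The sketch below shows that general idea using only the standard library. It is not part of the original model card, and length_sorted_batches is a hypothetical helper name. It returns lists of example indices, not audio.

```python
def length_sorted_batches(lengths, batch_size=8):
    """Group example indices into batches of similar length.

    lengths[i] is the length (e.g. number of samples) of example i.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Indices ordered from shortest to longest example
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    # Slice the ordered indices into consecutive batches
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Toy lengths, used only to show the grouping.
batches = length_sorted_batches([5, 1, 4, 2, 3], batch_size=2)
print(batches)  # [[1, 3], [4, 2], [0]]
```

With a real dataset, the lengths could be computed as len(example["audio"]["array"]) for each example, and each index list used to pick the examples for one processor call.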

📖 Citation

If you use BhasaAnuvaad in your work, please cite us:

@misc{jain2024bhasaanuvaadspeechtranslationdataset,
      title={BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages}, 
      author={Sparsh Jain and Ashwin Sankar and Devilal Choudhary and Dhairya Suman and Nikhil Narasimhan and Mohammed Safi Ur Rahman Khan and Anoop Kunchukuttan and Mitesh M Khapra and Raj Dabre},
      year={2024},
      eprint={2411.04699},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.04699}, 
}

📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
