IndicSeamless for Speech-to-Text Translation
IndicSeamless is a model for speech-to-text translation across Indian languages.
Documentation
Model Overview
This repository hosts the IndicSeamless model, a SeamlessM4T-v2 model fine-tuned on the BhasaAnuvaad dataset for speech-to-text translation (STT) across Indian languages. Before training, the dataset was filtered using the following thresholds:
- Alignment Score: 0.8
- Mining Score: 0.6
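As a rough illustration of this filtering step, the sketch below keeps only samples whose scores meet both thresholds. The record layout and the field names `alignment_score` and `mining_score` are hypothetical, and treating the thresholds as inclusive lower bounds is an assumption; the actual BhasaAnuvaad filtering pipeline may differ.

```python
# Thresholds quoted in the model overview above.
ALIGNMENT_THRESHOLD = 0.8
MINING_THRESHOLD = 0.6


def passes_filter(sample):
    """Return True if a sample meets both score thresholds.

    Assumption: thresholds are inclusive lower bounds, and the field
    names are hypothetical stand-ins for the real dataset columns.
    """
    return (
        sample["alignment_score"] >= ALIGNMENT_THRESHOLD
        and sample["mining_score"] >= MINING_THRESHOLD
    )


def filter_samples(samples):
    """Keep only the samples that pass both thresholds."""
    return [s for s in samples if passes_filter(s)]


# Toy records for illustration only; these are not real dataset rows.
toy_samples = [
    {"id": "a", "alignment_score": 0.91, "mining_score": 0.72},
    {"id": "b", "alignment_score": 0.75, "mining_score": 0.90},
    {"id": "c", "alignment_score": 0.85, "mining_score": 0.55},
]
kept_ids = [s["id"] for s in filter_samples(toy_samples)]
print(kept_ids)
```

In this toy data only sample `a` clears both thresholds; `b` fails on alignment and `c` fails on mining score.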
Performance Highlights
- The model outperforms the base SeamlessM4T-v2 model and all competing STT systems, including cascaded approaches.
- It achieves new state-of-the-art results on Fleurs and significantly surpasses all other systems on the BhasaAnuvaad test set, which includes a diverse range of data from new domains.
Model Information
| Property | Details |
|---|---|
| Library Name | transformers |
| Datasets | ai4bharat/NPTEL, ai4bharat/IndicVoices-ST, ai4bharat/WordProject, ai4bharat/Spoken-Tutorial, ai4bharat/Mann-ki-Baat, ai4bharat/Vanipedia, ai4bharat/UGCE-Resources |
| Pipeline Tag | automatic-speech-recognition |
| Languages | en, as, bn, gu, hi, ta, te, ur, kn, ml, mr, sd, ne |
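The usage examples in this README pass a three-letter code such as `hin` to `tgt_lang`, while the table above lists two-letter codes. The mapping below is a sketch of how the two might be related. The three-letter codes are assumed to follow the SeamlessM4T naming and should be verified against the languages the tokenizer actually supports; the helper name `to_tgt_lang` is hypothetical.

```python
# Assumed mapping from the two-letter codes in the table to the
# three-letter codes expected by `tgt_lang`. Verify against the
# tokenizer before relying on it.
TGT_LANG_CODES = {
    "en": "eng",  # English
    "as": "asm",  # Assamese
    "bn": "ben",  # Bengali
    "gu": "guj",  # Gujarati
    "hi": "hin",  # Hindi
    "ta": "tam",  # Tamil
    "te": "tel",  # Telugu
    "ur": "urd",  # Urdu
    "kn": "kan",  # Kannada
    "ml": "mal",  # Malayalam
    "mr": "mar",  # Marathi
    "sd": "snd",  # Sindhi
    "ne": "npi",  # Nepali
}


def to_tgt_lang(two_letter_code):
    """Translate a two-letter language code into a `tgt_lang` value."""
    try:
        return TGT_LANG_CODES[two_letter_code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {two_letter_code!r}")


print(to_tgt_lang("hi"))
```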
Installation
Ensure you have the required dependencies installed:
pip install torch torchaudio transformers datasets
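To confirm that the installation worked, a quick check along the following lines may help. It is only a sketch using the standard library: the helper name `find_missing` is hypothetical, and the check confirms that each package can be located, not that its version is compatible.

```python
import importlib.util

# Packages named in the pip command above.
REQUIRED_PACKAGES = ["torch", "torchaudio", "transformers", "datasets"]


def find_missing(packages):
    """Return the names of packages that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```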
Usage Examples
Basic Usage
Loading the Model
import torchaudio
from transformers import SeamlessM4Tv2ForSpeechToText
from transformers import SeamlessM4TTokenizer, SeamlessM4TFeatureExtractor
model = SeamlessM4Tv2ForSpeechToText.from_pretrained("ai4bharat/indic-seamless").to("cuda")
processor = SeamlessM4TFeatureExtractor.from_pretrained("ai4bharat/indic-seamless")
tokenizer = SeamlessM4TTokenizer.from_pretrained("ai4bharat/indic-seamless")
Single Audio Inference
audio, orig_freq = torchaudio.load("../10002398547238927970.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")
text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
print(tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True))
Advanced Usage
Inference on Fleurs Dataset
from datasets import load_dataset
dataset = load_dataset("google/fleurs", "hi_in", split="test")
def process_audio(example):
    # Fleurs audio is already sampled at 16 kHz, so no resampling is needed
    audio = example["audio"]["array"]
    audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")
    text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
    return {"predicted_text": tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True)}
dataset = dataset.map(process_audio)
dataset = dataset.remove_columns(["audio"])
dataset.to_csv("fleurs_hi_predictions.csv")
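Once the predictions are saved, it can be useful to read them back next to the reference text. The sketch below uses only the standard library. It assumes the CSV contains a `transcription` column from Fleurs alongside the `predicted_text` column added above; the helper name `load_prediction_pairs` is hypothetical, and the column names should be checked against the actual file.

```python
import csv
import os


def load_prediction_pairs(csv_path,
                          reference_column="transcription",
                          prediction_column="predicted_text"):
    """Read (reference, prediction) pairs from a predictions CSV.

    The default column names are assumptions about the exported file.
    """
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            pairs.append((row[reference_column], row[prediction_column]))
    return pairs


# Only runs if the file from the example above exists in this directory.
if os.path.exists("fleurs_hi_predictions.csv"):
    for reference, prediction in load_prediction_pairs("fleurs_hi_predictions.csv")[:5]:
        print("REF:", reference)
        print("HYP:", prediction)
```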
Batch Translation using Fleurs
from datasets import load_dataset
import torch
def process_batch(batch, tgt_lang="hin"):
    # Collect the raw waveforms for every example in the batch
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    audio_inputs = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding=True).to("cuda")
    text_outs = model.generate(**audio_inputs, tgt_lang=tgt_lang)
    batch["predicted_text"] = [tokenizer.decode(text_out.cpu().numpy().squeeze(), clean_up_tokenization_spaces=True, skip_special_tokens=True) for text_out in text_outs]
    return batch
def batch_translate(language_code="hi_in", tgt_lang="hin"):
    dataset = load_dataset("google/fleurs", language_code, split="test")
    # Forward tgt_lang to process_batch so the argument is actually used
    dataset = dataset.map(process_batch, batched=True, batch_size=8, fn_kwargs={"tgt_lang": tgt_lang})
    return dataset["predicted_text"]
source_config = "hi_in"  # Fleurs configuration of the source audio
translations = batch_translate(source_config, tgt_lang="hin")
print(translations)
Citation
If you use BhasaAnuvaad in your work, please cite us:
@misc{jain2024bhasaanuvaadspeechtranslationdataset,
title={BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages},
author={Sparsh Jain and Ashwin Sankar and Devilal Choudhary and Dhairya Suman and Nikhil Narasimhan and Mohammed Safi Ur Rahman Khan and Anoop Kunchukuttan and Mitesh M Khapra and Raj Dabre},
year={2024},
eprint={2411.04699},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.04699},
}
License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.