🚀 AudioX: Multilingual Speech-to-Text Model
AudioX is a cutting-edge Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It consists of two specialized variants, AudioX-North and AudioX-South, each optimized for specific Indian languages to enhance accuracy. AudioX-North supports Hindi, Gujarati, and Marathi, while AudioX-South covers Tamil, Telugu, Kannada, and Malayalam. Trained on a combination of open-source ASR datasets and proprietary audio, AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance in the supported languages.
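Picking the variant that covers your target language might look like the sketch below. Only `jiviai/audioX-north-v1` appears in the usage example further down; the south repository id is assumed by analogy and may differ, so check the Jivi AI hub page for the exact name.

```python
# Hypothetical helper for routing an ISO 639-1 language code to an AudioX variant.
# "jiviai/audioX-north-v1" is taken from the usage example below;
# "jiviai/audioX-south-v1" is an assumed id and may not match the real repo.
AUDIOX_VARIANTS = {
    "hi": "jiviai/audioX-north-v1",  # Hindi
    "gu": "jiviai/audioX-north-v1",  # Gujarati
    "mr": "jiviai/audioX-north-v1",  # Marathi
    "ta": "jiviai/audioX-south-v1",  # Tamil (assumed repo id)
    "te": "jiviai/audioX-south-v1",  # Telugu (assumed repo id)
    "kn": "jiviai/audioX-south-v1",  # Kannada (assumed repo id)
    "ml": "jiviai/audioX-south-v1",  # Malayalam (assumed repo id)
}

def variant_for(language: str) -> str:
    """Return the AudioX checkpoint covering the given language code."""
    return AUDIOX_VARIANTS[language]
```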

✨ Features
Purpose-Built for Indian Languages
AudioX is designed to handle diverse Indian language inputs, making it suitable for real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.
Training Process
AudioX is fine-tuned using supervised learning on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.
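The exact training recipe is not published, but noise augmentation of the kind described is commonly implemented by mixing background noise into clean waveforms at a random signal-to-noise ratio. A minimal sketch, assuming NumPy waveforms at a fixed sampling rate (illustrative only, not Jivi AI's actual pipeline):

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise clip to match the speech length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Scale noise so that 10 * log10(P_clean / P_noise) == snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: augment at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)  # stand-in for real audio
babble = rng.standard_normal(8000).astype(np.float32)   # stand-in for real noise
augmented = add_noise(speech, babble, snr_db=rng.uniform(5, 20))
```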
Data Preparation
The model is trained on:
- Open-source multilingual ASR corpora
- Proprietary Indian-language medical datasets

This hybrid approach boosts the model’s generalization across dialects and acoustic conditions (see the sketch below).
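Mixing corpora with language balancing can be approximated with 🤗 `datasets`. The dataset ids and sampling probabilities below are purely hypothetical placeholders, not the actual training mix:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical dataset ids standing in for the open-source and proprietary corpora.
hi = load_dataset("open-source/asr-hindi", split="train", streaming=True)
gu = load_dataset("open-source/asr-gujarati", split="train", streaming=True)
med = load_dataset("jiviai/medical-audio", split="train", streaming=True)  # assumed, not public

# Language balancing: sample each source with a fixed probability instead of
# letting the largest corpus dominate the stream.
balanced = interleave_datasets([hi, gu, med], probabilities=[0.4, 0.3, 0.3], seed=42)
```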
Benchmarks
AudioX achieves top performance across multiple Indian languages, outperforming both open and commercial ASR models. We evaluated AudioX on the Vistaar benchmark using the official evaluation script from AI4Bharat’s Vistaar suite, ensuring a rigorous, standardized comparison. All numbers below are word error rates (WER, %); lower is better.
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|---|---|---|---|---|---|---|---|---|---|
| Jivi AI | AudioX | 12.14 | 18.66 | 18.68 | 21.79 | 24.63 | 17.61 | 26.92 | 20.1 |
| ElevenLabs | Scribe-v1 | 13.64 | 17.96 | 16.51 | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
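For reference, WER numbers of this kind can be computed locally with the `jiwer` package. Note this is a generic sketch: the Vistaar suite applies its own text normalization, so raw `jiwer` scores may differ slightly from the table above.

```python
import jiwer

reference = "मरीज़ को बुखार है"   # ground-truth transcript
hypothesis = "मरीज को बुखार है"   # model output
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```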
📦 Installation
AudioX checkpoints load directly through 🤗 `transformers`; the usage example below additionally needs `librosa` for audio loading and PyTorch as the backend. A typical setup (assuming pip) is:
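```bash
pip install transformers librosa torch
```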
💻 Usage Examples
Basic Usage
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

device = "cuda"

# Load the processor and model, and move the model to the GPU.
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load the audio and resample to the 16 kHz rate the model expects.
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

# Extract log-mel input features and run generation for Hindi transcription.
input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").to(device).input_features
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode the predicted token ids back to text.
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
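Since AudioX loads through the standard Whisper classes, the high-level `pipeline` API should also work for quick experiments. Treat this as an untested sketch; `chunk_length_s` enables transcribing audio longer than Whisper's 30-second window:

```python
from transformers import pipeline

# Assumes the same checkpoint as the example above.
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    device=0,
    chunk_length_s=30,
)
result = asr("sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```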
📄 License
This project is licensed under the Apache 2.0 license.