🚀 AudioX: Multilingual Speech-to-Text Model
AudioX is a cutting-edge Indic multilingual automatic speech recognition (ASR) model family developed by Jivi AI. It consists of two specialized variants, AudioX-North and AudioX-South, each optimized for specific Indian languages to enhance accuracy. AudioX-North supports Hindi, Gujarati, and Marathi, while AudioX-South covers Tamil, Telugu, Kannada, and Malayalam. Trained on a combination of open-source ASR datasets and proprietary audio, AudioX models offer robust transcription across accents and acoustic conditions, delivering industry-leading performance in the supported languages.
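Picking the variant that covers your target language might look like the sketch below. Only `jiviai/audioX-north-v1` appears in the usage example further down; the south repository id is assumed by analogy and may differ, so check the Jivi AI hub page for the exact name.

```python
# Hypothetical helper for routing an ISO 639-1 language code to an AudioX variant.
# "jiviai/audioX-north-v1" is taken from the usage example below;
# "jiviai/audioX-south-v1" is an assumed id and may not match the real repo.
AUDIOX_VARIANTS = {
    "hi": "jiviai/audioX-north-v1",  # Hindi
    "gu": "jiviai/audioX-north-v1",  # Gujarati
    "mr": "jiviai/audioX-north-v1",  # Marathi
    "ta": "jiviai/audioX-south-v1",  # Tamil (assumed repo id)
    "te": "jiviai/audioX-south-v1",  # Telugu (assumed repo id)
    "kn": "jiviai/audioX-south-v1",  # Kannada (assumed repo id)
    "ml": "jiviai/audioX-south-v1",  # Malayalam (assumed repo id)
}

def variant_for(language: str) -> str:
    """Return the AudioX checkpoint covering the given language code."""
    return AUDIOX_VARIANTS[language]
```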

✨ Features
Purpose-Built for Indian Languages
AudioX is designed to handle diverse Indian language inputs, making it suitable for real-world applications such as voice assistants, transcription tools, customer service automation, and multilingual content creation. It provides high accuracy across regional accents and varying audio qualities.
Training Process
AudioX is fine-tuned using supervised learning on top of an open-source speech recognition backbone. The training pipeline incorporates domain adaptation, language balancing, and noise augmentation for robustness across real-world scenarios.
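The exact training recipe is not published, but noise augmentation of the kind described is commonly implemented by mixing background noise into clean waveforms at a random signal-to-noise ratio. A minimal sketch, assuming NumPy waveforms at a fixed sampling rate (illustrative only, not Jivi AI's actual pipeline):

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise clip to match the speech length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    # Scale noise so that 10 * log10(P_clean / P_noise) == snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: augment at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)  # stand-in for real audio
babble = rng.standard_normal(8000).astype(np.float32)   # stand-in for real noise
augmented = add_noise(speech, babble, snr_db=rng.uniform(5, 20))
```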
Data Preparation
The model is trained on:
- Open-source multilingual ASR corpora
- Proprietary Indian-language medical datasets

This hybrid approach boosts the model’s generalization across dialects and acoustic conditions (see the sketch below).
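Mixing corpora with language balancing can be approximated with 🤗 `datasets`. The dataset ids and sampling probabilities below are purely hypothetical placeholders, not the actual training mix:

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical dataset ids standing in for the open-source and proprietary corpora.
hi = load_dataset("open-source/asr-hindi", split="train", streaming=True)
gu = load_dataset("open-source/asr-gujarati", split="train", streaming=True)
med = load_dataset("jiviai/medical-audio", split="train", streaming=True)  # assumed, not public

# Language balancing: sample each source with a fixed probability instead of
# letting the largest corpus dominate the stream.
balanced = interleave_datasets([hi, gu, med], probabilities=[0.4, 0.3, 0.3], seed=42)
```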
Benchmarks
AudioX achieves top performance across multiple Indian languages, outperforming both open and commercial ASR models. We evaluated AudioX on the Vistaar benchmark using the official evaluation script from AI4Bharat’s Vistaar suite, ensuring a rigorous, standardized comparison. All numbers below are word error rates (WER, %); lower is better.
| Provider | Model | Hindi | Gujarati | Marathi | Tamil | Telugu | Kannada | Malayalam | Avg WER |
|---|---|---|---|---|---|---|---|---|---|
| Jivi AI | AudioX | 12.14 | 18.66 | 18.68 | 21.79 | 24.63 | 17.61 | 26.92 | 20.1 |
| ElevenLabs | Scribe-v1 | 13.64 | 17.96 | 16.51 | 24.84 | 24.89 | 17.65 | 28.88 | 20.6 |
| Sarvam | saarika:v2 | 14.28 | 19.47 | 18.34 | 25.73 | 26.80 | 18.95 | 32.64 | 22.3 |
| AI4Bharat | IndicWhisper | 13.59 | 22.84 | 18.25 | 25.27 | 28.82 | 18.33 | 32.34 | 22.8 |
| Microsoft | Azure STT | 20.03 | 31.62 | 27.36 | 31.53 | 31.38 | 26.45 | 41.84 | 30.0 |
| OpenAI | gpt-4o-transcribe | 18.65 | 31.32 | 25.21 | 39.10 | 33.94 | 32.88 | 46.11 | 32.5 |
| Google | Google STT | 23.89 | 36.48 | 26.48 | 33.62 | 42.42 | 31.48 | 47.90 | 34.6 |
| OpenAI | Whisper Large v3 | 32.00 | 53.75 | 78.28 | 52.44 | 179.58 | 67.02 | 142.98 | 86.6 |
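For reference, WER numbers of this kind can be computed locally with the `jiwer` package. Note this is a generic sketch: the Vistaar suite applies its own text normalization, so raw `jiwer` scores may differ slightly from the table above.

```python
import jiwer

reference = "मरीज़ को बुखार है"   # ground-truth transcript
hypothesis = "मरीज को बुखार है"   # model output
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```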
📦 Installation
AudioX checkpoints load directly through 🤗 `transformers`; the usage example below additionally needs `librosa` for audio loading and PyTorch as the backend. A typical setup (assuming pip) is:
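```bash
pip install transformers librosa torch
```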
💻 Usage Examples
Basic Usage
You can easily run inference using the 🤗 `transformers` and `librosa` libraries. Here's a minimal example to get started:
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

device = "cuda"

# Load the processor and model, and move the model to the GPU.
processor = WhisperProcessor.from_pretrained("jiviai/audioX-north-v1")
model = WhisperForConditionalGeneration.from_pretrained("jiviai/audioX-north-v1").to(device)
model.config.forced_decoder_ids = None

# Load the audio and resample to the 16 kHz rate the model expects.
audio_path = "sample.wav"
audio_np, sr = librosa.load(audio_path, sr=None)
if sr != 16000:
    audio_np = librosa.resample(audio_np, orig_sr=sr, target_sr=16000)

# Extract log-mel input features and run generation for Hindi transcription.
input_features = processor(audio_np, sampling_rate=16000, return_tensors="pt").to(device).input_features
predicted_ids = model.generate(input_features, task="transcribe", language="hi")

# Decode the predicted token ids back to text.
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
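Since AudioX loads through the standard Whisper classes, the high-level `pipeline` API should also work for quick experiments. Treat this as an untested sketch; `chunk_length_s` enables transcribing audio longer than Whisper's 30-second window:

```python
from transformers import pipeline

# Assumes the same checkpoint as the example above.
asr = pipeline(
    "automatic-speech-recognition",
    model="jiviai/audioX-north-v1",
    device=0,
    chunk_length_s=30,
)
result = asr("sample.wav", generate_kwargs={"task": "transcribe", "language": "hi"})
print(result["text"])
```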
📄 License
This project is licensed under the Apache 2.0 license.