Whisper Hindi2Hinglish Swift
A Hindi-Hinglish mixed speech recognition model built on the Whisper architecture, designed for Indian accents and noisy environments
Model Overview
This model is a fine-tuned version of Whisper-base that transcribes Hindi speech into colloquial Hindi-English (Hinglish) mixed text, suited to speech recognition scenarios in India.
Model Features
Hindi-English mixed language support
Transcribes audio into colloquial Hindi-English mixed text, reducing the likelihood of grammatical errors
Noise environment optimization
Specially optimized for common background noise environments in India, improving recognition accuracy in noisy scenarios
Hallucination suppression
Minimizes transcription hallucinations through training techniques, enhancing the accuracy of output text
Dynamic layer freezing technology
Uses an innovative training technique to achieve rapid convergence and efficient fine-tuning
Model Capabilities
Hindi speech recognition
Hindi-English mixed text generation
Speech transcription in noisy environments
Long audio processing
Use Cases
Speech transcription services
Customer service call transcription
Transcribing customer service call content in India into text records
Maintains high recognition accuracy in noisy environments
Meeting minutes
Automatically generating Hindi-English mixed meeting summaries
Supports multi-speaker dialogue scenarios
Voice assistants
Localized voice command recognition
Providing more accurate voice command recognition for users in India
Supports Hindi-English mixed colloquial expressions
Model card metadata (YAML front matter):

```yaml
language:
- en
- hi
tags:
- audio
- automatic-speech-recognition
- whisper-event
- pytorch
inference: true
model-index:
- name: Whisper-Hindi2Hinglish-Swift
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs
      type: google/fleurs
      config: hi_in
      split: test
    metrics:
    - type: wer
      value: 35.0888
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_20_0
      type: mozilla-foundation/common_voice_20_0
      config: hi
      split: test
    metrics:
    - type: wer
      value: 38.6549
      name: WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Indic-Voices
      type: Indic-Voices
      config: hi
      split: test
    metrics:
    - type: wer
      value: 65.2147
      name: WER
widget:
- src: audios/f89b6428-c58a-4355-ad63-0752b69f2d30.wav
  output:
    text: vah bas din mein kitni baar chalti hai?
- src: audios/09cf2547-9d09-4914-926a-cf2043549c15.wav
  output:
    text: >-
      Salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise?
- src: audios/6f7df89f-91a7-4cbd-be43-af7bce71a34b.wav
  output:
    text: vah roya aur aur roya.
- src: audios/969bede5-d816-461b-9bf2-bd115e098439.wav
  output:
    text: helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut.
- src: audios/cef43941-72c9-4d28-88dd-cb62808dc056.wav
  output:
    text: usne mujhe chithi ka javaab na dene ke lie daanta.
- src: audios/b27d49fe-fced-4a17-9887-7bfbc5d4a899.wav
  output:
    text: puraana shahar divaaron se ghera hua hai.
- src: audios/common_voice_hi_23796065.mp3
  example_title: Speech Example 1
- src: audios/common_voice_hi_41666099.mp3
  example_title: Speech Example 2
- src: audios/common_voice_hi_41429198.mp3
  example_title: Speech Example 3
- src: audios/common_voice_hi_41429259.mp3
  example_title: Speech Example 4
- src: audios/common_voice_hi_40904697.mp3
  example_title: Speech Example 5
pipeline_tag: automatic-speech-recognition
license: apache-2.0
metrics:
- wer
base_model:
- openai/whisper-base
library_name: transformers
```
Whisper-Hindi2Hinglish-Swift:
- GITHUB LINK: github link
- SPEECH-TO-TEXT ARENA: Speech-To-Text Arena
Key Features:
- Hinglish as a language: Adds the ability to transcribe audio into spoken Hinglish, reducing the chance of grammatical errors
- Whisper Architecture: Based on the Whisper architecture, making it easy to use with the transformers package
- Hallucination Mitigation: Minimizes transcription hallucinations to enhance accuracy.
- Performance Increase: ~57% average performance increase versus the pretrained model across benchmark datasets
Training:
Data:
- Duration: A total of ~550 hours of noisy, Indian-accented Hindi audio was used to finetune the model.
- Collection: Because few ASR-ready Hinglish datasets are available, a specially curated proprietary dataset was used.
- Labelling: The data was labelled using a SOTA model, and the transcriptions were then improved through human review.
- Quality: Emphasis was placed on collecting noisy data, since the model is intended for Indian environments where background noise is abundant.
- Processing: All audio was chunked into segments shorter than 30 s, with at most 2 speakers per clip (a minimal chunking sketch follows this list). No further processing was done, so as not to alter the quality of the source data.
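The card does not specify the tooling used for this chunking step. As a rough illustration only, a minimal sketch of fixed 30-second chunking with pydub (an assumed library, not necessarily what the authors used) could look like this:

```python
from pydub import AudioSegment

MAX_CHUNK_MS = 30 * 1000  # keep every chunk under 30 s, matching the processing step above

def chunk_audio(path, out_prefix="chunk"):
    """Split one audio file into consecutive segments no longer than 30 s."""
    audio = AudioSegment.from_file(path)
    chunks = [audio[start:start + MAX_CHUNK_MS] for start in range(0, len(audio), MAX_CHUNK_MS)]
    for idx, chunk in enumerate(chunks):
        chunk.export(f"{out_prefix}_{idx:03d}.wav", format="wav")
    return len(chunks)

# "raw_call_recording.wav" is a hypothetical input file
print(chunk_audio("raw_call_recording.wav"), "chunks written")
```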
Finetuning:
- Novel Trainer Architecture: A custom trainer was written to ensure efficient supervised finetuning, with custom callbacks to enable higher observability during the training process.
- Custom Dynamic Layer Freezing: The most active layers in the model were identified by running inference on a subset of the training data with the pretrained model. These layers were kept unfrozen during training while all other layers were kept frozen, enabling faster convergence and efficient finetuning (a minimal sketch of this idea follows this list).
- Deepspeed Integration: Deepspeed was also utilized to speed up and optimize the training process.
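The exact layer-selection procedure is not published; the sketch below only illustrates the general idea of freezing everything except a chosen set of layers. The layer prefixes listed are hypothetical placeholders, not the layers the authors actually identified:

```python
from transformers import AutoModelForSpeechSeq2Seq

def freeze_except(model, active_prefixes):
    """Freeze every parameter whose name does not start with one of the given prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in active_prefixes)

model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base")

# Hypothetical "most active" layers; in the described approach these would be
# identified by running inference on a subset of the training data.
active_prefixes = [
    "model.decoder.layers.4",
    "model.decoder.layers.5",
    "model.decoder.layer_norm",
]
freeze_except(model, active_prefixes)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```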
Performance Overview
Qualitative Performance Overview
| Audio | Whisper Base | Whisper-Hindi2Hinglish-Swift |
|---|---|---|
| (audio sample) | وہاں بس دن میں کتنی بار چلتی ہے | vah bas din mein kitni baar chalti hai? |
| (audio sample) | سلمان کی ایمیت سے پراوہویت ہوتے ہیں اس کمپنی کے سیر بھاؤ جانے کیسے | salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise? |
| (audio sample) | تو لویا تو لویا | vah roya aur aur roya. |
| (audio sample) | حلمت نہ پیننے سے بھارت میں ہر گنٹے ہوتی ہے چار لوگوں کی موت | helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut. |
| (audio sample) | اوستہ مجھے چٹھیکہ جواب نہ دینے کے لیٹانٹہ | usne mujhe chithi ka javaab na dene ke lie daanta. |
| (audio sample) | پرانا شاہ دیواروں سے گیرا ہوا ہے | puraana shahar divaaron se ghera hua hai. |
Quantitative Performance Overview
Note:
- The WER scores below are for the Hinglish text generated by our model and the original Whisper model.
- To check our model's real-world performance against other SOTA models, please head to our Speech-To-Text Arena space.
| Dataset | Whisper Base (WER) | Whisper-Hindi2Hinglish-Swift (WER) |
|---|---|---|
| Common-Voice | 106.7936 | 38.6549 |
| FLEURS | 104.2783 | 35.0888 |
| Indic-Voices | 110.8399 | 65.2147 |
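For reference, the ~57% figure quoted under Key Features appears to match the average relative WER reduction over these three datasets; a quick check using the numbers from the table above:

```python
# Relative WER reduction of Whisper-Hindi2Hinglish-Swift vs. Whisper Base,
# using the values from the table above.
scores = {
    "Common-Voice": (106.7936, 38.6549),
    "FLEURS": (104.2783, 35.0888),
    "Indic-Voices": (110.8399, 65.2147),
}

reductions = {name: 100 * (base - ours) / base for name, (base, ours) in scores.items()}
for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% relative WER reduction")

print(f"Average: {sum(reductions.values()) / len(reductions):.1f}%")  # ~57%
```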
Usage:
Using Transformers
- To run the model, first install the Transformers library:

```bash
pip install --upgrade transformers
```

- The model can be used with the `pipeline` class to transcribe audio of arbitrary length:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,  # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,   # Optimize memory usage during loading
    use_safetensors=True      # Use safetensors format for better security
)
model.to(device)  # Move model to the selected device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create the speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",  # Set task to transcription
        "language": "en"       # Specify English language
    }
)

# Process an audio file and print the transcription
sample = "sample.wav"  # Input audio file path
result = pipe(sample)  # Run inference
print(result["text"])  # Print transcribed text
```
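Building on the `pipe` object created above, long recordings can also be processed with chunked inference; the `chunk_length_s` and `batch_size` values below are illustrative, not settings prescribed by this model card:

```python
# Chunked long-form inference (illustrative parameters)
result = pipe("long_meeting_recording.wav", chunk_length_s=30, batch_size=8)
print(result["text"])
```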
Using the OpenAI Whisper module
- First, install the openai-whisper library:

```bash
pip install -U openai-whisper tqdm
```

- Convert the Hugging Face checkpoint to a PyTorch model:
```python
import json
import re
from collections import OrderedDict

import torch
from tqdm import tqdm
from transformers import AutoModelForSpeechSeq2Seq

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,                   # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,                    # Vocabulary size
        "n_audio_ctx": config.max_source_positions,      # Max audio context length
        "n_audio_state": config.d_model,                 # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,          # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,       # Max text context length
        "n_text_state": config.d_model,                  # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,   # Text decoder attention heads
        "n_text_layer": config.decoder_layers,           # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}
    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")   # Remove 'model.' prefix
        new_key = reverse_translate(key)  # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,  # Optimize memory usage
    use_safetensors=True     # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Swift.pt"
save_model(model, model_save_path)
```
- Transcribe:

```python
import whisper

# Load the converted model with openai-whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Swift.pt")
result = model.transcribe("sample.wav")
print(result["text"])
```
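Standard openai-whisper options can also be passed through `transcribe` on the `model` loaded above; the values below are illustrative rather than recommended settings:

```python
# Optional decoding settings supported by openai-whisper (illustrative values)
result = model.transcribe("sample.wav", task="transcribe", language="en", fp16=False, verbose=True)
print(result["text"])
```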
Miscellaneous
This model is from a family of transformers-based ASR models trained by Oriserve. To compare this model against other models from the same family, or against other SOTA models, please head to our Speech-To-Text Arena. To learn more about our other models, or for any other queries regarding AI voice agents, you can reach out to us at ai-team@oriserve.com.