# 🚀 NavaiSTT-1v Medium - Uzbek Speech-to-Text Model
A fine-tuned Whisper medium model for high-quality Uzbek speech transcription.
This is a classic Whisper medium model fine-tuned specifically for the Uzbek language. The training dataset encompasses around 700 hours of diverse audio, including publicly available podcasts, Tashkent dialect podcasts, audiobooks, and the Common Voice 17 dataset. Transcription quality is mixed: 60% of the data is human-transcribed and 40% is pseudo-transcribed using Gemini 2.5 Pro. Special emphasis was placed on Tashkent dialect audio materials, which enables the model to perform strongly on this dialect. Future versions aim to incorporate other regional dialects to enhance overall coverage.
## 📚 Documentation

For more details on the methodology and research behind this model, visit: https://uz-speech.web.app/navaistt01m
## ✨ Features

### Model Details
| Property | Details |
|----------|---------|
| Model Type | Whisper Medium |
| Parameters | 769M |
| WER | ~13% |
| CER | ~3.5% |
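If you want to check these metrics against your own test set, word and character error rates can be computed with the `jiwer` library. The snippet below is a minimal sketch; the reference/hypothesis pairs are placeholders, not the benchmark data behind the numbers above.

```python
# Sketch: computing WER/CER with jiwer.
# The transcript pairs below are illustrative placeholders only.
import jiwer

references = [
    "salom dunyo",            # ground-truth transcripts
    "bugun havo juda issiq",
]
hypotheses = [
    "salom dunya",            # model outputs for the same audio
    "bugun havo juda issiq",
]

wer = jiwer.wer(references, hypotheses)  # word error rate
cer = jiwer.cer(references, hypotheses)  # character error rate
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```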
### Training Data

This model was fine-tuned on approximately 700 hours of diverse Uzbek audio data, which includes:
- Publicly available podcasts
- Tashkent dialect podcasts
- Audiobooks
- Common Voice 17 dataset
The dataset is composed of 60% human-transcribed and 40% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio materials to ensure excellent performance on this dialect.
## 💻 Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the processor and fine-tuned model from the Hugging Face Hub
processor = WhisperProcessor.from_pretrained("islomov/navaistt_v1_medium")
model = WhisperForConditionalGeneration.from_pretrained("islomov/navaistt_v1_medium")


def transcribe_audio(audio_path):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Load the audio and resample to Whisper's expected 16 kHz
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Convert the waveform into log-mel input features
    input_features = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features.to(device)

    # Generate token IDs (forcing Uzbek) and decode them into text
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="uz")
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


if __name__ == "__main__":
    audio_file = "some_audio_max_30_sec.wav"
    text = transcribe_audio(audio_file)
    print(f"Transcription: {text}")
```
## 🔮 Future Improvements
Future versions will include more regional Uzbek dialects to improve overall coverage.
## 📄 License

This project is licensed under the Apache-2.0 license.