# 🚀 NavaiSTT-1v Medium - Uzbek Speech-to-Text Model
A fine-tuned Whisper medium model for high-quality Uzbek speech transcription.
This is a classic Whisper medium model fine-tuned specifically for the Uzbek language. The training dataset encompasses around 700 hours of diverse audio, including publicly available podcasts, Tashkent dialect podcasts, audiobooks, and the Common Voice 17 dataset. Transcription quality is mixed: 60% of the data is human-transcribed and 40% is pseudo-transcribed using Gemini 2.5 Pro. Special emphasis was placed on Tashkent dialect audio materials, which enables the model to perform strongly on this dialect. Future versions aim to incorporate other regional dialects to enhance overall coverage.
## 📚 Documentation

For more details on the methodology and research behind this model, visit: https://uz-speech.web.app/navaistt01m
## ✨ Features

### Model Details
| Property | Details |
|----------|---------|
| Model Type | Whisper Medium |
| Parameters | 769M |
| WER | ~13% |
| CER | ~3.5% |
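If you want to check these metrics against your own test set, word and character error rates can be computed with the `jiwer` library. The snippet below is a minimal sketch; the reference/hypothesis pairs are placeholders, not the benchmark data behind the numbers above.

```python
# Sketch: computing WER/CER with jiwer.
# The transcript pairs below are illustrative placeholders only.
import jiwer

references = [
    "salom dunyo",            # ground-truth transcripts
    "bugun havo juda issiq",
]
hypotheses = [
    "salom dunya",            # model outputs for the same audio
    "bugun havo juda issiq",
]

wer = jiwer.wer(references, hypotheses)  # word error rate
cer = jiwer.cer(references, hypotheses)  # character error rate
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```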
### Training Data

This model was fine-tuned on approximately 700 hours of diverse Uzbek audio data, which includes:
- Publicly available podcasts
- Tashkent dialect podcasts
- Audiobooks
- Common Voice 17 dataset
The dataset is composed of 60% human-transcribed and 40% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio materials to ensure excellent performance on this dialect.
## 💻 Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the processor and fine-tuned model from the Hugging Face Hub
processor = WhisperProcessor.from_pretrained("islomov/navaistt_v1_medium")
model = WhisperForConditionalGeneration.from_pretrained("islomov/navaistt_v1_medium")


def transcribe_audio(audio_path):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Load the audio and resample to Whisper's expected 16 kHz
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Convert the waveform into log-mel input features
    input_features = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features.to(device)

    # Generate token IDs (forcing Uzbek) and decode them into text
    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="uz")
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


if __name__ == "__main__":
    audio_file = "some_audio_max_30_sec.wav"
    text = transcribe_audio(audio_file)
    print(f"Transcription: {text}")
```
## 🔮 Future Improvements
Future versions will include more regional Uzbek dialects to improve overall coverage.
## 📄 License

This project is licensed under the Apache-2.0 license.