# Real-time speech generation

Qwen2.5 Omni 7B AWQ

Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving multiple modalities including text, images, audio, and video, while generating text and natural speech responses in a streaming manner.

Multimodal Fusion

Transformers English

Spark TTS 0.5B 8bit

A text-to-speech model in MLX format, supporting both English and Chinese, converted from prince-canuma/Spark-TTS-0.5B.

Speech Synthesis Supports Multiple Languages

Spark TTS 0.5B 4 6bit

Spark-TTS-0.5B-4-6bit is a text-to-speech model in MLX format, supporting both English and Chinese.

Speech Synthesis Supports Multiple Languages

Muyan TTS SFT Q8 0 GGUF

A GGUF-format text-to-speech model converted from MYZY-AI/Muyan-TTS-SFT, supporting Chinese speech synthesis.

Speech Synthesis

Kokoro is an open-source text-to-speech model with 82 million parameters, delivering sound quality comparable to large models through a lightweight architecture while significantly improving speed and cost efficiency.

Speech Synthesis English

Llasa 1B Q8 0 GGUF

This model is converted from HKUST-Audio/Llasa-1B into GGUF format, primarily designed for text-to-speech tasks.

Speech Synthesis Supports Multiple Languages

Hindi Text To Speech Tts

A Hindi text-to-speech model fine-tuned from microsoft/speecht5_tts.

Speech Synthesis

XTTS V2 Argentinian Spanish

ⓍTTS is a speech generation model that can clone a voice from just 6 seconds of audio and apply it across languages, without requiring hours of training data.

Speech Synthesis Spanish

Mms Tts Nova Train

A Shan-language text-to-speech (TTS) model that converts Shan text into natural-sounding speech.

Speech Synthesis

Transformers Other

Speecht5 Tts Commonvoice Ca

A Catalan text-to-speech model based on the SpeechT5 architecture, fine-tuned on the Common Voice 11.0 dataset.

Speech Synthesis

Transformers Other

HiFi-GAN is a generative adversarial network (GAN) that generates high-quality audio from mel-spectrograms, making it suitable as a vocoder in text-to-speech systems.

Speech Synthesis English

A HiFi-GAN vocoder model trained on the LJ Speech dataset for high-quality speech synthesis.

Speech Synthesis

Transformers English

© 2025 AIbase