The Best 49 Text-to-Audio Tools in 2025

Phi 4 Multimodal Instruct
MIT
Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. It accepts text, image, and audio inputs and generates text outputs, with a context length of 128K tokens.
Text-to-Audio Transformers Supports Multiple Languages
P
microsoft
584.02k
1,329
Ultravox V0 5 Llama 3 2 1b
MIT
Ultravox is a multimodal voice large language model based on Llama3.2-1B and Whisper-large-v3, capable of processing both voice and text inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
167.25k
21
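The Ultravox checkpoints above are published for use with the Transformers `pipeline` API with remote code enabled, per the fixie-ai model cards. A minimal loading sketch (the loader is only defined, not invoked, since the weights download on first call; the function name is illustrative):

```python
def load_ultravox(model_id: str = "fixie-ai/ultravox-v0_5-llama-3_2-1b"):
    """Build a Transformers pipeline for an Ultravox checkpoint.

    Weights are fetched from the Hugging Face Hub on the first call,
    so this sketch only defines the loader rather than running it.
    """
    import transformers  # requires: pip install transformers peft librosa

    # Ultravox ships custom pipeline code, hence trust_remote_code=True
    return transformers.pipeline(model=model_id, trust_remote_code=True)
```

At inference time the pipeline expects a dict with `audio`, `turns`, and `sampling_rate` keys, as described on the model cards.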
Seamless M4t V2 Large
SeamlessM4T v2 is a large-scale multilingual multimodal machine translation model released by Facebook, supporting speech and text translation for nearly 100 languages.
Text-to-Audio Transformers Supports Multiple Languages
S
facebook
64.59k
821
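SeamlessM4T v2 has first-class support in recent Transformers releases via the `SeamlessM4Tv2Model` class. A sketch of text-to-speech translation based on that documented integration (defined but not run here, since the checkpoint is large; language codes follow SeamlessM4T's three-letter convention):

```python
def translate_text_to_speech(text: str, src_lang: str = "eng", tgt_lang: str = "fra"):
    """Generate translated speech from text with SeamlessM4T v2.

    Downloads the facebook/seamless-m4t-v2-large checkpoint on first
    call, so the function is only defined in this sketch.
    """
    # requires: pip install transformers sentencepiece torch
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # With tgt_lang set, generate() returns audio waveforms by default
    return model.generate(**inputs, tgt_lang=tgt_lang)[0]
```

The same processor/model pair also accepts audio inputs, covering the speech-to-speech and speech-to-text directions the card describes.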
Ultravox V0 3
MIT
Ultravox is a multimodal speech large language model built upon Llama3.1-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Text-to-Audio Transformers English
U
fixie-ai
48.30k
17
Ultravox V0 5 Llama 3 1 8b
MIT
Ultravox is a multimodal voice large language model built on Llama3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both voice and text inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
17.86k
12
Hf Seamless M4t Medium
SeamlessM4T is a multilingual translation model that supports both speech and text input/output, enabling cross-language communication.
Text-to-Audio Transformers
H
facebook
14.74k
30
Granite Speech 3.3 8b
Apache-2.0
A compact and efficient speech-language model designed for Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST), featuring a two-stage design for processing audio and text.
Text-to-Audio Transformers English
G
ibm-granite
5,532
35
Voila Tokenizer
MIT
Voila is a large-scale voice-language foundation model series designed to enhance human-computer interaction, supporting multiple audio tasks and languages.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
4,912
3
Hf Seamless M4t Large
SeamlessM4T is a unified model supporting multilingual speech and text translation, capable of performing speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation tasks.
Text-to-Audio Transformers
H
facebook
4,648
57
Minicpm O 2 6 Int4
The int4 quantized version of MiniCPM-o 2.6, significantly reducing GPU VRAM usage while supporting multimodal processing capabilities.
Text-to-Audio Transformers Other
M
openbmb
4,249
42
Meralion AudioLLM Whisper SEA LION
Other
A speech-to-text large language model customized for Singapore's multilingual and multicultural environment, integrating a Whisper-large-v2 speech encoder and a SEA-LION V3 text decoder.
Text-to-Audio Transformers
M
MERaLiON
2,828
12
Diva Llama 3 V0 8b
DiVA Llama 3 is an end-to-end voice assistant model capable of processing both speech and text inputs, trained using distillation loss.
Text-to-Audio Transformers
D
WillHeld
2,596
34
Voila Chat
MIT
Voila is a large-scale speech-language foundation model series designed to improve human-computer interaction.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
2,423
32
Riffusion Model V1
OpenRAIL
Riffusion is a real-time music generation application based on Stable Diffusion technology, capable of generating spectrograms from text input and converting them into audio clips.
Text-to-Audio
R
riffusion
2,354
620
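Because the Riffusion checkpoint is a fine-tuned Stable Diffusion model, it can be loaded with the standard diffusers pipeline; the image it generates is a mel spectrogram. A loading sketch (defined but not executed, as the weights are large; turning the spectrogram into audio requires Riffusion's own spectrogram-to-waveform utilities, which are outside this sketch):

```python
def load_riffusion(device: str = "cpu"):
    """Load riffusion/riffusion-model-v1 as a plain Stable Diffusion pipeline.

    Prompting the returned pipeline yields a spectrogram image rather
    than a waveform; audio reconstruction is a separate step.
    """
    from diffusers import StableDiffusionPipeline  # pip install diffusers torch

    pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
    return pipe.to(device)
```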
Audiox
AudioX is a unified diffusion transformer model capable of generating audio and music from arbitrary content. It produces high-quality general audio and musical compositions, offers flexible natural language control, and seamlessly handles multimodal inputs.
Text-to-Audio
A
HKUSTAudio
2,189
49
Emova Speech Tokenizer Hf
Apache-2.0
EMOVA Speech Tokenizer is a discrete speech tokenizer supporting both English and Chinese, featuring semantic-acoustic decoupling design and flexible speech style control.
Text-to-Audio Transformers Supports Multiple Languages
E
Emova-ollm
895
2
Llama3.1 Typhoon2 Audio 8b Instruct
Typhoon2-Audio is an end-to-end speech-to-speech model capable of processing audio, speech, and text inputs while simultaneously generating both text and speech outputs. The model is optimized for Thai while also supporting English.
Text-to-Audio Transformers Supports Multiple Languages
L
scb10x
664
9
Ultravox V0 6 Gemma 3 27b
MIT
Ultravox is a multimodal large speech language model that can process both speech and text inputs simultaneously, providing strong support for speech interaction scenarios.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
641
2
Ichigo Llama3.1 S Instruct V0.4
Apache-2.0
A multimodal language model based on the Llama-3 architecture, supporting audio and text input understanding with noise robustness and multi-turn dialogue capabilities.
Text-to-Audio Safetensors English
I
homebrewltd
486
19
Cnn8rnn W2vmean Audiocaps Grounding
Apache-2.0
This is a text-to-audio grounding model capable of predicting the probability of specific sound events occurring in audio segments.
Text-to-Audio Transformers English
C
wsntxxn
456
2
Text To Music
MIT
A text-conditioned symbolic music generation model based on the BART-base architecture that can generate ABC notation scores from natural language descriptions.
Text-to-Audio Transformers English
T
sander-wood
405
143
Phi 4 Multimodal Instruct Ko Asr
A Korean automatic speech recognition (ASR) and speech translation (AST) model fine-tuned based on microsoft/Phi-4-multimodal-instruct, demonstrating excellent performance on the zeroth-korean and fleurs datasets.
Text-to-Audio Transformers Korean
P
junnei
354
3
Voila Autonomous Preview
MIT
Voila is a large family of speech-language foundation models designed to enhance human-computer interaction, supporting real-time, low-latency voice interaction and multilingual processing.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
332
8
Qwen2 Audio 7B Instruct I1 GGUF
Apache-2.0
A weighted/imatrix-quantized GGUF version of Qwen2-Audio-7B-Instruct, supporting English audio-to-text transcription tasks.
Text-to-Audio Transformers English
Q
mradermacher
282
0
Speechllm 2B
Apache-2.0
SpeechLLM is a multimodal large language model trained to predict speaker turn metadata in conversations, including speech activity, transcribed text, speaker gender, age, accent, and emotion.
Text-to-Audio Transformers English
S
skit-ai
237
16
Ultravox V0 4 1 Llama 3 1 70b
MIT
Ultravox is a multimodal speech large language model, built upon the pre-trained Llama3.1-70B-Instruct and whisper-large-v3-turbo backbones, capable of receiving both speech and text as inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
204
24
Ultravox V0 6 Llama 3 3 70b
MIT
Ultravox is a large multimodal speech language model that combines a pre-trained large language model and a speech encoder, capable of handling both speech and text inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
196
0
Voila Audio Alpha
MIT
Voila is a large family of speech-language foundation models designed to enhance human-computer interaction, supporting real-time, low-latency voice interaction and multilingual processing.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
175
3
Mustango
Apache-2.0
Mustango is a controllable text-to-music generation system that combines a Latent Diffusion Model (LDM), Flan-T5, and music-feature conditioning to achieve high-quality text-to-music generation.
Text-to-Audio Transformers
M
declare-lab
165
40
Songcomposer Sft
Apache-2.0
A language model based on InternLM2, specifically designed for generating lyrics and melodies in song composition.
Text-to-Audio Transformers Supports Multiple Languages
S
Mar2Ding
113
13
Gazelle V0.2
Apache-2.0
Gazelle v0.2 is a joint speech-language model released by Tincans, supporting English.
Text-to-Audio Transformers English
G
tincans-ai
90
99
SIMS Llama3.2 3B
A fine-tuned speech-language model based on Llama-3.2-3B, from work analyzing the scaling of interleaved speech-text SLMs; it supports both speech and text generation tasks.
Text-to-Audio Transformers English
S
slprl
54
1
SIMS 7B
MIT
A speech-language model extended from Qwen2.5-7B, supporting interleaved speech-text training and cross-modal generation.
Text-to-Audio Transformers English
S
slprl
51
1
Speechgpt 7B Cm
SpeechGPT is a large language model with intrinsic cross-modal dialogue capabilities, capable of perceiving and generating multimodal content, supporting interaction via speech and text.
Text-to-Audio Transformers
S
fnlp
47
7
Riffusion Musiccaps
This is a Riffusion model fine-tuned on the google/MusicCaps dataset, capable of generating music or music-related images based on text prompts.
Text-to-Audio TensorBoard English
R
Hyeon2
47
5
Ichigo Llama3.1 S Instruct V0.4
Apache-2.0
A multimodal language model based on the Llama-3 architecture, supporting audio and text input comprehension with enhanced robustness in noisy environments and multi-turn conversation capabilities.
Text-to-Audio English
I
Menlo
44
20
Ichigo Llama3.1 S Instruct V0.3 Phase 3
Apache-2.0
Ichigo-llama3s is a large language model series that supports both audio and text input, focusing on enhancing speech understanding capabilities and user interaction experience.
Text-to-Audio English
I
homebrewltd
43
35
Speechllm 1.5B
Apache-2.0
SpeechLLM is a multimodal large language model designed to predict speaker turn metadata in conversations, including speech activity, transcribed text, gender, age, accent, and emotion.
Text-to-Audio Transformers English
S
skit-ai
40
7
Seamless M4t V2 Large
SeamlessM4T is a large-scale multilingual multimodal machine translation model supporting speech and text translation in nearly 100 languages.
Text-to-Audio Supports Multiple Languages
S
audo
39
17
Speechgpt 7B Ma
SpeechGPT is a large language model with intrinsic cross-modal dialogue capabilities, capable of perceiving and generating multimodal content based on human instructions.
Text-to-Audio Transformers
S
fnlp
37
5
Ultravox V0 5 Llama 3 3 70b Tempfix
MIT
Ultravox is a multimodal speech large language model capable of receiving both speech and text as input, supporting multiple languages and tasks.
Text-to-Audio Transformers Supports Multiple Languages
U
zhuexe
35
0
Music Generation Model
Apache-2.0
This is a hybrid model created by merging a text generation model and a music generation model, capable of handling both text and music generation tasks.
Text-to-Audio Transformers
M
nagayama0706
27
1
AIbase
Empowering the Future, Your AI Solution Knowledge Base
© 2025 AIbase