The Best 49 Text-to-Audio Tools in 2025

Phi 4 Multimodal Instruct
MIT
Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. It accepts text, image, and audio inputs and generates text outputs, with a context length of 128K tokens.
Text-to-Audio Transformers Supports Multiple Languages
P
microsoft
584.02k
1,329
Ultravox V0 5 Llama 3 2 1b
MIT
Ultravox is a multimodal voice large language model based on Llama3.2-1B and Whisper-large-v3, capable of processing both voice and text inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
167.25k
21
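The Ultravox checkpoints above are published for use with the Transformers `pipeline` API with remote code enabled, per the fixie-ai model cards. A minimal loading sketch (the loader is only defined, not invoked, since the weights download on first call; the function name is illustrative):

```python
def load_ultravox(model_id: str = "fixie-ai/ultravox-v0_5-llama-3_2-1b"):
    """Build a Transformers pipeline for an Ultravox checkpoint.

    Weights are fetched from the Hugging Face Hub on the first call,
    so this sketch only defines the loader rather than running it.
    """
    import transformers  # requires: pip install transformers peft librosa

    # Ultravox ships custom pipeline code, hence trust_remote_code=True
    return transformers.pipeline(model=model_id, trust_remote_code=True)
```

At inference time the pipeline expects a dict with `audio`, `turns`, and `sampling_rate` keys, as described on the model cards.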
Seamless M4t V2 Large
SeamlessM4T v2 is a large-scale multilingual multimodal machine translation model released by Facebook, supporting speech and text translation for nearly 100 languages.
Text-to-Audio Transformers Supports Multiple Languages
S
facebook
64.59k
821
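SeamlessM4T v2 has first-class support in recent Transformers releases via the `SeamlessM4Tv2Model` class. A sketch of text-to-speech translation based on that documented integration (defined but not run here, since the checkpoint is large; language codes follow SeamlessM4T's three-letter convention):

```python
def translate_text_to_speech(text: str, src_lang: str = "eng", tgt_lang: str = "fra"):
    """Generate translated speech from text with SeamlessM4T v2.

    Downloads the facebook/seamless-m4t-v2-large checkpoint on first
    call, so the function is only defined in this sketch.
    """
    # requires: pip install transformers sentencepiece torch
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # With tgt_lang set, generate() returns audio waveforms by default
    return model.generate(**inputs, tgt_lang=tgt_lang)[0]
```

The same processor/model pair also accepts audio inputs, covering the speech-to-speech and speech-to-text directions the card describes.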
Ultravox V0 3
MIT
Ultravox is a multimodal speech large language model built upon Llama3.1-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Text-to-Audio Transformers English
U
fixie-ai
48.30k
17
Ultravox V0 5 Llama 3 1 8b
MIT
Ultravox is a multimodal voice large language model built on Llama3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both voice and text inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
17.86k
12
Hf Seamless M4t Medium
SeamlessM4T is a multilingual translation model that supports both speech and text input/output, enabling cross-language communication.
Text-to-Audio Transformers
H
facebook
14.74k
30
Granite Speech 3.3 8b
Apache-2.0
A compact and efficient speech-language model designed for Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST), featuring a two-stage design for processing audio and text.
Text-to-Audio Transformers English
G
ibm-granite
5,532
35
Voila Tokenizer
MIT
Voila is a large-scale voice-language foundation model series designed to enhance human-computer interaction, supporting multiple audio tasks and languages.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
4,912
3
Hf Seamless M4t Large
SeamlessM4T is a unified model supporting multilingual speech and text translation, capable of performing speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation tasks.
Text-to-Audio Transformers
H
facebook
4,648
57
Minicpm O 2 6 Int4
The int4 quantized version of MiniCPM-o 2.6, significantly reducing GPU VRAM usage while supporting multimodal processing capabilities.
Text-to-Audio Transformers Other
M
openbmb
4,249
42
Meralion AudioLLM Whisper SEA LION
Other
A speech-to-text large language model customized for Singapore's multilingual and multicultural environment, integrating a Whisper-large-v2 speech encoder and a SEA-LION V3 text decoder.
Text-to-Audio Transformers
M
MERaLiON
2,828
12
Diva Llama 3 V0 8b
DiVA Llama 3 is an end-to-end voice assistant model capable of processing both speech and text inputs, trained using distillation loss.
Text-to-Audio Transformers
D
WillHeld
2,596
34
Voila Chat
MIT
Voila is a large-scale speech-language foundation model series designed to improve human-computer interaction.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
2,423
32
Riffusion Model V1
OpenRAIL
Riffusion is a real-time music generation application based on Stable Diffusion technology, capable of generating spectrograms from text input and converting them into audio clips.
Text-to-Audio
R
riffusion
2,354
620
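Because the Riffusion checkpoint is a fine-tuned Stable Diffusion model, it can be loaded with the standard diffusers pipeline; the image it generates is a mel spectrogram. A loading sketch (defined but not executed, as the weights are large; turning the spectrogram into audio requires Riffusion's own spectrogram-to-waveform utilities, which are outside this sketch):

```python
def load_riffusion(device: str = "cpu"):
    """Load riffusion/riffusion-model-v1 as a plain Stable Diffusion pipeline.

    Prompting the returned pipeline yields a spectrogram image rather
    than a waveform; audio reconstruction is a separate step.
    """
    from diffusers import StableDiffusionPipeline  # pip install diffusers torch

    pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
    return pipe.to(device)
```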
Audiox
AudioX is a unified diffusion transformer model capable of generating audio and music from arbitrary content. It produces high-quality general audio and musical compositions, offers flexible natural language control, and seamlessly handles multimodal inputs.
Text-to-Audio
A
HKUSTAudio
2,189
49
Emova Speech Tokenizer Hf
Apache-2.0
EMOVA Speech Tokenizer is a discrete speech tokenizer supporting both English and Chinese, featuring semantic-acoustic decoupling design and flexible speech style control.
Text-to-Audio Transformers Supports Multiple Languages
E
Emova-ollm
895
2
Llama3.1 Typhoon2 Audio 8b Instruct
Typhoon2-Audio is an end-to-end speech-to-speech model capable of processing audio, speech, and text inputs while simultaneously generating both text and speech outputs. The model is optimized for Thai while also supporting English.
Text-to-Audio Transformers Supports Multiple Languages
L
scb10x
664
9
Ultravox V0 6 Gemma 3 27b
MIT
Ultravox is a multimodal large speech language model that can process both speech and text inputs simultaneously, providing strong support for speech interaction scenarios.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
641
2
Ichigo Llama3.1 S Instruct V0.4
Apache-2.0
A multimodal language model based on the Llama-3 architecture, supporting audio and text input understanding with noise robustness and multi-turn dialogue capabilities.
Text-to-Audio Safetensors English
I
homebrewltd
486
19
Cnn8rnn W2vmean Audiocaps Grounding
Apache-2.0
This is a text-to-audio grounding model capable of predicting the probability of specific sound events occurring in audio segments.
Text-to-Audio Transformers English
C
wsntxxn
456
2
Text To Music
MIT
A text-conditioned symbolic music generation model based on the BART-base architecture that can generate ABC notation scores from natural language descriptions.
Text-to-Audio Transformers English
T
sander-wood
405
143
Phi 4 Multimodal Instruct Ko Asr
A Korean automatic speech recognition (ASR) and speech translation (AST) model fine-tuned based on microsoft/Phi-4-multimodal-instruct, demonstrating excellent performance on the zeroth-korean and fleurs datasets.
Text-to-Audio Transformers Korean
P
junnei
354
3
Voila Autonomous Preview
MIT
Voila is a large family of speech-language foundation models designed to enhance human-computer interaction, supporting real-time, low-latency voice interaction and multilingual processing.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
332
8
Qwen2 Audio 7B Instruct I1 GGUF
Apache-2.0
A weighted/imatrix-quantized GGUF version of Qwen2-Audio-7B-Instruct, supporting English audio-to-text transcription tasks.
Text-to-Audio Transformers English
Q
mradermacher
282
0
Speechllm 2B
Apache-2.0
SpeechLLM is a multimodal large language model trained to predict speaker turn metadata in conversations, including speech activity, transcribed text, speaker gender, age, accent, and emotion.
Text-to-Audio Transformers English
S
skit-ai
237
16
Ultravox V0 4 1 Llama 3 1 70b
MIT
Ultravox is a multimodal speech large language model, built upon the pre-trained Llama3.1-70B-Instruct and whisper-large-v3-turbo backbones, capable of receiving both speech and text as inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
204
24
Ultravox V0 6 Llama 3 3 70b
MIT
Ultravox is a large multimodal speech language model that combines a pre-trained large language model and a speech encoder, capable of handling both speech and text inputs.
Text-to-Audio Transformers Supports Multiple Languages
U
fixie-ai
196
0
Voila Audio Alpha
MIT
Voila is a large family of speech-language foundation models designed to enhance human-computer interaction, supporting real-time, low-latency voice interaction and multilingual processing.
Text-to-Audio Transformers Supports Multiple Languages
V
maitrix-org
175
3
Mustango
Apache-2.0
Mustango is a controllable text-to-music generation system that combines a Latent Diffusion Model (LDM), Flan-T5, and music-feature conditioning to achieve high-quality text-to-music generation.
Text-to-Audio Transformers
M
declare-lab
165
40
Songcomposer Sft
Apache-2.0
A language model based on InternLM2, specifically designed for generating lyrics and melodies in song composition.
Text-to-Audio Transformers Supports Multiple Languages
S
Mar2Ding
113
13
Gazelle V0.2
Apache-2.0
Gazelle v0.2 is a joint speech-language model released by Tincans, supporting English.
Text-to-Audio Transformers English
G
tincans-ai
90
99
SIMS Llama3.2 3B
A fine-tuned speech-language model based on Llama-3.2-3B, from work analyzing the scaling of interleaved speech-text SLMs; it supports both speech and text generation tasks.
Text-to-Audio Transformers English
S
slprl
54
1
SIMS 7B
MIT
A speech-language model extended from Qwen2.5-7B, supporting interleaved speech-text training and cross-modal generation.
Text-to-Audio Transformers English
S
slprl
51
1
Speechgpt 7B Cm
SpeechGPT is a large language model with intrinsic cross-modal dialogue capabilities, capable of perceiving and generating multimodal content, supporting interaction via speech and text.
Text-to-Audio Transformers
S
fnlp
47
7
Riffusion Musiccaps
This is a Riffusion model fine-tuned on the google/MusicCaps dataset, capable of generating music or music-related images based on text prompts.
Text-to-Audio TensorBoard English
R
Hyeon2
47
5
Ichigo Llama3.1 S Instruct V0.4
Apache-2.0
A multimodal language model based on the Llama-3 architecture, supporting audio and text input comprehension with enhanced robustness in noisy environments and multi-turn conversation capabilities.
Text-to-Audio English
I
Menlo
44
20
Ichigo Llama3.1 S Instruct V0.3 Phase 3
Apache-2.0
Ichigo-llama3s is a large language model series that supports both audio and text input, focusing on enhancing speech understanding capabilities and user interaction experience.
Text-to-Audio English
I
homebrewltd
43
35
Speechllm 1.5B
Apache-2.0
SpeechLLM is a multimodal large language model designed to predict speaker turn metadata in conversations, including speech activity, transcribed text, gender, age, accent, and emotion.
Text-to-Audio Transformers English
S
skit-ai
40
7
Seamless M4t V2 Large
SeamlessM4T is a large-scale multilingual multimodal machine translation model supporting speech and text translation in nearly 100 languages.
Text-to-Audio Supports Multiple Languages
S
audo
39
17
Speechgpt 7B Ma
SpeechGPT is a large language model with intrinsic cross-modal dialogue capabilities, capable of perceiving and generating multimodal content based on human instructions.
Text-to-Audio Transformers
S
fnlp
37
5
Ultravox V0 5 Llama 3 3 70b Tempfix
MIT
Ultravox is a multimodal speech large language model capable of receiving both speech and text as input, supporting multiple languages and tasks.
Text-to-Audio Transformers Supports Multiple Languages
U
zhuexe
35
0
Music Generation Model
Apache-2.0
This is a hybrid model created by merging a text generation model and a music generation model, capable of handling both text and music generation tasks.
Text-to-Audio Transformers
M
nagayama0706
27
1
AIbase
Empowering the Future, Your AI Solution Knowledge Base
© 2025 AIbase