# Multimodal Speech Understanding
## Ultravox v0.5 Llama 3.3 70B (Tempfix)
Ultravox is a multimodal speech large language model that accepts both speech and text as input, supporting multiple languages and tasks.
- License: MIT
- Task: Audio-to-Text
- Framework: Transformers (multilingual)
- Publisher: zhuexe (35 downloads · 0 likes)
## Ultravox v0.3
Ultravox is a multimodal speech large language model based on Llama3.1-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
- License: MIT
- Task: Audio-to-Text
- Framework: Transformers (English)
- Publisher: FriendliAI (20 downloads · 1 like)
## Ultravox v0.4.1 Llama 3.3 70B
Ultravox is a multimodal speech large language model based on Llama3.3-70B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
- License: MIT
- Task: Audio-to-Text
- Framework: Transformers (multilingual)
- Publisher: fixie-ai (26 downloads · 10 likes)
## Ultravox v0.4.1 Mistral-Nemo
Ultravox is a multimodal model based on Mistral-Nemo and Whisper, capable of processing both speech and text inputs and suitable for tasks such as voice agents and speech translation.
- License: MIT
- Task: Audio-to-Text
- Framework: Transformers (multilingual)
- Publisher: fixie-ai (1,285 downloads · 25 likes)
## Ultravox v0.4.1 Llama 3.1 70B
Ultravox is a multimodal speech large language model built on the pre-trained Llama3.1-70B-Instruct and whisper-large-v3-turbo backbones, capable of receiving both speech and text as input.
- License: MIT
- Task: Audio-to-Text
- Framework: Transformers (multilingual)
- Publisher: fixie-ai (204 downloads · 24 likes)
## Ultravox v0.4.1 Llama 3.1 8B
Ultravox is a multimodal speech large language model built on Llama3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
- License: MIT
- Task: Audio-to-Text
- Framework: Transformers (multilingual)
- Publisher: fixie-ai (747 downloads · 97 likes)
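These checkpoints are served through the standard Transformers pipeline interface. The sketch below shows how an Ultravox call might be packaged, following the input-dict shape (`audio`, `turns`, `sampling_rate`) described on the fixie-ai model cards; the `build_ultravox_input` helper and the silent test clip are illustrative assumptions, not part of any model card.

```python
# Sketch of packaging input for an Ultravox Transformers pipeline call.
import numpy as np

def build_ultravox_input(audio: np.ndarray, sampling_rate: int,
                         system_prompt: str) -> dict:
    """Bundle a waveform and chat turns into the dict the pipeline expects."""
    turns = [{"role": "system", "content": system_prompt}]
    return {"audio": audio, "turns": turns, "sampling_rate": sampling_rate}

# One second of silence at 16 kHz stands in for a real recording.
audio = np.zeros(16000, dtype=np.float32)
inputs = build_ultravox_input(audio, 16000,
                              "You are a friendly and helpful assistant.")

# Actual inference downloads the checkpoint, so it is left commented out:
# import transformers
# pipe = transformers.pipeline(
#     model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b", trust_remote_code=True
# )
# print(pipe(inputs, max_new_tokens=30))
```

The same dict shape should apply to the other Ultravox checkpoints above, since they share the custom pipeline code.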
## SpeechLLM 1.5B
SpeechLLM is a multimodal large language model designed to predict speaker-turn metadata in conversations, including speech activity, transcribed text, gender, age, accent, and emotion.
- License: Apache-2.0
- Task: Audio-to-Text
- Framework: Transformers (English)
- Publisher: skit-ai (40 downloads · 7 likes)