
Ultravox v0.4.1 Mistral Nemo

Developed by fixie-ai
Ultravox is a multimodal model based on Mistral-Nemo and Whisper, capable of processing both speech and text inputs, suitable for tasks like voice agents and speech translation.
Downloads 1,285
Release Time: 11/7/2024

Model Overview

Ultravox is a multimodal speech large language model that can receive speech and text as input and generate text output. It combines Mistral-Nemo's language understanding capabilities with Whisper's speech processing abilities.
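A toy sketch of how speech and text can share one input sequence: the prompt carries an audio placeholder token, and that token is swapped for audio-derived embedding vectors before the sequence reaches the language model. This is only an illustration of the idea. The helper names (`embed_text`, `merge_inputs`), the 4-dimensional vectors, and the made-up audio frames are all assumptions, not Ultravox's actual processor code.

```python
AUDIO_TOKEN = "<|audio|>"  # placeholder marking where the audio goes in the prompt


def embed_text(token, dim=4):
    """Hypothetical stand-in for a text embedding lookup (deterministic toy vector)."""
    base = sum(map(ord, token))
    return [float((base + i) % 7) for i in range(dim)]


def merge_inputs(prompt_tokens, audio_embeddings, dim=4):
    """Replace the single audio placeholder with the audio embedding frames."""
    if prompt_tokens.count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one audio placeholder")
    merged = []
    for tok in prompt_tokens:
        if tok == AUDIO_TOKEN:
            merged.extend(audio_embeddings)  # one vector per audio frame
        else:
            merged.append(embed_text(tok, dim))
    return merged


prompt = ["Translate", "this", ":", AUDIO_TOKEN]
audio_frames = [[0.1] * 4, [0.2] * 4, [0.3] * 4]  # made-up encoder output
sequence = merge_inputs(prompt, audio_frames)
print(len(sequence))  # three text vectors plus three audio frames
```

The language model then sees a single sequence of vectors and does not need to know which positions came from text and which from audio.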

Model Features

Multimodal Input
Accepts both speech and text inputs; audio is marked in the prompt with the special token <|audio|>, which is replaced by embeddings derived from the input audio
Multilingual Support
Supports speech and text processing in 15 languages
Efficient Inference
Time to first token of approximately 150 ms, with generation at 50-100 tokens per second
Knowledge Distillation Training
Uses a knowledge distillation loss to match the logits of the text-based Mistral backbone model
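A minimal sketch of what a knowledge distillation loss on logits can look like for a single next-token distribution: the student (the speech-conditioned model) is penalized by the KL divergence from the teacher (the text-only backbone). The KL direction, the temperature parameter, and the function names are assumptions for illustration; the listing does not specify the exact loss used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one next-token distribution."""
    p = softmax(teacher_logits, temperature)  # teacher probabilities
    q = softmax(student_logits, temperature)  # student probabilities
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]
print(kd_loss(teacher, teacher))          # identical logits give zero loss
print(kd_loss([0.0, 0.0, 0.0], teacher))  # mismatched logits give a positive loss
```

In training, a loss of this kind would be averaged over all token positions, so that the model's output given audio approaches what the text backbone would produce given the equivalent transcript.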

Model Capabilities

Speech Recognition
Speech Translation
Voice Conversation
Multilingual Processing
Text Generation

Use Cases

Voice Interaction
Voice Agent
Interacts with people as an intelligent agent capable of listening and speaking
Translation Services
Speech-to-Text Translation
Translates speech in one language into text in another language
Achieved a BLEU score of 28.39 in English-German translation
Speech Analysis
Speech Content Understanding
Analyzes speech content and generates summaries or answers