
Ultravox v0.4

Developed by fixie-ai
Ultravox is a multimodal speech large language model built on Llama3.1-8B-Instruct and Whisper-medium. It can take voice and text input together in the same prompt.
Downloads: 1,851
Release date: 8/23/2024

Model Overview

Ultravox is a multimodal model that accepts voice and text input and generates text output. It pairs a speech recognition model with a large language model, which makes it suitable for tasks such as voice agents and voice-to-voice translation.

Model Features

Multimodal input
Accepts voice and text in the same prompt. The special pseudo-token <|audio|> marks where the processor substitutes embeddings derived from the input audio.
Voice agent
Can serve as the core of a voice agent that understands spoken input. Because the model outputs text, producing a spoken reply requires a separate text-to-speech step.
Knowledge distillation
Trained with a knowledge distillation loss that aligns the model's output logits with those of the text-based Llama backbone.
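The Multimodal input feature can be illustrated with a conceptual sketch of swapping a placeholder token for audio embeddings. This is not Ultravox's actual processor code: `splice_audio_embeddings` and `AUDIO_TOKEN_ID` are hypothetical names, embeddings are plain Python lists, and the real implementation may expand `<|audio|>` into several placeholder positions before substitution rather than replacing a single token.

```python
from typing import List

Vector = List[float]

# Hypothetical id standing in for the <|audio|> pseudo-token in a tokenized prompt.
AUDIO_TOKEN_ID = -1


def splice_audio_embeddings(
    token_ids: List[int],
    token_embeds: List[Vector],
    audio_embeds: List[Vector],
    audio_token_id: int = AUDIO_TOKEN_ID,
) -> List[Vector]:
    """Replace the single audio placeholder with the sequence of audio frame embeddings."""
    if len(token_ids) != len(token_embeds):
        raise ValueError("token_ids and token_embeds must have the same length")
    positions = [i for i, t in enumerate(token_ids) if t == audio_token_id]
    if len(positions) != 1:
        raise ValueError("expected exactly one audio placeholder")
    width = len(token_embeds[0])
    if any(len(v) != width for v in audio_embeds):
        raise ValueError("audio embeddings must match the text embedding width")
    pos = positions[0]
    # Text before the placeholder, then every audio frame, then the remaining text.
    return token_embeds[:pos] + audio_embeds + token_embeds[pos + 1:]


# Usage: a three-token prompt whose middle token is the audio placeholder.
ids = [10, AUDIO_TOKEN_ID, 11]
text = [[0.1, 0.1], [0.0, 0.0], [0.2, 0.2]]
audio = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
merged = splice_audio_embeddings(ids, text, audio)
print(len(merged))  # 5: one text token, three audio frames, one text token
```

Because the audio frames are inserted in place, the language model receives one sequence in which text and audio embeddings share the same width and a single position order.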

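The Knowledge distillation feature can be sketched as a loss over next-token distributions. This sketch assumes the loss is a KL divergence from the text backbone's distribution (teacher) to Ultravox's distribution (student), averaged over supervised positions. The source only says the outputs are aligned with the Llama backbone, so the choice of KL, its direction, and the function names are illustrative assumptions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum logit first so exp() cannot overflow.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) for a single next-token distribution."""
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary")
    t_log = log_softmax(teacher_logits)
    s_log = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))


def distillation_loss(teacher: List[List[float]], student: List[List[float]]) -> float:
    """Mean per-position KL divergence over the supervised positions."""
    # Assumption: teacher rows come from the text-only Llama backbone and student rows
    # from the audio-conditioned model, aligned position by position.
    if not teacher or len(teacher) != len(student):
        raise ValueError("need the same, non-zero number of positions")
    return sum(distillation_kl(t, s) for t, s in zip(teacher, student)) / len(teacher)
```

A loss of zero means the student reproduces the teacher's distribution at every position; in actual training such a term would be computed on tensors in a deep learning framework rather than in plain Python.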
Model Capabilities

Speech recognition
Text generation
Voice-to-voice translation
Spoken audio analysis

Use Cases

Voice agent
Voice assistant
Acts as a voice assistant to answer user questions.
Translation
Voice-to-voice translation
Translates speech in one language into text in another language; a separate text-to-speech step can turn that text back into speech.
English-German translation BLEU 25.47, Spanish-English translation BLEU 37.11
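To make the BLEU figures concrete, here is a small corpus-level BLEU sketch: clipped n-gram precision up to 4-grams, geometric mean, brevity penalty, scaled to 0-100. It is a simplified illustration with one reference per sentence, whitespace tokenization, and no smoothing. Reported scores are normally produced by a standard tool with its own tokenization, so this hypothetical `corpus_bleu` helper is not expected to reproduce 25.47 or 37.11 exactly.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus BLEU (0-100) with one reference per hypothesis, uniform weights, no smoothing."""
    if len(hypotheses) != len(references):
        raise ValueError("need one reference per hypothesis")
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # naive whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(round(score, 1))  # about 53.7
```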
Speech recognition
Automatic speech recognition
Converts voice content into text.
LibriSpeech clean test set WER 4.45
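Word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below computes it with the standard dynamic-programming recurrence. It applies no text normalization (casing, punctuation), which published evaluations usually do, and `word_error_rate` is a hypothetical helper. It returns a fraction, so the 4.45 figure above, presumably a percentage, would correspond to about 0.0445.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev holds the previous DP row: distances from ref[:i-1] to every prefix hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from the output
                curr[j - 1] + 1,     # insertion: extra word in the output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
```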