
Ultravox v0.4 Llama 3.1 70B

Developed by fixie-ai
Ultravox is a multimodal speech large language model built on the pre-trained Llama 3.1-70B-Instruct and Whisper-medium backbones, capable of receiving both speech and text as input simultaneously.
Downloads: 79
Release Time: 9/10/2024

Model Overview

Ultravox is a multimodal model capable of simultaneously receiving both speech and text as input (e.g., text system prompts and speech user messages). The model input is a text prompt containing a special pseudo-token `<|audio|>`, which the model processor replaces with embeddings generated from the input audio.
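A minimal usage sketch is shown below, assuming the Hugging Face repo id `fixie-ai/ultravox-v0_4-llama-3_1-70b` and the `transformers` pipeline interface published with the checkpoint; the audio file name and argument values are illustrative only.

```python
import transformers
import librosa

# Load the multimodal pipeline; custom model code ships with the checkpoint.
pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4-llama-3_1-70b",
    trust_remote_code=True,
)

# 16 kHz mono audio for the Whisper-based encoder.
audio, sr = librosa.load("question.wav", sr=16000)

# A text system prompt plus a spoken user message; the processor replaces the
# <|audio|> pseudo-token in the chat template with the audio embeddings.
turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful voice assistant.",
    },
]

print(pipe({"audio": audio, "sampling_rate": sr, "turns": turns}, max_new_tokens=64))
```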

Model Features

Multimodal Input
Accepts speech and text in a single prompt, making it suitable for interactive voice scenarios.
High-Performance Speech Recognition
Based on the Whisper-medium encoder, providing high-quality speech recognition capabilities.
Knowledge Distillation
Trained with a knowledge distillation loss, Ultravox learns to match the logits of the text-only Llama backbone, as sketched below.
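A toy sketch of such a distillation loss, assuming a standard KL divergence between the speech-conditioned student logits and the text-conditioned teacher logits; this is illustrative only, not the actual Ultravox training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence pushing the speech-conditioned student distribution
    toward the text-conditioned Llama teacher distribution."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean reduction and T^2 scaling follow the usual KD convention.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy example: batch of 2, sequence length 8, vocabulary of 128 tokens.
student = torch.randn(2, 8, 128)
teacher = torch.randn(2, 8, 128)
print(distillation_loss(student, teacher))
```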

Model Capabilities

Speech Recognition
Text Generation
Multimodal Interaction
Speech-to-Speech Translation
Speech Audio Analysis

Use Cases

Speech Agents
Voice Assistant
Used as a voice agent to answer user questions.
Translation
Speech-to-Speech Translation
Supports multilingual speech translation tasks (see the prompt sketch at the end of this section).
English-to-German BLEU 30.30, Spanish-to-English BLEU 39.55
Speech Analysis
Speech Audio Analysis
Analyzes speech content to extract key information.
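For the translation and analysis use cases, the same interface can be steered with the system prompt. A hedged sketch, reusing the hypothetical `pipe`, `audio`, and `sr` objects from the earlier example:

```python
# Speech translation: the system prompt asks for German text only.
translation_turns = [
    {
        "role": "system",
        "content": "Translate the user's spoken English into German. "
                   "Reply with the German translation only.",
    },
]
print(pipe({"audio": audio, "sampling_rate": sr, "turns": translation_turns},
           max_new_tokens=128))
```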