Ultravox v0.4.1 Llama 3.1 70B

Developed by fixie-ai
Ultravox is a multimodal speech large language model, built upon the pre-trained Llama3.1-70B-Instruct and whisper-large-v3-turbo backbones, capable of receiving both speech and text as inputs.
Downloads 204
Release Time: 11/5/2024

Model Overview

Ultravox is a multimodal model that accepts speech and text together as input (for example, a text system prompt and a spoken user message). The input is a text prompt containing a special <|audio|> pseudo-token, which the model processor replaces with embeddings derived from the input audio.
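
The substitution described above can be pictured with a minimal sketch. It is illustrative only and is not the Ultravox processor: the function name splice_audio_embeddings, the toy 2-dimensional vectors, the placeholder spelling "<|audio|>", and the idea that one placeholder expands into one embedding per audio frame are all assumptions made for this example.

```python
from typing import Dict, List

Vector = List[float]

# Assumed spelling of the audio pseudo-token for this sketch.
AUDIO_PLACEHOLDER = "<|audio|>"


def splice_audio_embeddings(
    tokens: List[str],
    text_embeddings: Dict[str, Vector],
    audio_embeddings: List[Vector],
    placeholder: str = AUDIO_PLACEHOLDER,
) -> List[Vector]:
    """Expand the single placeholder token into the audio embedding frames.

    Hypothetical helper: text tokens are looked up in a toy embedding table,
    and the placeholder position is filled with the audio frames in order.
    """
    if tokens.count(placeholder) != 1:
        raise ValueError("this sketch expects exactly one audio placeholder")

    sequence: List[Vector] = []
    for token in tokens:
        if token == placeholder:
            # One placeholder becomes several vectors, one per audio frame.
            sequence.extend(audio_embeddings)
        else:
            sequence.append(text_embeddings[token])
    return sequence


# Toy inputs: three text tokens followed by the audio placeholder.
tokens = ["Answer", "briefly", ":", AUDIO_PLACEHOLDER]
text_embeddings = {
    "Answer": [0.1, 0.2],
    "briefly": [0.3, 0.4],
    ":": [0.5, 0.6],
}
audio_embeddings = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]

sequence = splice_audio_embeddings(tokens, text_embeddings, audio_embeddings)
print(len(sequence))  # 3 text vectors + 3 audio frames = 6
```

The language model then consumes the resulting sequence like any other embedded prompt, which is why text and speech can be mixed in one request.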

Model Features

Multimodal Input
Accepts both speech and text as input, processing text prompts in which the audio is represented by embeddings.
Multilingual Support
Supports speech and text processing in 15 languages, including Chinese, English, and Spanish.
Knowledge Distillation Training
Trained by supervised speech instruction fine-tuning with knowledge distillation, so that the model's output matches the logits of the text-based Llama backbone.
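
One common way to express "matching the backbone's output" in knowledge distillation is a KL-divergence loss between the teacher's and the student's next-token distributions. The sketch below shows that idea for a single token position in plain Python. Whether Ultravox uses exactly this loss is not stated here, and the names log_softmax and distillation_loss, along with the toy logits, are assumptions for illustration.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def distillation_loss(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) for a single next-token distribution.

    Hypothetical helper: the teacher stands in for the text-only backbone,
    the student for the speech-conditioned model.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary size")
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student_matched = [2.0, 0.5, -1.0]
student_off = [0.0, 1.5, 0.5]

loss_matched = distillation_loss(teacher, student_matched)
loss_off = distillation_loss(teacher, student_off)
print(loss_matched)  # identical logits give a loss of 0.0
print(loss_off)      # a positive value: the distributions disagree
```

Minimizing a loss of this kind pushes the speech-input model toward the answers the text backbone would give for the equivalent text prompt.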

Model Capabilities

Speech Recognition
Text Generation
Multilingual Translation
Speech Audio Analysis

Use Cases

Speech Agent
Voice Assistant
Used as a speech agent to answer user questions.
Speech Translation
Speech-to-Speech Translation
Supports speech translation between multiple languages.
Achieved a BLEU score of 19.64 on English-to-Arabic translation.