Ultravox v0.3

Developed by FriendliAI
Ultravox is a multimodal speech large language model based on Llama3.1-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Downloads 20
Release Time: 3/19/2025

Model Overview

Ultravox is a multimodal model that accepts speech and text inputs (for example, a text system prompt and a spoken user message) and generates text output. Future versions are planned to generate semantic and acoustic audio tokens for speech output.

Model Features

Multimodal Input
Accepts speech and text inputs together, merging the audio embedding vectors into the text prompt at a special <|audio|> pseudo-token.
Speech Understanding
Capable of understanding speech content and generating corresponding text outputs, suitable for tasks like voice agents and speech translation.
Knowledge Distillation
Trained with a knowledge distillation loss that aligns the model's outputs with the logits of the text-based Llama backbone network.
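
The two mechanisms described above can be illustrated with a minimal sketch. It rests on assumptions that are not confirmed by this page: that the <|audio|> placeholder is replaced in place by the audio embedding frames, and that the distillation loss is a KL divergence between the teacher's and the student's next-token distributions (a common choice, not necessarily the one used for Ultravox). The function names merge_audio_embeddings and distillation_loss are hypothetical, and plain Python lists stand in for tensors.

```python
import math

AUDIO_TOKEN = "<|audio|>"


def merge_audio_embeddings(tokens, text_embeddings, audio_embeddings):
    """Build the model input by splicing audio frames in at the placeholder.

    Assumption: each <|audio|> token is replaced by the full list of
    audio embedding vectors; every other token keeps its text embedding.
    """
    merged = []
    for token, embedding in zip(tokens, text_embeddings):
        if token == AUDIO_TOKEN:
            merged.extend(audio_embeddings)
        else:
            merged.append(embedding)
    return merged


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) for one next-token distribution.

    Assumption: the teacher is the text-only backbone and the student is
    the speech model; the loss is zero when both produce the same logits.
    """
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
```

For example, merging a three-token prompt that contains one placeholder with two audio frames yields a four-vector input sequence, and the loss shrinks toward zero as the student's logits approach the teacher's.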

Model Capabilities

Speech Recognition
Text Generation
Speech-to-Text Translation
Speech Analysis

Use Cases

Voice Agents
Voice Assistant
Acts as a voice agent to answer user questions, providing a friendly and helpful interaction experience.
Speech Translation
English-German Translation
Translates English speech into German text.
BLEU score 22.68
Spanish-English Translation
Translates Spanish speech into English text.
BLEU score 24.10
Speech Recognition
LibriSpeech Test
Performs speech recognition on the LibriSpeech clean test set.
WER 6.67
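
As a reminder of what the WER figure measures, here is a minimal sketch of word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name word_error_rate is hypothetical, and the whitespace tokenization is an assumption; real evaluation pipelines usually normalize casing and punctuation first, and typically report the result as a percentage.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion over six reference words, so the function returns 2/6.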