Ultravox v0.6 Gemma 3 27B

Developed by fixie-ai
Ultravox is a multimodal speech large language model that accepts both speech and text as input, making it well suited to speech interaction scenarios.
Downloads 641
Release Time: 6/20/2025

Model Overview

Ultravox is built around a pre-trained large language model (such as Llama, Gemma, or Qwen) combined with a speech encoder. It understands speech input and generates text, which makes it suitable for scenarios such as speech agents and speech translation.

Model Features

Multimodal input support
Accepts speech and text as input; speech is passed in through a special <|audio|> pseudo-token in the text prompt
Language performance optimization
The v0.6 series is trained on Hindi speech data, which significantly improves its Hindi speech understanding
Enhanced noise resistance
Training on a noise dataset improves robustness to noise and lets the model recognize noisy audio
Future speech output support
The developers plan to expand the vocabulary so the model can generate semantic and acoustic audio tokens, enabling speech output
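
The <|audio|> pseudo-token mentioned under "Multimodal input support" can be pictured as a placeholder that gets swapped for audio-derived embeddings before the language model runs. The following is a conceptual sketch only, not Ultravox's actual implementation: the function names (splice_audio, toy_embed) and the 2-dimensional toy embeddings are hypothetical and chosen purely for illustration.

```python
AUDIO_TOKEN = "<|audio|>"


def splice_audio(prompt_tokens, embed_text, audio_embeddings):
    """Replace the single <|audio|> placeholder with audio embeddings.

    prompt_tokens:    list of string tokens, containing AUDIO_TOKEN exactly once
    embed_text:       callable mapping a text token to an embedding (list of floats)
    audio_embeddings: list of embeddings produced by a speech encoder
    """
    if prompt_tokens.count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one <|audio|> placeholder")
    merged = []
    for tok in prompt_tokens:
        if tok == AUDIO_TOKEN:
            # The placeholder expands into one embedding per audio frame.
            merged.extend(audio_embeddings)
        else:
            merged.append(embed_text(tok))
    return merged


# Toy stand-ins (hypothetical): a real system would use the LLM's embedding
# table and the output of a speech encoder instead.
def toy_embed(token):
    return [float(len(token)), 0.0]


toy_audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tokens = ["Transcribe", ":", AUDIO_TOKEN, "please"]
merged = splice_audio(tokens, toy_embed, toy_audio)
print(len(merged))  # three text embeddings plus three audio embeddings
```

The point of the design is that the language model sees one ordinary sequence of embeddings, so text and speech can be mixed freely in a single prompt.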

Model Capabilities

Speech understanding
Text generation
Speech-to-text translation
Speech audio analysis
Noise recognition

Use Cases

Speech interaction
Speech agent
Acts as an intelligent agent that understands speech input
Language translation
Speech-to-text translation
Translates speech in one language into text in another language
On the covost2 dataset it reaches, for example, a BLEU score of 12.94 for English to Arabic
Audio analysis
Noise detection
Identifies whether the input audio contains clear speech or only noise
Recall reaches 97.45% on the musan_noise dataset
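
For readers unfamiliar with the metric, recall for noise detection is the fraction of truly noisy clips that the model flags as noise: TP / (TP + FN). The sketch below computes it on a handful of made-up labels; the data and the "noise"/"speech" label names are hypothetical and are not taken from musan_noise or from the model's evaluation.

```python
def recall(labels, predictions, positive="noise"):
    """Return TP / (TP + FN) for the given positive class."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == positive and p == positive)
    false_neg = sum(1 for y, p in pairs if y == positive and p != positive)
    if true_pos + false_neg == 0:
        raise ValueError("no positive examples in labels")
    return true_pos / (true_pos + false_neg)


# Illustrative data only: four clips are truly noise, one is speech.
labels = ["noise", "noise", "speech", "noise", "noise"]
preds = ["noise", "speech", "speech", "noise", "noise"]
score = recall(labels, preds)
print(score)  # 3 of the 4 noise clips were detected, so 0.75
```

Note that recall alone does not show how often clear speech is mistakenly flagged as noise; that would be measured by precision.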