🚀 Ultravox Model Card
Ultravox is a multimodal Speech LLM. It combines a pretrained [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) backbone with a whisper-large-v3-turbo audio encoder. It handles both speech and text inputs, offering capabilities like voice-based interactions and speech-to-speech translation.
For the GitHub repo and more information, visit https://ultravox.ai.
📚 Documentation
Model Details
Model Description
Ultravox is a multimodal model that accepts both speech and text as input. For example, it can take a text system prompt and a voice user message. The text prompt contains a special `<|audio|>` pseudo-token, which the model processor replaces with embeddings derived from the input audio. Using the merged embeddings as input, the model generates output text as usual.
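To make the role of the `<|audio|>` pseudo-token concrete, the sketch below builds a chat prompt that carries the placeholder using the standard `transformers` chat-template API. The model ID is the one used in the usage example below; the exact prompt construction inside the Ultravox processor may differ, so treat this as illustrative.

```python
# Minimal sketch: where the <|audio|> pseudo-token sits in the prompt.
# Assumes the model's tokenizer exposes a chat template (it wraps Llama 3.1).
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "fixie-ai/ultravox-v0_4_1-llama-3_1-70b", trust_remote_code=True
)

turns = [
    {"role": "system", "content": "You are a friendly and helpful character."},
    # The user turn carries only the placeholder; at inference time the
    # processor swaps it for embeddings computed from the input audio.
    {"role": "user", "content": "<|audio|>"},
]

prompt = tokenizer.apply_chat_template(turns, tokenize=False, add_generation_prompt=True)
print(prompt)  # the rendered prompt still contains the literal <|audio|> marker
```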
In a future version of Ultravox, we plan to expand the token vocabulary to support the generation of semantic and acoustic audio tokens. These tokens can then be fed to a vocoder to produce voice output. No preference tuning has been applied to this version of the model.
- Developed by: Fixie.ai
- License: MIT
Model Sources
- Repository: https://ultravox.ai
- Demo: See repo
Usage
Think of the model as an LLM that can also process and understand speech. It can be used as a voice agent, for speech-to-speech translation, and for analyzing spoken audio.
To use the model, try the following code:
💻 Usage Examples
Basic Usage
```python
import transformers
import numpy as np
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-70b', trust_remote_code=True)

# Load the input audio at 16 kHz, the sample rate the model expects.
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
# The pipeline merges the audio with the turns and generates a text reply.
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
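Earlier turns can also be included as text context. The sketch below continues the snippet above and assumes the pipeline appends a user turn containing `<|audio|>` for the supplied audio when the conversation does not already end with one; check the pipeline code in the repo for the exact behavior.

```python
# Hedged sketch: prior text turns as context, newest user turn is the audio clip.
turns = [
    {"role": "system", "content": "You are a friendly and helpful character. You love to answer questions for people."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```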
Training Details
Training Data
The training dataset is a combination of ASR datasets, extended with continuations generated by Llama 3.1 8B, and speech translation datasets. This combination leads to a modest improvement in translation evaluations.
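The continuation step can be pictured roughly as below. This is a hypothetical sketch, not the actual data pipeline: the prompt wording and generation settings are assumptions; only the choice of `meta-llama/Llama-3.1-8B-Instruct` and the idea of continuing ASR transcripts follow from the description above.

```python
# Hypothetical sketch: extend an ASR transcript with an LLM-generated continuation.
import transformers

generator = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

transcript = "I've been thinking about taking a trip to the mountains this summer"
messages = [
    {"role": "system", "content": "Continue the user's utterance naturally."},  # assumed prompt
    {"role": "user", "content": transcript},
]
continuation = generator(messages, max_new_tokens=64)[0]["generated_text"][-1]["content"]

# The (audio, transcript + continuation) pair then becomes a training example:
# the model hears the audio and learns to produce the textual continuation.
```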
Training Procedure
The model uses supervised speech instruction finetuning via knowledge distillation. For more details, refer to the [training code in the Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).
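As a rough illustration of the knowledge-distillation objective, the sketch below pushes the speech-conditioned model's next-token distribution toward that of the text-only backbone on the same transcript. Tensor names and shapes are assumptions; the actual loss and its weighting live in the linked training code.

```python
# Minimal sketch of a KL-divergence distillation loss (assumed names/shapes).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch.

    student_logits: speech model output on the audio-embedded prompt
    teacher_logits: frozen text LLM output on the text transcript
    Both are (batch, seq_len, vocab_size), aligned on the response tokens.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```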
Training Hyperparameters
- Training regime: BF16 mixed precision training
- Hardware used: 8x H100 GPUs
Speeds, Sizes, Times
When using an A100-40GB GPU and a Llama 3.1 8B backbone, the current version of Ultravox has a time-to-first-token (TTFT) of approximately 150 ms and generates roughly 50-100 tokens per second when invoked with audio content.
Check out the audio tab on TheFastest.ai for daily benchmarks and comparisons with other existing models.
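As a rough local sanity check of throughput, the sketch below (continuing the usage snippet above) times one pipeline call and divides by the number of generated tokens. It is a simplification: there is no token streaming, so it measures average tokens per second rather than TTFT, and it assumes the call returns the generated text as a string and that the pipeline exposes its tokenizer as `pipe.tokenizer`.

```python
# Rough throughput check: average tokens per second for one generation.
import time

start = time.perf_counter()
output = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Count generated tokens with the pipeline's tokenizer (assumed attribute).
n_tokens = len(pipe.tokenizer(output)["input_ids"])
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tokens/s")
```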
Evaluation
BLEU scores by speech-translation language pair (higher is better):

| | Ultravox 0.4 70B | Ultravox 0.4.1 70B |
|---|---|---|
| en_ar | 14.97 | 19.64 |
| en_de | 30.30 | 32.47 |
| es_en | 39.55 | 40.76 |
| ru_en | 44.16 | 45.07 |
| en_ca | 35.02 | 37.58 |
| zh_en | 12.16 | 17.98 |
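The metric reported above is BLEU (see Additional Information below). A minimal sketch of computing corpus BLEU over model outputs with `sacrebleu` follows; the actual evaluation harness (prompts, language pairs, text normalization) lives in the Ultravox repo.

```python
# Minimal sketch: corpus BLEU over generated translations using sacrebleu.
import sacrebleu

hypotheses = [
    "The weather is nice today.",     # model output for clip 1
    "I would like a cup of coffee.",  # model output for clip 2
]
references = [
    "The weather is nice today.",
    "I want a cup of coffee.",
]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```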
Additional Information
| Property | Details |
|---|---|
| Library Name | transformers |
| Datasets | fixie-ai/librispeech_asr, fixie-ai/common_voice_17_0, fixie-ai/peoples_speech, fixie-ai/gigaspeech, fixie-ai/multilingual_librispeech, fixie-ai/wenetspeech, fixie-ai/covost2 |
| Metrics | bleu |
| Pipeline Tag | audio-text-to-text |
| Supported Languages | ar, de, en, es, fr, hi, it, ja, nl, pt, ru, sv, tr, uk, zh |
| License | MIT |