🚀 Ultravox Model Card
Ultravox is a multimodal Speech LLM. It combines a pretrained [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) backbone with a whisper-large-v3-turbo audio encoder. It handles both speech and text inputs, offering capabilities like voice-based interactions and speech-to-speech translation.
For the GitHub repo and more information, visit https://ultravox.ai.
📚 Documentation
Model Details
Model Description
Ultravox is a multimodal model that accepts both speech and text as input. For example, it can take a text system prompt and a voice user message. The text prompt contains a special `<|audio|>` pseudo-token, which the model processor replaces with embeddings derived from the input audio. Using the merged embeddings as input, the model generates output text as usual.
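To make the role of the `<|audio|>` pseudo-token concrete, the sketch below builds a chat prompt that carries the placeholder using the standard `transformers` chat-template API. The model ID is the one used in the usage example below; the exact prompt construction inside the Ultravox processor may differ, so treat this as illustrative.

```python
# Minimal sketch: where the <|audio|> pseudo-token sits in the prompt.
# Assumes the model's tokenizer exposes a chat template (it wraps Llama 3.1).
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "fixie-ai/ultravox-v0_4_1-llama-3_1-70b", trust_remote_code=True
)

turns = [
    {"role": "system", "content": "You are a friendly and helpful character."},
    # The user turn carries only the placeholder; at inference time the
    # processor swaps it for embeddings computed from the input audio.
    {"role": "user", "content": "<|audio|>"},
]

prompt = tokenizer.apply_chat_template(turns, tokenize=False, add_generation_prompt=True)
print(prompt)  # the rendered prompt still contains the literal <|audio|> marker
```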
In a future version of Ultravox, we plan to expand the token vocabulary to support the generation of semantic and acoustic audio tokens. These tokens can then be fed to a vocoder to produce voice output. No preference tuning has been applied to this version of the model.
- Developed by: Fixie.ai
- License: MIT
Model Sources
- Repository: https://ultravox.ai
- Demo: See repo
Usage
Think of the model as an LLM that can also process and understand speech. It can be used as a voice agent, for speech-to-speech translation, and for analyzing spoken audio.
To use the model, try the following code:
💻 Usage Examples
Basic Usage
```python
import transformers
import numpy as np
import librosa

pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-70b', trust_remote_code=True)

# Load the input audio at 16 kHz, the sample rate the model expects.
path = "<path-to-input-audio>"
audio, sr = librosa.load(path, sr=16000)

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
# The pipeline merges the audio with the turns and generates a text reply.
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
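Earlier turns can also be included as text context. The sketch below continues the snippet above and assumes the pipeline appends a user turn containing `<|audio|>` for the supplied audio when the conversation does not already end with one; check the pipeline code in the repo for the exact behavior.

```python
# Hedged sketch: prior text turns as context, newest user turn is the audio clip.
turns = [
    {"role": "system", "content": "You are a friendly and helpful character. You love to answer questions for people."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```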
Training Details
Training Data
The training dataset is a combination of ASR datasets, extended with continuations generated by Llama 3.1 8B, and speech translation datasets. This combination leads to a modest improvement in translation evaluations.
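The continuation step can be pictured roughly as below. This is a hypothetical sketch, not the actual data pipeline: the prompt wording and generation settings are assumptions; only the choice of `meta-llama/Llama-3.1-8B-Instruct` and the idea of continuing ASR transcripts follow from the description above.

```python
# Hypothetical sketch: extend an ASR transcript with an LLM-generated continuation.
import transformers

generator = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

transcript = "I've been thinking about taking a trip to the mountains this summer"
messages = [
    {"role": "system", "content": "Continue the user's utterance naturally."},  # assumed prompt
    {"role": "user", "content": transcript},
]
continuation = generator(messages, max_new_tokens=64)[0]["generated_text"][-1]["content"]

# The (audio, transcript + continuation) pair then becomes a training example:
# the model hears the audio and learns to produce the textual continuation.
```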
Training Procedure
The model uses supervised speech instruction finetuning via knowledge distillation. For more details, refer to the [training code in the Ultravox repo](https://github.com/fixie-ai/ultravox/blob/main/ultravox/training/train.py).
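As a rough illustration of the knowledge-distillation objective, the sketch below pushes the speech-conditioned model's next-token distribution toward that of the text-only backbone on the same transcript. Tensor names and shapes are assumptions; the actual loss and its weighting live in the linked training code.

```python
# Minimal sketch of a KL-divergence distillation loss (assumed names/shapes).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch.

    student_logits: speech model output on the audio-embedded prompt
    teacher_logits: frozen text LLM output on the text transcript
    Both are (batch, seq_len, vocab_size), aligned on the response tokens.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```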
Training Hyperparameters
- Training regime: BF16 mixed precision training
- Hardware used: 8x H100 GPUs
Speeds, Sizes, Times
When using an A100-40GB GPU and a Llama 3.1 8B backbone, the current version of Ultravox has a time-to-first-token (TTFT) of approximately 150 ms and generates roughly 50-100 tokens per second when invoked with audio content.
Check out the audio tab on TheFastest.ai for daily benchmarks and comparisons with other existing models.
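As a rough local sanity check of throughput, the sketch below (continuing the usage snippet above) times one pipeline call and divides by the number of generated tokens. It is a simplification: there is no token streaming, so it measures average tokens per second rather than TTFT, and it assumes the call returns the generated text as a string and that the pipeline exposes its tokenizer as `pipe.tokenizer`.

```python
# Rough throughput check: average tokens per second for one generation.
import time

start = time.perf_counter()
output = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Count generated tokens with the pipeline's tokenizer (assumed attribute).
n_tokens = len(pipe.tokenizer(output)["input_ids"])
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tokens/s")
```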
Evaluation
BLEU scores by speech-translation language pair (higher is better):

| | Ultravox 0.4 70B | Ultravox 0.4.1 70B |
|---|---|---|
| en_ar | 14.97 | 19.64 |
| en_de | 30.30 | 32.47 |
| es_en | 39.55 | 40.76 |
| ru_en | 44.16 | 45.07 |
| en_ca | 35.02 | 37.58 |
| zh_en | 12.16 | 17.98 |
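The metric reported above is BLEU (see Additional Information below). A minimal sketch of computing corpus BLEU over model outputs with `sacrebleu` follows; the actual evaluation harness (prompts, language pairs, text normalization) lives in the Ultravox repo.

```python
# Minimal sketch: corpus BLEU over generated translations using sacrebleu.
import sacrebleu

hypotheses = [
    "The weather is nice today.",     # model output for clip 1
    "I would like a cup of coffee.",  # model output for clip 2
]
references = [
    "The weather is nice today.",
    "I want a cup of coffee.",
]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```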
Additional Information
| Property | Details |
|---|---|
| Library Name | transformers |
| Datasets | fixie-ai/librispeech_asr, fixie-ai/common_voice_17_0, fixie-ai/peoples_speech, fixie-ai/gigaspeech, fixie-ai/multilingual_librispeech, fixie-ai/wenetspeech, fixie-ai/covost2 |
| Metrics | bleu |
| Pipeline Tag | audio-text-to-text |
| Supported Languages | ar, de, en, es, fr, hi, it, ja, nl, pt, ru, sv, tr, uk, zh |
| License | MIT |