Voila-autonomous-preview Open-source Speech-Language Model - Supports Real-time Multilingual Speech Interaction and Enhances Human-Computer Experience

Voila Autonomous Preview

Developed by maitrix-org

Voila is a large family of speech-language foundation models designed to enhance human-computer interaction, supporting real-time, low-latency voice interaction and multilingual processing.

Text-to-Audio

Transformers

Supports Multiple Languages

Open Source License: MIT

#Real-time voice interaction #Multilingual speech synthesis #End-to-end audio processing

Downloads 332

Release Time: 3/18/2025

Model Overview

Voila adopts an innovative end-to-end model design and hierarchical Transformer architecture, supporting automatic speech recognition (ASR), text-to-speech (TTS), and speech translation in six languages, delivering high-fidelity, low-latency voice interaction experiences.

Model Features

High-Fidelity, Low-Latency

Supports real-time streaming audio processing with latency as low as 195 milliseconds, which is faster than the average human response time.

Integration of Speech and Language Modeling

Efficiently integrates speech and language modeling capabilities to deliver rich interactive experiences.

Multi-Voice Support

Offers millions of pre-built and customizable voices, enabling quick voice switching during conversations.

Multi-Task Support

A unified model supporting multiple audio tasks, including ASR, TTS, and speech translation.

Model Capabilities

Automatic Speech Recognition (ASR)

Text-to-Speech (TTS)

Speech Translation

Real-time Voice Interaction

Multilingual Processing

Use Cases

Voice Interaction

Real-Time Voice Chat

Supports low-latency real-time voice chat, suitable for customer service, virtual assistants, and other scenarios.

Latency as low as 195 milliseconds, delivering natural and smooth interaction experiences.

Multilingual Processing

Multilingual Speech Translation

Supports speech translation in six languages, suitable for cross-language communication scenarios.

For ASR, achieves a word error rate (WER) of 4.8% on LibriSpeech test-clean without using the LibriSpeech train split.

🚀 Voila: Voice-Language Foundation Models

Voila is a new family of large voice-language foundation models that aims to elevate human-AI interaction. It moves past the constraints of traditional voice AI systems, such as high latency, loss of vocal nuance, and mechanical responses, by combining an end-to-end model design with a novel hierarchical Transformer architecture. This enables real-time, autonomous, and rich voice interaction with low latency, and the models perform well across a range of audio tasks.

✨ Features

High-fidelity, low-latency, real-time streaming audio processing
Effective integration of voice and language modeling capabilities
Millions of pre-built and custom voices, fast voice switching during conversation
Unified model for various audio tasks

💻 Usage Examples

Basic Usage

CLI demo

for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto

Gradio demo

python gradio_demo.py

For more information, please refer to the code repository.
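The CLI demo invocations above can also be assembled programmatically, for example to script runs over several models. The following is a minimal sketch, not part of the Voila repository: `build_infer_command` is a hypothetical helper, and it only knows the flags and the three task types that appear in the demo (`chat_tito`, `chat_aiao`, `chat_aiao_auto`). The repository may support other options.

```python
import shlex
from typing import List, Optional

# Task types shown in the CLI demo; this set is an assumption drawn from
# the demo only, and the repository may define more.
KNOWN_TASK_TYPES = {"chat_tito", "chat_aiao", "chat_aiao_auto"}


def build_infer_command(
    model_name: str,
    task_type: str,
    input_text: Optional[str] = None,
    input_audio: Optional[str] = None,
    instruction: str = "",
) -> List[str]:
    """Build an argument list for infer.py (hypothetical helper)."""
    if task_type not in KNOWN_TASK_TYPES:
        raise ValueError(f"unknown task type: {task_type}")
    # Each demo call passes exactly one of --input-text / --input-audio.
    if (input_text is None) == (input_audio is None):
        raise ValueError("pass exactly one of input_text or input_audio")

    cmd = ["python", "infer.py",
           "--model-name", model_name,
           "--instruction", instruction]
    if input_text is not None:
        cmd += ["--input-text", input_text]
    else:
        cmd += ["--input-audio", input_audio]
    cmd += ["--task-type", task_type]
    return cmd


if __name__ == "__main__":
    cmd = build_infer_command(
        "maitrix-org/Voila-chat", "chat_tito", input_text="Hello"
    )
    # Print a shell-quoted form for inspection.
    print(shlex.join(cmd))
```

To actually run a command, the list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `infer.py`.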

📚 Documentation

Foundation Models

Property	Details
Model Type	Voila offers multiple models, including Voila-base, Voila-Chat, Voila-Autonomous (preview), Voila-Audio-alpha, and Voila-Tokenizer.
Download Link	Voila-base: https://huggingface.co/maitrix-org/Voila-base; Voila-Chat: https://huggingface.co/maitrix-org/Voila-chat; Voila-Autonomous (preview): https://huggingface.co/maitrix-org/Voila-autonomous-preview; Voila-Audio-alpha: https://huggingface.co/maitrix-org/Voila-audio-alpha; Voila-Tokenizer: https://huggingface.co/maitrix-org/Voila-Tokenizer

Datasets

Property	Details
Datasets	Two datasets are published: Voila Benchmark and Voila Voice Library.
Download Link	Voila Benchmark: https://huggingface.co/datasets/maitrix-org/Voila-Benchmark; Voila Voice Library: https://huggingface.co/datasets/maitrix-org/Voila-million-voice
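The model and dataset links in the tables above can be fetched with the `huggingface_hub` library instead of a browser. The sketch below is one possible approach rather than an official download script. The repository IDs are read off the listed URLs with the stray spaces removed, `resolve_repo` and `download` are hypothetical helper names, and `download` needs `huggingface_hub` installed and network access.

```python
from typing import Tuple

# Repository IDs taken from the download links listed above.
MODEL_REPOS = {
    "Voila-base": "maitrix-org/Voila-base",
    "Voila-Chat": "maitrix-org/Voila-chat",
    "Voila-Autonomous (preview)": "maitrix-org/Voila-autonomous-preview",
    "Voila-Audio-alpha": "maitrix-org/Voila-audio-alpha",
    "Voila-Tokenizer": "maitrix-org/Voila-Tokenizer",
}
DATASET_REPOS = {
    "Voila Benchmark": "maitrix-org/Voila-Benchmark",
    "Voila Voice Library": "maitrix-org/Voila-million-voice",
}


def resolve_repo(name: str) -> Tuple[str, str]:
    """Map a display name to (repo_id, repo_type); raises KeyError if unknown."""
    if name in MODEL_REPOS:
        return MODEL_REPOS[name], "model"
    if name in DATASET_REPOS:
        return DATASET_REPOS[name], "dataset"
    raise KeyError(name)


def download(name: str) -> str:
    """Download a full repository snapshot and return its local path."""
    repo_id, repo_type = resolve_repo(name)
    # Imported lazily so resolve_repo works without the dependency installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, repo_type=repo_type)


if __name__ == "__main__":
    print(resolve_repo("Voila-Chat"))
    print(resolve_repo("Voila Voice Library"))
```

Calling `download("Voila-Tokenizer")` would place the files in the local Hugging Face cache. The voice library may be large, so it is worth checking the repository size before fetching it.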

Benchmark

1. Voila Benchmark

We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8K. We compare our results with SpeechGPT and Moshi.

Model	Voila Benchmark
SpeechGPT	13.29
Moshi	11.45
Voila	30.56

(higher is better)

For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").

2. Evaluation of ASR

As Voila supports multiple tasks, including automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering, we also evaluate its ASR and TTS performance. For ASR, we measure word error rate (WER) on the LibriSpeech test-clean dataset. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. With the LibriSpeech train split included in training, Voila achieves a WER of 2.7%.

Model	LibriSpeech test-clean (WER, %)
Whisper large v2	2.7
Whisper large v3	2.2
FastConformer	3.6
VoxtLM	2.7
Moshi	5.7
Voila (w/o LibriSpeech train split)	4.8
Voila (with LibriSpeech train split)	2.7

(lower is better)
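For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that definition and not the evaluation code behind the table. It assumes whitespace tokenization with no text normalization, whereas real evaluations typically normalize case and punctuation and aggregate errors over the whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 4.8% corresponds to roughly five word errors per hundred reference words.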

3. Evaluation of TTS

For TTS, we follow the evaluation protocol proposed in VALL-E, which transcribes the generated audio with HuBERT-Large and measures WER on the transcripts. Voila once again leads with a WER of 3.2% (2.8% when using LibriSpeech training data).

Model	LibriSpeech test-clean (WER, %)
YourTTS	7.7
VALL-E	5.9
Moshi	4.7
Voila (w/o LibriSpeech train split)	3.2
Voila (with LibriSpeech train split)	2.8

(lower is better)

📄 License

This project is licensed under the MIT license.

📄 Citation

If you find our work helpful, please cite us.

@article{voila2025,
  author        = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title         = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint        = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2025}
}
