Voila: Voice-Language Foundation Models
Voila is a new family of large voice-language foundation models that aims to elevate human-AI interaction. Traditional voice AI systems suffer from high latency, loss of vocal nuances, and mechanical responses. Voila addresses these constraints with an end-to-end model design and a novel hierarchical Transformer architecture, enabling real-time, autonomous, and rich voice interactions with low latency while also performing well on a range of audio tasks.

Features
- High-fidelity, low-latency, real-time streaming audio processing
- Effective integration of voice and language modeling capabilities
- Millions of pre-built and custom voices, with fast voice switching during conversation
- Unified model for various audio tasks
Usage Examples
Basic Usage
CLI demo
    for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
        # Text chat
        python infer.py \
            --model-name "${model_name}" \
            --instruction "" \
            --input-text "Hello" \
            --task-type chat_tito
        # Voice chat
        python infer.py \
            --model-name "${model_name}" \
            --instruction "" \
            --input-audio "examples/test1.mp3" \
            --task-type chat_aiao
    done

    # Autonomous mode
    python infer.py \
        --model-name "maitrix-org/Voila-autonomous-preview" \
        --instruction "" \
        --input-audio "examples/test_autonomous1.mp3" \
        --task-type chat_aiao_auto

Gradio demo

    python gradio_demo.py
For more information, please refer to the code repository.
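The CLI calls above differ only in the model name, the input flag, and the task type. As a rough sketch, the same commands could be assembled from Python. The helper `build_infer_command` is hypothetical and not part of the Voila repository. It uses only the flags and task types shown in the CLI demo, and the mapping from task type to input flag is inferred from those examples rather than from the code.

```python
import shlex

# Task types copied from the CLI demo. Which input flag each one takes is
# inferred from those examples and is an assumption, not documented behavior.
TEXT_TASKS = {"chat_tito"}
AUDIO_TASKS = {"chat_aiao", "chat_aiao_auto"}


def build_infer_command(model_name, task_type, input_value, instruction=""):
    """Return the argument list for one infer.py call (hypothetical helper)."""
    if task_type in TEXT_TASKS:
        input_flag = "--input-text"
    elif task_type in AUDIO_TASKS:
        input_flag = "--input-audio"
    else:
        raise ValueError(f"unknown task type: {task_type!r}")
    return [
        "python", "infer.py",
        "--model-name", model_name,
        "--instruction", instruction,
        input_flag, input_value,
        "--task-type", task_type,
    ]


if __name__ == "__main__":
    # Print the shell-quoted command instead of running it.
    cmd = build_infer_command("maitrix-org/Voila-chat", "chat_tito", "Hello")
    print(shlex.join(cmd))
```

To actually run a command, the returned list can be passed to `subprocess.run(cmd, check=True)` from the root of the code repository.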
Documentation
Foundation Models
Voila offers multiple models: Voila-base, Voila-Chat, Voila-Autonomous (preview), Voila-Audio-alpha, and Voila-Tokenizer.

| Model | Download Link |
| --- | --- |
| Voila-base | https://huggingface.co/maitrix-org/Voila-base |
| Voila-Chat | https://huggingface.co/maitrix-org/Voila-chat |
| Voila-Autonomous (preview) | https://huggingface.co/maitrix-org/Voila-autonomous-preview |
| Voila-Audio-alpha | https://huggingface.co/maitrix-org/Voila-audio-alpha |
| Voila-Tokenizer | https://huggingface.co/maitrix-org/Voila-Tokenizer |
Datasets
Two datasets are published: Voila Benchmark and Voila Voice Library.

| Dataset | Download Link |
| --- | --- |
| Voila Benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
| Voila Voice Library | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |
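The model and dataset repositories listed above can also be fetched programmatically. The following is a minimal sketch that assumes the `huggingface_hub` package is installed. The helper names `repo_url` and `download_repo` are illustrative and not part of Voila. Whether `infer.py` reads files from a pre-downloaded directory is not stated here, so treat this only as a way to obtain local copies.

```python
# Repository IDs copied from the download links above.
MODEL_REPOS = [
    "maitrix-org/Voila-base",
    "maitrix-org/Voila-chat",
    "maitrix-org/Voila-autonomous-preview",
    "maitrix-org/Voila-audio-alpha",
    "maitrix-org/Voila-Tokenizer",
]
DATASET_REPOS = [
    "maitrix-org/Voila-Benchmark",
    "maitrix-org/Voila-million-voice",
]


def repo_url(repo_id, repo_type="model"):
    """Build the Hugging Face page URL for a model or dataset repository."""
    prefix = "datasets/" if repo_type == "dataset" else ""
    return f"https://huggingface.co/{prefix}{repo_id}"


def download_repo(repo_id, repo_type="model", local_dir=None):
    """Download a full repository snapshot and return its local path."""
    # Imported lazily so repo_url works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id, repo_type=repo_type, local_dir=local_dir
    )


if __name__ == "__main__":
    # Only list the URLs here; downloading is left as an explicit step.
    for repo in MODEL_REPOS:
        print(repo_url(repo))
    for repo in DATASET_REPOS:
        print(repo_url(repo, repo_type="dataset"))
```

For example, `download_repo("maitrix-org/Voila-Benchmark", repo_type="dataset")` would fetch the benchmark files into the local Hugging Face cache.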
Benchmark
1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.

| Model | Voila Benchmark |
| --- | --- |
| SpeechGPT | 13.29 |
| Moshi | 11.45 |
| Voila | 30.56 |

(higher is better)
For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").
2. Evaluation of ASR
Because Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate its ASR and TTS performance.
For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as the metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When LibriSpeech training data is used, Voila achieves a WER of 2.7%.

| Model | LibriSpeech test-clean (WER, %) |
| --- | --- |
| Whisper large v2 | 2.7 |
| Whisper large v3 | 2.2 |
| FastConformer | 3.6 |
| VoxtLM | 2.7 |
| Moshi | 5.7 |
| Voila (w/o LibriSpeech train split) | 4.8 |
| Voila (with LibriSpeech train split) | 2.7 |

(lower is better)
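For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a generic illustration, not the evaluation script used for the numbers above. The lowercasing and whitespace tokenization are assumptions, and real evaluations typically apply more thorough text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

A WER of 4.8% therefore corresponds to roughly 5 word errors per 100 reference words. Because insertions count as errors, WER can exceed 100% for very poor outputs.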
3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed in Vall-E, which transcribes the generated audio using HuBERT-Large and reports the resulting WER.
Voila again achieves the lowest WER: 3.2%, or 2.8% when LibriSpeech training data is used.

| Model | LibriSpeech test-clean (WER, %) |
| --- | --- |
| YourTTS | 7.7 |
| Vall-E | 5.9 |
| Moshi | 4.7 |
| Voila (w/o LibriSpeech train split) | 3.2 |
| Voila (with LibriSpeech train split) | 2.8 |

(lower is better)
License
This project is licensed under the MIT license.
Citation
If you find our work helpful, please cite us.
    @article{voila2025,
      author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
      title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
      eprint = {2505.02707},
      archivePrefix = {arXiv},
      primaryClass = {cs.CL},
      year = {2025}
    }