Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture to achieve real-time, autonomous, and rich voice interactions with latency as low as 195 milliseconds. Combining advanced speech and language modeling techniques, Voila offers customizable, character-driven interaction experiences and excels at a range of audio tasks, from ASR and TTS to speech translation across six languages.
## Model Features

- **High fidelity, low latency**: achieves real-time streaming audio processing with latency as low as 195 milliseconds
- **Integrated speech and language modeling**: effectively combines speech and language modeling capabilities in a single architecture
- **Multi-voice support**: offers millions of pre-built and custom voices, enabling rapid voice switching during conversations
- **Unified model for multiple tasks**: a single model handles diverse audio tasks
## Model Capabilities

- Speech recognition
- Text-to-speech
- Speech translation
- Voice dialogue
- Audio understanding
## Use Cases

- **Human-computer interaction**: real-time voice dialogue enables low-latency, natural voice conversations; latency as low as 195 ms surpasses the average human response time
- **Speech processing**: multilingual speech translation across six languages
# Voila: Voice-Language Foundation Models
Voila is a new family of large voice-language foundation models that aims to elevate human-AI interaction experiences. Breaking free from the limitations of traditional voice AI systems, such as high latency, loss of vocal nuances, and mechanical responses, Voila uses an innovative end-to-end model design and a novel hierarchical Transformer architecture to enable real-time, autonomous, and rich voice interactions with extremely low latency.
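To make the hierarchical design concrete, here is a minimal structural sketch in PyTorch. It is not the released implementation: it assumes a layout in the spirit of recent hierarchical audio LMs, where a large temporal backbone runs once per audio frame and a small depth transformer predicts that frame's stack of audio codebook tokens; all layer sizes and module names are illustrative.

```python
# Structural sketch only (not the released implementation), assuming a layout in
# the spirit of recent hierarchical audio LMs: a large temporal backbone attends
# causally over frames, and a small "depth" transformer predicts the stack of
# audio codebook tokens for the next frame. All sizes are illustrative.
import torch
import torch.nn as nn

class HierarchicalVoiceLM(nn.Module):
    def __init__(self, d_model=1024, n_codebooks=4, codebook_size=2048):
        super().__init__()
        # Temporal backbone: one causal pass over past audio/text frames.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True),
            num_layers=12,
        )
        # Depth transformer: a much smaller model over one frame's codebooks.
        self.depth = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.n_codebooks = n_codebooks
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)
        )

    def forward(self, frame_embeddings):  # (batch, time, d_model)
        t = frame_embeddings.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        ctx = self.backbone(frame_embeddings, mask=causal)
        # Condition the depth model on the latest frame context and emit one
        # logit distribution per codebook for the next frame.
        last = ctx[:, -1:, :].expand(-1, self.n_codebooks, -1)
        h = self.depth(last)
        return [head(h[:, i]) for i, head in enumerate(self.heads)]
```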
We release the following two datasets: the Voila Benchmark and the Voila Voice Library. The Voila Benchmark is a novel speech evaluation benchmark, while the Voila Voice Library provides millions of pre-built and customizable voices.
Voila Voice Library: [https://huggingface.co/datasets/maitrix-org/Voila-million-voice](https://huggingface.co/datasets/maitrix-org/Voila-million-voice)
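As a quick-start illustration, the voice library can be browsed with the Hugging Face `datasets` package. This is a minimal sketch, not an official loader: the split name and record fields are assumptions about the dataset's schema.

```python
# Minimal sketch: browse the Voila Voice Library via the Hugging Face `datasets`
# package. Streaming avoids downloading millions of entries up front. The split
# name ("train") and the record fields are assumptions about the schema.
from datasets import load_dataset

voices = load_dataset("maitrix-org/Voila-million-voice", split="train", streaming=True)

# Peek at a few records to see which fields are available.
for i, record in enumerate(voices):
    print(sorted(record.keys()))
    if i >= 2:
        break
```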
## Benchmark
### 1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8K. We compare our results with SpeechGPT and Moshi; a construction sketch follows the table below.
| Model | Voila Benchmark (higher is better) |
| --- | --- |
| SpeechGPT | 13.29 |
| Moshi | 11.45 |
| Voila | 30.56 |
For detailed per-domain scores on the Voila Benchmark, please refer to our paper (Section 5.1, "Evaluation of Voila Benchmark").
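To make the construction procedure concrete, here is a hedged sketch of sampling prompts from the five source datasets with the Hugging Face `datasets` library. The dataset paths, config names, splits, field names, and sample counts are assumptions, not the paper's exact recipe, and some hub datasets may need extra loading arguments.

```python
# Hedged sketch: sample text prompts from the five source datasets used to build
# the Voila Benchmark. Paths, configs, splits, fields, and counts are assumptions;
# the paper's exact sampling protocol may differ.
import random
from datasets import load_dataset

SOURCES = {
    "mmlu": ("cais/mmlu", "all", "test", "question"),
    "math": ("hendrycks/competition_math", None, "test", "problem"),
    "humaneval": ("openai_humaneval", None, "test", "prompt"),
    "nq_open": ("nq_open", None, "validation", "question"),
    "gsm8k": ("gsm8k", "main", "test", "question"),
}

def sample_prompts(n_per_source=20, seed=0):
    rng = random.Random(seed)
    benchmark = []
    for name, (path, config, split, field) in SOURCES.items():
        ds = load_dataset(path, config, split=split) if config else load_dataset(path, split=split)
        for idx in rng.sample(range(len(ds)), n_per_source):
            benchmark.append({"source": name, "prompt": ds[idx][field]})
    return benchmark
```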
### 2. Evaluation of ASR
Since Voila supports multiple tasks, including automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering, we also evaluate ASR and TTS performance.

For ASR, we assess performance on the LibriSpeech test-clean dataset, using word error rate (WER) as the metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When both models use the LibriSpeech training data, Voila achieves an impressive WER of 2.7%. A sketch of the WER computation follows the table below.
| Model | LibriSpeech test-clean WER (%, lower is better) |
| --- | --- |
| Whisper large v2 | 2.7 |
| Whisper large v3 | 2.2 |
| FastConformer | 3.6 |
| VoxtLM | 2.7 |
| Moshi | 5.7 |
| Voila (w/o LibriSpeech train split) | 4.8 |
| Voila (with LibriSpeech train split) | 2.7 |
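For reference, WER can be computed with the `jiwer` package, as in this minimal sketch. The text normalization shown (lowercasing, punctuation stripping) is an assumption, since evaluation pipelines differ in how they normalize transcripts.

```python
# Minimal WER sketch using the `jiwer` package (pip install jiwer).
# WER = (substitutions + deletions + insertions) / reference word count.
# The normalization below is an assumption; ASR benchmarks vary on this point.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there would be stew for diner"

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.1%}")  # 1 substitution over 8 reference words -> 12.5%
```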
### 3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed in VALL-E, which involves transcribing the generated audio with HuBERT-Large and computing the WER against the input text. Voila again leads with a WER of 3.2% (2.8% when using the LibriSpeech training data). A transcription sketch follows the table below.
| Model | LibriSpeech test-clean WER (%, lower is better) |
| --- | --- |
| YourTTS | 7.7 |
| VALL-E | 5.9 |
| Moshi | 4.7 |
| Voila (w/o LibriSpeech train split) | 3.2 |
| Voila (with LibriSpeech train split) | 2.8 |
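The following sketch shows the transcription step of this protocol using the `transformers` library. The checkpoint `facebook/hubert-large-ls960-ft` is a publicly available fine-tuned HuBERT-Large CTC model; whether it matches the exact checkpoint used in the VALL-E protocol is an assumption.

```python
# Hedged sketch of the TTS evaluation loop: transcribe synthesized audio with a
# HuBERT-Large CTC model, then score the transcript against the input text via WER.
# The checkpoint below is an assumption; the paper may use a different one.
import torch
import torchaudio
import jiwer
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft").eval()

def transcribe(wav_path: str) -> str:
    waveform, sr = torchaudio.load(wav_path)  # assumes mono audio
    if sr != 16_000:  # the CTC model expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# Score one synthesized utterance against the text fed to the TTS model
# ("tts_output.wav" is a placeholder path).
text = "the quick brown fox jumps over the lazy dog"
print(jiwer.wer(text.lower(), transcribe("tts_output.wav").lower()))
```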
## Technical Details
Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times.
## License
This project is released under the MIT License.
## Citation
If you find our work helpful, please cite us:
```bibtex
@article{voila2025,
  author        = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title         = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint        = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2025}
}
```