Voila-audio-alpha Open-source Speech Model - Supports Multiple Languages and Enables Real-time Low-latency Voice Interaction

Voila Audio Alpha

Developed by maitrix-org

Voila is a family of large voice-language foundation models designed to improve human-computer interaction. It supports real-time, low-latency voice interaction and multilingual processing.

Text-to-Audio

Transformers

Supports Multiple Languages

Open Source License: MIT

#Real-time voice interaction #Multilingual speech synthesis #Low-latency streaming processing

Downloads 175

Release Time : 3/18/2025

Model Overview

Through innovative end-to-end model design and hierarchical Transformer architecture, Voila achieves high-fidelity, low-latency voice interaction and supports various audio tasks, including ASR, TTS, and speech translation.

Model Features

High-Fidelity, Low-Latency

Supports real-time streaming audio processing with latency as low as 195 milliseconds.

Multilingual Support

Supports automatic speech recognition (ASR), text-to-speech (TTS), and speech translation in six languages.

Integration of Speech and Language Modeling

Efficiently integrates speech and language modeling capabilities to provide rich interactive experiences.

Millions of Pre-built Voices

Supports millions of pre-built and customizable voices, allowing quick switching during conversations.

Model Capabilities

Real-time voice interaction

Automatic speech recognition (ASR)

Text-to-speech (TTS)

Speech translation

Multilingual processing

Use Cases

Voice Interaction

Real-Time Voice Chat

Supports low-latency real-time voice chat, suitable for scenarios like customer service and virtual assistants.

Latency can be as low as 195 milliseconds, which is shorter than the average human response time.

Speech Synthesis

High-Fidelity Speech Synthesis

Generates natural, high-fidelity speech output, suitable for scenarios like audiobooks and navigation.

Word error rate (WER) of 3.2% (without using LibriSpeech training data).

🚀 Voila: Voice-Language Foundation Models

Voila is a new family of large voice-language foundation models. It aims to elevate human-AI interaction experiences. By breaking free from the limitations of traditional voice AI systems, such as high latency, loss of vocal nuances, and mechanical responses, Voila offers real-time, autonomous, and rich voice interactions. It combines advanced voice and language modeling, excelling in various audio tasks across six languages.

🚀 Quick Start

CLI demo

for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat: text in, text out
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat: audio in, audio out
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
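
The same invocations can also be driven from Python. The sketch below is a convenience wrapper, not part of the Voila repository: it only assembles the `infer.py` flags shown in the CLI demo (`--model-name`, `--instruction`, `--input-text`, `--input-audio`, `--task-type`). The helper names `build_infer_command` and `run` are hypothetical, and the script assumes it is started from the repository root so that `infer.py` resolves.

```python
import shlex
import subprocess
from typing import List, Optional


def build_infer_command(
    model_name: str,
    task_type: str,
    input_text: Optional[str] = None,
    input_audio: Optional[str] = None,
    instruction: str = "",
) -> List[str]:
    """Assemble an infer.py command using only the flags from the CLI demo."""
    # Exactly one kind of input is expected per call.
    if (input_text is None) == (input_audio is None):
        raise ValueError("provide exactly one of input_text or input_audio")
    cmd = ["python", "infer.py", "--model-name", model_name,
           "--instruction", instruction]
    if input_text is not None:
        cmd += ["--input-text", input_text]
    else:
        cmd += ["--input-audio", input_audio]
    cmd += ["--task-type", task_type]
    return cmd


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for name in ("maitrix-org/Voila-audio-alpha",
                 "maitrix-org/Voila-base",
                 "maitrix-org/Voila-chat"):
        run(build_infer_command(name, "chat_tito", input_text="Hello"))
        run(build_infer_command(name, "chat_aiao",
                                input_audio="examples/test1.mp3"))
    run(build_infer_command("maitrix-org/Voila-autonomous-preview",
                            "chat_aiao_auto",
                            input_audio="examples/test_autonomous1.mp3"))
```

With `dry_run=True` (the default here) the script only prints the commands, which makes it easy to check the arguments before loading any model weights.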

Gradio demo

python gradio_demo.py

For more information, please refer to the code repository.

✨ Features

High-fidelity, low-latency, real-time streaming audio processing
Effective integration of voice and language modeling capabilities
Millions of pre-built and custom voices, fast voice switching during conversation
Unified model for various audio tasks

💻 Usage Examples

Basic Usage

The basic usage can be demonstrated through the CLI and Gradio demos as shown above.

📚 Documentation

Foundation Models

Model	Download Link
Voila-base	https://huggingface.co/maitrix-org/Voila-base
Voila-Chat	https://huggingface.co/maitrix-org/Voila-chat
Voila-Autonomous (preview)	https://huggingface.co/maitrix-org/Voila-autonomous-preview
Voila-Audio-alpha	https://huggingface.co/maitrix-org/Voila-audio-alpha
Voila-Tokenizer	https://huggingface.co/maitrix-org/Voila-Tokenizer
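
For scripted setups, the checkpoints listed above can be fetched ahead of time. This is a minimal sketch, assuming the `huggingface_hub` package is installed (`pip install huggingface_hub`). The repository IDs are taken from the download links; the short keys in `MODEL_REPOS` and the helper names `repo_url` and `download_voila` are made up for illustration.

```python
from typing import Optional

# Repository IDs copied from the download links; the short keys are illustrative.
MODEL_REPOS = {
    "base": "maitrix-org/Voila-base",
    "chat": "maitrix-org/Voila-chat",
    "autonomous-preview": "maitrix-org/Voila-autonomous-preview",
    "audio-alpha": "maitrix-org/Voila-audio-alpha",
    "tokenizer": "maitrix-org/Voila-Tokenizer",
}


def repo_url(name: str) -> str:
    """Return the Hugging Face page for one of the models listed above."""
    return "https://huggingface.co/" + MODEL_REPOS[name]


def download_voila(name: str, local_dir: Optional[str] = None) -> str:
    """Download a full model snapshot and return the local path."""
    # Imported lazily so the lookup helpers work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_REPOS[name], local_dir=local_dir)


if __name__ == "__main__":
    print(repo_url("audio-alpha"))
    # Uncomment to download (needs network access and sufficient disk space):
    # print(download_voila("audio-alpha"))
```

When `local_dir` is left unset, `snapshot_download` stores files in the default Hugging Face cache, so repeated calls reuse what is already on disk.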

Datasets

We publish the following two datasets: Voila Benchmark and Voila Voice Library. Voila Benchmark is a novel speech evaluation benchmark, while Voila Voice Library provides millions of pre-built and customizable voices.

Dataset	Description	Download Link
Voila Benchmark	Speech evaluation benchmark	https://huggingface.co/datasets/maitrix-org/Voila-Benchmark
Voila Voice Library	Millions of pre-built voices	https://huggingface.co/datasets/maitrix-org/Voila-million-voice

Benchmark

1. Voila Benchmark

We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8K. We compare our results with SpeechGPT and Moshi.

Model	Voila Benchmark
SpeechGPT	13.29
Moshi	11.45
Voila	30.56

(higher is better)

For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").
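
The idea of building one benchmark by sampling from several source datasets can be illustrated as follows. This is only a sketch: the actual sampling procedure, sample sizes, and any speech conversion step are described in the paper, not here. The function name `sample_benchmark`, the fixed per-source quota, and the placeholder prompts are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def sample_benchmark(
    sources: Dict[str, List[str]], per_source: int, seed: int = 0
) -> List[dict]:
    """Draw up to `per_source` prompts from each source, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    items = []
    for name in sorted(sources):  # sorted for a stable iteration order
        pool = sources[name]
        k = min(per_source, len(pool))
        # Sample without replacement and record where each prompt came from.
        for prompt in rng.sample(pool, k):
            items.append({"source": name, "prompt": prompt})
    return items


if __name__ == "__main__":
    # Placeholder prompts; the real datasets would be loaded from disk.
    toy_sources = {
        name: [f"{name} question {i}" for i in range(10)]
        for name in ("MMLU", "MATH", "HumanEval", "NQ-Open", "GSM8K")
    }
    subset = sample_benchmark(toy_sources, per_source=2, seed=42)
    for item in subset:
        print(item["source"], "|", item["prompt"])
```

Keeping the source name on every item makes it straightforward to report per-domain scores alongside the aggregate.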

2. Evaluation of ASR

As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate the performance of ASR and TTS. For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. In scenarios where both models utilize LibriSpeech training data, Voila achieves an impressive WER of 2.7%.

Model	LibriSpeech test-clean (WER, %)
Whisper large v2	2.7
Whisper large v3	2.2
FastConformer	3.6
VoxtLM	2.7
Moshi	5.7
Voila (w/o LibriSpeech train split)	4.8
Voila (with LibriSpeech train split)	2.7

(lower is better)
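
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below computes a corpus-level WER in percent; it is a generic illustration and not the evaluation script behind the numbers above. The names `word_edit_distance` and `wer` are chosen here for illustration.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words, using a rolling one-row table."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors += word_edit_distance(ref_tokens, hyp_tokens)
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / ref_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sat on mat"]  # one deleted word out of six
    print(f"WER: {wer(refs, hyps):.2f}%")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.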

3. Evaluation of TTS

For TTS, we follow the evaluation metrics proposed in Vall-E, which involves transcribing the generated audio using HuBERT-Large. Voila once again leads with a WER of 3.2% (and 2.8% when using LibriSpeech training data).

Model	LibriSpeech test-clean (WER, %)
YourTTS	7.7
Vall-E	5.9
Moshi	4.7
Voila (w/o LibriSpeech train split)	3.2
Voila (with LibriSpeech train split)	2.8

(lower is better)
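
The round-trip idea behind this TTS metric (synthesize the text, transcribe the audio with an ASR model, then score the transcript against the input text) can be sketched as below. The `synthesize` and `transcribe` callables are hypothetical placeholders for a TTS model and an ASR model such as HuBERT-Large; neither is implemented here. The normalization step is an assumption, added because ASR transcripts often differ from the input text in casing and punctuation.

```python
import re
from typing import Callable, Iterable, List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def _edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def tts_round_trip_wer(
    texts: Iterable[str],
    synthesize: Callable[[str], bytes],   # hypothetical: text -> audio
    transcribe: Callable[[bytes], str],   # hypothetical: audio -> text
) -> float:
    """WER (%) between the input texts and ASR transcripts of the TTS output."""
    errors = 0
    words = 0
    for text in texts:
        audio = synthesize(text)
        ref = normalize(text)
        hyp = normalize(transcribe(audio))
        errors += _edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words to score")
    return 100.0 * errors / words


if __name__ == "__main__":
    # Stand-in stubs: "audio" is just the encoded text, and the fake ASR
    # returns it upper-cased, so the round trip is lossless after normalization.
    fake_tts = lambda text: text.encode("utf-8")
    fake_asr = lambda audio: audio.decode("utf-8").upper()
    print(tts_round_trip_wer(["Hello, world!"], fake_tts, fake_asr))
```

Note that a WER measured this way reflects errors from both the TTS model and the ASR model used for transcription, so scores are only comparable when the same transcriber is used throughout.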

📄 License

The project is licensed under the MIT license.

📚 Citation

If you find our work helpful, please cite us.

@article{voila2025,
  author        = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title         = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint        = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2025}
}

Latest News!!

April 28, 2025: We've released the inference code and model weights of Voila.
