Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture to achieve real-time, autonomous, and rich voice interactions with latency as low as 195 milliseconds. Combining advanced speech and language modeling techniques, Voila offers customizable, character-driven interaction experiences and excels at a range of audio tasks, from ASR and TTS to speech translation across six languages.
## Model Features

- **High fidelity, low latency**: achieves real-time streaming audio processing with latency as low as 195 milliseconds
- **Integrated speech and language modeling**: effectively combines speech and language modeling capabilities in a single architecture
- **Multi-voice support**: offers millions of pre-built and custom voices, enabling rapid voice switching during conversations
- **Unified model for multiple tasks**: a single model handles diverse audio tasks
## Model Capabilities

- Speech recognition
- Text-to-speech
- Speech translation
- Voice dialogue
- Audio understanding
## Use Cases

- **Human-computer interaction**: real-time voice dialogue enables low-latency, natural voice conversations; latency as low as 195 ms surpasses the average human response time
- **Speech processing**: multilingual speech translation across six languages
# Voila: Voice-Language Foundation Models
Voila is a new family of large voice-language foundation models that aims to elevate human-AI interaction experiences. Breaking free from the limitations of traditional voice AI systems, such as high latency, loss of vocal nuances, and mechanical responses, Voila uses an innovative end-to-end model design and a novel hierarchical Transformer architecture to enable real-time, autonomous, and rich voice interactions with extremely low latency.
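To make the hierarchical design concrete, here is a minimal structural sketch in PyTorch. It is not the released implementation: it assumes a layout in the spirit of recent hierarchical audio LMs, where a large temporal backbone runs once per audio frame and a small depth transformer predicts that frame's stack of audio codebook tokens; all layer sizes and module names are illustrative.

```python
# Structural sketch only (not the released implementation), assuming a layout in
# the spirit of recent hierarchical audio LMs: a large temporal backbone attends
# causally over frames, and a small "depth" transformer predicts the stack of
# audio codebook tokens for the next frame. All sizes are illustrative.
import torch
import torch.nn as nn

class HierarchicalVoiceLM(nn.Module):
    def __init__(self, d_model=1024, n_codebooks=4, codebook_size=2048):
        super().__init__()
        # Temporal backbone: one causal pass over past audio/text frames.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True),
            num_layers=12,
        )
        # Depth transformer: a much smaller model over one frame's codebooks.
        self.depth = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.n_codebooks = n_codebooks
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)
        )

    def forward(self, frame_embeddings):  # (batch, time, d_model)
        t = frame_embeddings.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        ctx = self.backbone(frame_embeddings, mask=causal)
        # Condition the depth model on the latest frame context and emit one
        # logit distribution per codebook for the next frame.
        last = ctx[:, -1:, :].expand(-1, self.n_codebooks, -1)
        h = self.depth(last)
        return [head(h[:, i]) for i, head in enumerate(self.heads)]
```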
We release the following two datasets: the Voila Benchmark and the Voila Voice Library. The Voila Benchmark is a novel speech evaluation benchmark, while the Voila Voice Library provides millions of pre-built and customizable voices.
Voila Voice Library: [https://huggingface.co/datasets/maitrix-org/Voila-million-voice](https://huggingface.co/datasets/maitrix-org/Voila-million-voice)
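As a quick-start illustration, the voice library can be browsed with the Hugging Face `datasets` package. This is a minimal sketch, not an official loader: the split name and record fields are assumptions about the dataset's schema.

```python
# Minimal sketch: browse the Voila Voice Library via the Hugging Face `datasets`
# package. Streaming avoids downloading millions of entries up front. The split
# name ("train") and the record fields are assumptions about the schema.
from datasets import load_dataset

voices = load_dataset("maitrix-org/Voila-million-voice", split="train", streaming=True)

# Peek at a few records to see which fields are available.
for i, record in enumerate(voices):
    print(sorted(record.keys()))
    if i >= 2:
        break
```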
## Benchmark
### 1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8K. We compare our results with SpeechGPT and Moshi; a construction sketch follows the table below.
| Model | Voila Benchmark (higher is better) |
| --- | --- |
| SpeechGPT | 13.29 |
| Moshi | 11.45 |
| Voila | 30.56 |
For detailed per-domain scores on the Voila Benchmark, please refer to our paper (Section 5.1, "Evaluation of Voila Benchmark").
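To make the construction procedure concrete, here is a hedged sketch of sampling prompts from the five source datasets with the Hugging Face `datasets` library. The dataset paths, config names, splits, field names, and sample counts are assumptions, not the paper's exact recipe, and some hub datasets may need extra loading arguments.

```python
# Hedged sketch: sample text prompts from the five source datasets used to build
# the Voila Benchmark. Paths, configs, splits, fields, and counts are assumptions;
# the paper's exact sampling protocol may differ.
import random
from datasets import load_dataset

SOURCES = {
    "mmlu": ("cais/mmlu", "all", "test", "question"),
    "math": ("hendrycks/competition_math", None, "test", "problem"),
    "humaneval": ("openai_humaneval", None, "test", "prompt"),
    "nq_open": ("nq_open", None, "validation", "question"),
    "gsm8k": ("gsm8k", "main", "test", "question"),
}

def sample_prompts(n_per_source=20, seed=0):
    rng = random.Random(seed)
    benchmark = []
    for name, (path, config, split, field) in SOURCES.items():
        ds = load_dataset(path, config, split=split) if config else load_dataset(path, split=split)
        for idx in rng.sample(range(len(ds)), n_per_source):
            benchmark.append({"source": name, "prompt": ds[idx][field]})
    return benchmark
```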
### 2. Evaluation of ASR
Since Voila supports multiple tasks, including automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering, we also evaluate ASR and TTS performance.

For ASR, we assess performance on the LibriSpeech test-clean dataset, using word error rate (WER) as the metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When both models use the LibriSpeech training data, Voila achieves an impressive WER of 2.7%. A sketch of the WER computation follows the table below.
| Model | LibriSpeech test-clean WER (%, lower is better) |
| --- | --- |
| Whisper large v2 | 2.7 |
| Whisper large v3 | 2.2 |
| FastConformer | 3.6 |
| VoxtLM | 2.7 |
| Moshi | 5.7 |
| Voila (w/o LibriSpeech train split) | 4.8 |
| Voila (with LibriSpeech train split) | 2.7 |
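For reference, WER can be computed with the `jiwer` package, as in this minimal sketch. The text normalization shown (lowercasing, punctuation stripping) is an assumption, since evaluation pipelines differ in how they normalize transcripts.

```python
# Minimal WER sketch using the `jiwer` package (pip install jiwer).
# WER = (substitutions + deletions + insertions) / reference word count.
# The normalization below is an assumption; ASR benchmarks vary on this point.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

reference = "he hoped there would be stew for dinner"
hypothesis = "he hoped there would be stew for diner"

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer:.1%}")  # 1 substitution over 8 reference words -> 12.5%
```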
### 3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed in VALL-E, which involves transcribing the generated audio with HuBERT-Large and computing the WER against the input text. Voila again leads with a WER of 3.2% (2.8% when using the LibriSpeech training data). A transcription sketch follows the table below.
| Model | LibriSpeech test-clean WER (%, lower is better) |
| --- | --- |
| YourTTS | 7.7 |
| VALL-E | 5.9 |
| Moshi | 4.7 |
| Voila (w/o LibriSpeech train split) | 3.2 |
| Voila (with LibriSpeech train split) | 2.8 |
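The following sketch shows the transcription step of this protocol using the `transformers` library. The checkpoint `facebook/hubert-large-ls960-ft` is a publicly available fine-tuned HuBERT-Large CTC model; whether it matches the exact checkpoint used in the VALL-E protocol is an assumption.

```python
# Hedged sketch of the TTS evaluation loop: transcribe synthesized audio with a
# HuBERT-Large CTC model, then score the transcript against the input text via WER.
# The checkpoint below is an assumption; the paper may use a different one.
import torch
import torchaudio
import jiwer
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft").eval()

def transcribe(wav_path: str) -> str:
    waveform, sr = torchaudio.load(wav_path)  # assumes mono audio
    if sr != 16_000:  # the CTC model expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# Score one synthesized utterance against the text fed to the TTS model
# ("tts_output.wav" is a placeholder path).
text = "the quick brown fox jumps over the lazy dog"
print(jiwer.wer(text.lower(), transcribe("tts_output.wav").lower()))
```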
## Technical Details
Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times.
## License
This project is released under the MIT License.
## Citation
If you find our work helpful, please cite us:
```bibtex
@article{voila2025,
  author        = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title         = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint        = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2025}
}
```