Voila: Voice-Language Foundation Models
Voila is a new family of large voice-language foundation models. It breaks away from the limitations of traditional voice AI systems, enabling real-time, autonomous, and rich voice interactions. By integrating advanced voice and language modeling, it excels at a wide range of audio tasks across six languages, offering a transformative human-AI interaction experience.

Project Page    |    GitHub    |    Hugging Face   |    Paper    |    Online Demo   |    Maitrix.org
Features
- High-fidelity, low-latency, real-time streaming audio processing
- Effective integration of voice and language modeling capabilities
- Millions of pre-built and custom voices, with fast voice switching during conversation
- Unified model for a wide range of audio tasks
- Unified model for various audio tasks
Usage Examples
Basic Usage
CLI demo
```shell
for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
	# Text chat
	python infer.py \
	    --model-name ${model_name} \
	    --instruction "" \
	    --input-text "Hello" \
	    --task-type chat_tito
	# Voice chat
	python infer.py \
	    --model-name ${model_name} \
	    --instruction "" \
	    --input-audio "examples/test1.mp3" \
	    --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
```
Gradio demo
```shell
python gradio_demo.py
```
For more information, please refer to the code repository.
Documentation
Foundation Models
| Model | Description |
| ----- | ----------- |
| Voila-base | Voila base model |
| Voila-Chat | End-to-end audio chat model |
| Voila-Autonomous (preview) | Full-duplex audio chat model |
| Voila-Audio-alpha | Empowering an LLM with raw audio input |
| Voila-Tokenizer | Audio tokenizer |

Download links for each model are available on Hugging Face.
Datasets
| Dataset | Description |
| ------- | ----------- |
| Voila Benchmark | Speech evaluation benchmark |
| Voila Voice Library | Millions of pre-built voices |

Download links for both datasets are available on Hugging Face.
Benchmark
1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark, constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.
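The construction step described above, sampling items from each source dataset into one mixed benchmark, can be sketched as follows. Note this is an illustrative sketch, not the paper's construction code; the dataset contents below are hypothetical placeholders standing in for the real MMLU, MATH, HumanEval, NQ-Open, and GSM8k items:

```python
import random

# Hypothetical stand-ins for the five source datasets; the real benchmark
# samples from MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k.
SOURCES = {
    "MMLU": ["mmlu-q1", "mmlu-q2", "mmlu-q3"],
    "MATH": ["math-q1", "math-q2", "math-q3"],
    "HumanEval": ["he-q1", "he-q2", "he-q3"],
    "NQ-Open": ["nq-q1", "nq-q2", "nq-q3"],
    "GSM8k": ["gsm-q1", "gsm-q2", "gsm-q3"],
}

def build_benchmark(sources, per_source=2, seed=0):
    """Sample a fixed number of items from each source dataset."""
    rng = random.Random(seed)
    benchmark = []
    for name, items in sources.items():
        picked = rng.sample(items, min(per_source, len(items)))
        benchmark.extend({"source": name, "question": q} for q in picked)
    return benchmark

bench = build_benchmark(SOURCES)
```

Each sampled text item is then rendered as speech, so the benchmark probes the same reasoning abilities as the underlying text datasets through the audio modality.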
| Model | Voila Benchmark |
| ----- | --------------- |
| SpeechGPT | 13.29 |
| Moshi | 11.45 |
| Voila | 30.56 |
(higher is better)
For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").
2. Evaluation of ASR
As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate ASR and TTS performance. For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric.
| Model | LibriSpeech test-clean (WER) |
| ----- | ---------------------------- |
| Whisper large v2 | 2.7 |
| Whisper large v3 | 2.2 |
| FastConformer | 3.6 |
| VoxtLM | 2.7 |
| Moshi | 5.7 |
| Voila (w/o LibriSpeech train split) | 4.8 |
| Voila (with LibriSpeech train split) | 2.7 |
(lower is better)
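WER, the metric in the table above, is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the model's transcription, normalized by the reference length. A minimal self-contained sketch, not the evaluation code used for these numbers:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

In practice, transcripts are normalized (lowercased, punctuation removed) before scoring; the numbers in the table are percentages, i.e. `wer(...) * 100`.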
3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed in VALL-E, which involves transcribing the generated audio with HuBERT-Large and scoring the transcription against the reference text.
| Model | LibriSpeech test-clean (WER) |
| ----- | ---------------------------- |
| YourTTS | 7.7 |
| VALL-E | 5.9 |
| Moshi | 4.7 |
| Voila (w/o LibriSpeech train split) | 3.2 |
| Voila (with LibriSpeech train split) | 2.8 |
(lower is better)
License
This project is licensed under the MIT license.
Citation
If you find our work helpful, please cite us.
```bibtex
@article{voila2025,
  author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  year = {2025}
}
```