# 🚀 Zonos-v0.1
Zonos-v0.1 is a leading open-weight text-to-speech model. It's trained on over 200k hours of diverse multilingual speech, offering expressiveness and quality that can match or even outperform top TTS providers. The model can generate highly natural speech from text prompts with a speaker embedding or audio prefix. It can also accurately clone speech using just a few seconds of reference audio. Moreover, it allows fine control over speaking rate, pitch variation, audio quality, and emotions like happiness, fear, sadness, and anger. The output speech has a native sampling rate of 44kHz.
## 🚀 Quick Start

### 📦 Installation
At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).
See also the Docker installation section below.
#### System dependencies

Zonos depends on the eSpeak library for phonemization. You can install it on Ubuntu with the following command:

```sh
apt install -y espeak-ng
```
#### Python dependencies

We highly recommend using a recent version of uv for installation. If you don't have uv installed, you can install it via pip: `pip install -U uv`.
##### Installing into a new uv virtual environment (recommended)

```sh
uv sync
uv sync --extra compile
```
##### Installing into the system/activated environment using uv

```sh
uv pip install -e .
uv pip install -e .[compile]
```
##### Installing into the system/activated environment using pip

```sh
pip install -e .
pip install --no-build-isolation -e .[compile]
```
#### Confirm that it's working

For convenience, we provide a minimal example to check that the installation works:

```sh
uv run sample.py
```
#### Docker installation

```sh
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
docker compose up
```

Alternatively, build and run the container manually:

```sh
docker build -t Zonos .
docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t Zonos
cd /Zonos
python sample.py
```
## ✨ Features

- Zero-shot TTS with voice cloning: Input the desired text and a 10–30 s speaker sample to generate high-quality TTS output.
- Audio prefix inputs: Add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviors such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings.
- Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German.
- Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
- Fast: Our model runs with a real-time factor of ~2x on an RTX 4090.
- Gradio WebUI: Zonos comes packaged with an easy-to-use Gradio interface for generating speech.
- Simple installation and deployment: Zonos can be installed and deployed simply using the Dockerfile packaged with our repository.
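Real-time factor here uses the convention where higher is faster: generated audio duration divided by wall-clock generation time. A minimal sketch of the computation (the numbers below are illustrative, not benchmarks; only the ~2x RTX 4090 figure above is from our measurements):

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Generated audio length divided by the time taken to generate it."""
    return audio_seconds / generation_seconds

# Illustrative: 10 s of audio produced in 5 s of compute gives an RTF of 2x,
# i.e. generation runs twice as fast as playback.
assert real_time_factor(10.0, 5.0) == 2.0
```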
## 💻 Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

codes = model.generate(conditioning)

wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
```
This should produce a `sample.wav` file in your project root directory.

### Gradio interface (recommended)

```sh
uv run gradio_interface.py
```
## 💡 Usage Tip

For repeated sampling, we highly recommend using the Gradio interface instead, as the minimal example needs to load the model every time it is run.
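The point of the tip is to pay the model-load cost only once. A minimal sketch of that load-once pattern, with a cheap stand-in loader in place of the real `Zonos.from_pretrained` call:

```python
import functools

@functools.lru_cache(maxsize=1)
def get_model(name: str):
    # Stand-in for the expensive Zonos.from_pretrained(...) load.
    return object()

# The first call loads the model; every later call reuses the cached instance
# instead of reloading, which is what the Gradio interface does for you.
first = get_model("Zyphra/Zonos-v0.1-transformer")
second = get_model("Zyphra/Zonos-v0.1-transformer")
assert first is second
```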
## 🔧 Technical Details
Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.
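Schematically, the stages read as follows. Every function body here is a toy stand-in for illustration only; the real stages are eSpeak phonemization and DAC token prediction by the Zonos transformer or hybrid backbone:

```python
def normalize(text: str) -> str:
    # Toy normalization: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()

def phonemize(text: str) -> list[str]:
    # Stand-in for eSpeak: treat each non-space character as a "phoneme".
    return [ch for ch in text if not ch.isspace()]

def predict_dac_tokens(phonemes: list[str]) -> list[int]:
    # Stand-in for the backbone predicting discrete DAC audio codes.
    return [ord(p) % 1024 for p in phonemes]

# Pipeline order: normalize -> phonemize -> predict audio tokens.
codes = predict_dac_tokens(phonemize(normalize("Hello,   World!")))
```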
## 📄 License
This project is licensed under the Apache-2.0 license.