Dia-1.6B Open-source Text-to-Speech Model - Free Generation of Realistic Dialogues with Support for Emotional Tone Control

Dia 1.6B

Developed by nari-labs

Dia is a 1.6 billion parameter text-to-speech model developed by Nari Labs, capable of generating highly realistic conversations directly from text, supporting emotional and tonal control, and producing non-verbal communication content.

Speech Synthesis

Safetensors

English

Open Source License: Apache-2.0

#Conversational Speech Synthesis #Emotional Tone Control #Non-verbal Communication Generation

Downloads 80.28k

Release Time: 4/20/2025

Model Overview

Dia is an open-weight text-to-dialogue model. It supports emotional and tonal control by conditioning its output on audio, and it can generate non-verbal communication content such as laughter and coughing.

Model Features

Highly Realistic Dialogue Generation

Capable of generating highly realistic dialogues directly from text, supporting emotional and tonal control.

Non-verbal Communication Generation

Can generate non-verbal communication content such as laughter, coughing, and throat clearing.

Voice Cloning

Supports voice cloning functionality, allowing users to replicate voices by uploading audio samples.

Open Weights

The model weights are fully open-source, giving users complete control over scripts and speech.

Model Capabilities

Text-to-Speech

Emotional and Tonal Control

Non-verbal Communication Generation

Voice Cloning

Use Cases

Dialogue Generation

Dia Introduction

Generate dialogue content introducing the Dia model

Highly realistic dialogue effects

Emergency Scenarios

Generate dialogue content for emergency situations

Emotionally rich speech output

Voice Cloning

Custom Voice

Clone a specific voice by uploading audio

Generate speech resembling the cloned voice

🚀 Dia - A Text-to-Speech Model

Dia is a 1.6B parameter text-to-speech model developed by Nari Labs. It directly generates highly realistic dialogue from a transcript, offering full control over scripts and voices. The model can also produce non-verbal communications and allows for emotion and tone control by conditioning the output on audio.

🚀 Quick Start

The following commands open a Gradio UI that you can work in:

git clone https://github.com/nari-labs/dia.git
cd dia && uv run app.py

Or, if you do not have uv pre-installed:

git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install uv
uv run app.py

Note that the model was not fine-tuned on a specific voice, so you will get a different voice every time you run it. You can keep the speaker consistent by either adding an audio prompt (a guide is coming VERY soon; try the second example on Gradio for now) or fixing the seed.
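Fixing the seed, as mentioned above, could look roughly like the sketch below. The helper name `set_seed` is hypothetical and not part of the dia package. Whether seeding alone gives a fully stable voice depends on Dia's sampling code, which this page does not describe, so treat it as an assumption to verify.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the random number generators this sketch knows about.

    Hypothetical helper (not part of the dia package). NumPy and PyTorch
    are seeded only if they are installed.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            # Seed every visible CUDA device as well.
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass
```

Calling `set_seed(42)` immediately before `model.generate(text)` would be the natural place to use it.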

✨ Features

Dialogue Generation: Generate dialogue via the [S1] and [S2] tags.
Non-verbal Communication: Generate non-verbal sounds such as (laughs), (coughs), etc.
- The non-verbal tags below are recognized, but might result in unexpected output.
- (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Voice Cloning: See example/voice_clone.py for more information. In the Hugging Face Space, you can upload the audio you want to clone and place its transcript before your script. Make sure the transcript follows the required format. The model will then output only the content of your script.
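The tagging conventions above can be sketched as a few small text helpers. All function names here (`build_script`, `unknown_tags`, `with_clone_transcript`) are hypothetical and not part of the dia package; the tag list is copied from the list above, and the assumption that the cloning transcript is simply prepended with a space is an illustration only.

```python
import re

# Non-verbal tags copied from the list above.
RECOGNIZED_TAGS = {
    "(laughs)", "(clears throat)", "(sighs)", "(gasps)", "(coughs)",
    "(singing)", "(sings)", "(mumbles)", "(beep)", "(groans)", "(sniffs)",
    "(claps)", "(screams)", "(inhales)", "(exhales)", "(applause)",
    "(burps)", "(humming)", "(sneezes)", "(chuckle)", "(whistles)",
}


def build_script(turns):
    """Join (speaker, text) turns into one string with [S1]/[S2] tags."""
    parts = []
    for speaker, text in turns:
        if speaker not in (1, 2):
            raise ValueError(f"speaker must be 1 or 2, got {speaker!r}")
        parts.append(f"[S{speaker}] {text.strip()}")
    return " ".join(parts)


def unknown_tags(script):
    """Return parenthesised spans that are not in RECOGNIZED_TAGS.

    Ordinary parenthetical text is flagged too, since this check only
    looks at parentheses.
    """
    found = re.findall(r"\([^()]*\)", script)
    return [tag for tag in found if tag.lower() not in RECOGNIZED_TAGS]


def with_clone_transcript(transcript, script):
    """Place the reference audio's transcript before the new script."""
    return f"{transcript.strip()} {script.strip()}"


script = build_script([(1, "Hello there. (laughs)"), (2, "Hi! (yawns)")])
print(script)
print(unknown_tags(script))
```

Running a check like `unknown_tags` before generation is a cheap way to catch tags the model was not listed as recognizing.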

💻 Usage Examples

Basic Usage

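import soundfile as sf

from dia.model import Dia


# Load the pretrained weights from the Hugging Face Hub.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1] and [S2] mark speaker turns; (laughs) is a non-verbal tag.
text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

# Generate the audio for the whole script.
output = model.generate(text)

# Save the result at a 44.1 kHz sample rate.
sf.write("simple.mp3", output, 44100)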

A PyPI package and a working CLI tool will be available soon.

🔧 Technical Details

Hardware and Inference Speed

Dia has only been tested on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon. The initial run will take longer because the Descript Audio Codec also needs to be downloaded.

On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower. For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s (86 tokens equal 1 second of audio). torch.compile will increase speeds for supported GPUs.
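The throughput figures above translate into a real-time factor with a little arithmetic. The sketch below uses only the two numbers quoted in the text (40 tokens/s on an A4000, 86 tokens per second of audio); the function names are illustrative.

```python
TOKENS_PER_AUDIO_SECOND = 86  # from the text: 86 tokens equal 1 second of audio
A4000_TOKENS_PER_SECOND = 40  # from the text: rough A4000 throughput


def real_time_factor(tokens_per_second,
                     tokens_per_audio_second=TOKENS_PER_AUDIO_SECOND):
    """Seconds of audio produced per second of wall-clock time."""
    return tokens_per_second / tokens_per_audio_second


def generation_time(audio_seconds, tokens_per_second,
                    tokens_per_audio_second=TOKENS_PER_AUDIO_SECOND):
    """Wall-clock seconds needed to generate `audio_seconds` of audio."""
    return audio_seconds * tokens_per_audio_second / tokens_per_second


print(round(real_time_factor(A4000_TOKENS_PER_SECOND), 3))  # 0.465
print(generation_time(10, A4000_TOKENS_PER_SECOND))         # 21.5
```

In other words, at 40 tokens/s a 10-second clip takes about 21.5 seconds to generate, and real-time generation requires at least 86 tokens/s.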

The full version of Dia requires around 10GB of VRAM to run. We will be adding a quantized version in the future.

If you don't have hardware available or if you want to play with bigger versions of our models, join the waitlist here.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

⚠️ Important Note

This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:

Identity Misuse: Do not produce audio resembling real individuals without permission.
Deceptive Content: Do not use this model to generate misleading content (e.g. fake news).
Illegal or Malicious Use: Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are not responsible for any misuse and firmly oppose any unethical usage of this technology.

📚 Documentation

TODO / Future Work

Docker support.
Optimize inference speed.
Add quantization for memory efficiency.

Contributing

We are a tiny team of one full-time and one part-time research engineer. Contributions are very welcome! Join our Discord server for discussions.

Acknowledgements

We thank the Google TPU Research Cloud program for providing computation resources.
Our work was heavily inspired by SoundStorm, Parakeet, and Descript Audio Codec.
We thank Hugging Face for providing the ZeroGPU Grant.
"Nari" is a pure Korean word for lily.
We thank Jason Y. for providing help with data filtering.

Property	Details
Model Type	A 1.6B parameter text to speech model
Training Data	Not provided
Hosted On	Hugging Face
Supported Languages	English

Widget Examples

Dia intro:
- Text: "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
Panic protocol:
- Text: "[S1] Oh fire! Oh my goodness! What's the procedure? What to we do people? The smoke could be coming through an air duct! [S2] Oh my god! Okay.. it's happening. Everybody stay calm! [S1] What's the procedure... [S2] Everybody stay fucking calm!!!... Everybody fucking calm down!!!!! [S1] No! No! If you touch the handle, if its hot there might be a fire down the hallway!"

We also provide a demo page comparing our model to ElevenLabs Studio and Sesame CSM-1B.

(Update) We have a ZeroGPU Space running! Try it now here. Thanks to the HF team for the support :)

Join our Discord server for community support and access to new features. Play with a larger version of Dia: generate fun conversations, remix content, and share with friends. 🔮 Join the waitlist for early access.
