OuteTTS 1.0 Open-source Text-to-Speech Model - Free Multilingual Synthesis and Voice Cloning

Llama OuteTTS 1.0 1B GPTQ 8bit

Developed by adriabama06

OuteTTS 1.0 is a 1B-parameter text-to-speech model supporting multilingual speech synthesis and voice cloning

Speech Synthesis Supports Multiple Languages#Multilingual speech synthesis #10-second voice cloning #DAC high-fidelity encoding

Downloads 15

Release Time : 4/7/2025

Model Overview

A speech synthesis model based on the Llama3.2 architecture, achieving high-fidelity audio reconstruction through DAC encoder, supporting text-to-speech and voice cloning in 17 major languages

Model Features

Native multilingual support

Directly supports text input in 23 languages without preprocessing like romanization conversion

Efficient voice cloning

Generates precise voiceprint clones with just 10 seconds of reference audio

Intelligent text alignment

Automatically handles word alignment for languages without clear boundaries (e.g., Japanese/Chinese)

DAC audio encoder

Utilizes IBM Research's high-fidelity dual-codebook architecture for significantly improved audio quality

Model Capabilities

Text-to-speech synthesis

Cross-language voice conversion

Voice feature cloning

Emotional speech generation

Long-form speech synthesis (up to 42 seconds)

Use Cases

Assistive technology

Accessible reading

Converts text content into speech for visually impaired users

Supports natural speech output in multiple languages

Content creation

Audio content production

Quickly generates podcasts/video voiceovers

Can clone specific host voices

Educational technology

Language learning tool

Generates multilingual pronunciation examples

Supports native pronunciation in 23 languages

🚀 OuteTTS Version 1.0

This update brings significant improvements in speech synthesis and voice cloning, delivering a more powerful, accurate, and user - friendly experience in a compact size.

🚀 Quick Start

Getting started with OuteTTS is simple:

📦 Installation

🔗 Installation instructions

💻 Usage Examples

🔍 Basic Usage

import outetts

# Initialize the interface
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        # For llama.cpp backend
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16
        # For transformers backend
        # backend=outetts.Backend.HF,
    )
)

# Load the default speaker profile
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or create your own speaker profiles in seconds and reuse them instantly
# speaker = interface.create_speaker("path/to/audio.wav")
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Generate speech
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4
        ),
    )
)

# Save to file
output.save("output.wav")

🔍 Advanced Usage

For advanced settings and customization, visit the official repository:
🔗 interface_usage.md

✨ Features

What's New

1. Prompt Revamp & Dependency Removal

Automatic Word Alignment: The model now performs word alignment internally. Simply input raw text—no pre - processing required—and the model handles the rest, streamlining your workflow. For optimal results, use normalized, readable text without newlines (light normalization is applied automatically in outetts library).
Native Multilingual Text Support: Direct support for native text across multiple languages eliminates the need for romanization.
Enhanced Metadata Integration: The updated prompt system incorporates additional metadata (time, energy, spectral centroid, pitch) at both global and word levels, improving speaker flow and synthesis quality.
Special Tokens for Audio Codebooks: New tokens for c1 (codebook 1) and c2 (codebook 2).

2. New Audio Encoder Model

DAC Encoder: Integrates a DAC audio encoder from [ibm - research/DAC.speech.v1.0](https://huggingface.co/ibm - research/DAC.speech.v1.0), utilizing two codebooks for high - quality audio reconstruction.
Performance Trade - off: Improved audio fidelity increases the token generation rate from 75 to 150 tokens per second. This trade - off prioritizes quality, especially for multilingual applications.

3. Voice Cloning

One - Shot Voice Cloning: To achieve one - shot cloning, the model typically requires only around 10 seconds of reference audio to produce an accurate voice representation.
Improved Accuracy: Enhanced by the new encoder and additional training metadata, voice cloning is now more natural and precise.

4. Auto Text Alignment & Numerical Support

Automatic Text Alignment: Aligns raw text at the word level, even for languages without clear boundaries (e.g., Japanese, Chinese), using insights from pre - processed training data.
Direct Numerical Input: Built - in multilingual numerical support allows direct use of numbers in prompts—no textual conversion needed. (The model typically chooses the dominant language present. Mixing languages in a single prompt may lead to mistakes.)

📚 Documentation

Usage Recommendations

Speaker Reference

⚠️ Important Note

The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, often leading to lower - quality outputs. The model inherits the referenced speaker's emotion, style, and accent. When transcribing to other languages with the same speaker, you may observe the model retaining the original accent.

Optimal Audio Length

Best Performance: Generate audio around 42 seconds in a single run (approximately 8,192 tokens). It is recommended not to near the limits of this window when generating. Usually, the best results are up to 7,000 tokens.
Context Reduction with Speaker Reference: If the speaker reference is 10 seconds long, the effective context is reduced to approximately 32 seconds.

Temperature Setting Recommendations

💡 Usage Tip

Testing shows that a temperature of 0.4 is an ideal starting point for accuracy (with the sampling settings below). However, some voice references may benefit from higher temperatures for enhanced expressiveness or slightly lower temperatures for more precise voice replication.

Verifying Speaker Encoding

If the cloned voice quality is subpar, check the encoded speaker sample.

interface.decode_and_save_speaker(speaker=your_speaker, path="speaker.wav")

The DAC audio reconstruction model is lossy, and samples with clipping, excessive loudness, or unusual vocal features may introduce encoding issues that impact output quality.

Sampling Configuration

For optimal results with this TTS model, use the following sampling settings.

Parameter	Value
Temperature	0.4
Repetition Penalty	1.1
Repetition Range	64
Top - k	40
Top - p	0.9
Min - p	0.05

For production or high - quality needs, it is strongly recommended to use llama.cpp for the best results.

Model Specifications

Property	Details
Training Data	Trained on ~60k hours of audio
Context Length	Supports a maximum context window of 8,192 tokens

Training Parameters

Pre - Training

Optimizer: AdamW
Batch Size: 1 million tokens
Max Learning Rate: 3e - 4
Min Learning Rate: 3e - 5
Context Length: 8192

Fine - Tuning

Optimizer: AdamW
Max Learning Rate: 1e - 5
Min Learning Rate: 5e - 6
Data: 10,000 diverse, high - quality examples

Video Showcase

📄 License

Initial Llama3.2 Components: [Llama 3.2 Community License Agreement ](https://huggingface.co/meta - llama/Llama - 3.2 - 1B/blob/main/LICENSE.txt)
Our Continued Pre - Training, Fine - Tuning, and Additional Components: [CC - BY - NC - SA - 4.0](https://creativecommons.org/licenses/by - nc - sa/4.0/)

Acknowledgments

Big thanks to Hugging Face for their continued resource support through their grant program!
Audio encoding and decoding utilize [ibm - research/DAC.speech.v1.0](https://huggingface.co/ibm - research/DAC.speech.v1.0)
OuteTTS is built using [Llama3.2 - 1B](https://huggingface.co/meta - llama/Llama - 3.2 - 1B) as the base model, with continued pre - training and fine - tuning.

Ethical Use Guidelines

This text - to - speech model is intended for legitimate applications that enhance accessibility, creativity, and communication; prohibited uses include impersonation without consent, creation of deliberately misleading content, generation of harmful or harassing material, distribution of synthetic audio without proper disclosure, voice cloning without permission, and any uses that violate applicable laws, regulations, or copyrights.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご