🚀 OuteTTS Version 1.0
This update brings significant improvements in speech synthesis and voice cloning, delivering a more powerful, accurate, and user-friendly experience in a compact size.
🔍 See our collection for all TTS model uploads
- Explore our TTS models: see our collection for all our TTS model uploads.
- Learn TTS fine-tuning: read our guide to fine-tuning TTS models.
- Unsloth Dynamic 2.0: achieves superior accuracy and outperforms other leading quants.
📊 Model Information
| Property | Details |
|---|---|
| Pipeline Tag | text-to-speech |
| Library Name | outetts |
| Base Model | OuteAI/Llama-OuteTTS-1.0-1B |
| License | cc-by-nc-sa-4.0 |
| Supported Languages | en, ar, zh, nl, fr, de, it, ja, ko, lt, ru, es, pt, be, bn, ka, hu, lv, fa, pl, sw, ta, uk |
📈 Model Performance Comparison
| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Oute-TTS | 👉 Start on Colab | 1.5x faster | 58% less |
| Whisper Large V3 | 👉 Start on Colab | 1.5x faster | 50% less |
| Qwen3 (14B) | 👉 Start on Colab | 2x faster | 70% less |
| Llama 3.2 Vision (11B) | 👉 Start on Colab | 1.8x faster | 50% less |
🔗 Oute AI Links
- [Llama OuteTTS 1.0 1B](https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B)
- [Llama OuteTTS 1.0 1B GGUF](https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF)
- [GitHub Library](https://github.com/edwko/OuteTTS)
⚠️ Important Note
When using OuteTTS version 1.0, it is crucial to use the settings specified in the Sampling Configuration section.
The repetition penalty implementation is particularly important: this model requires penalization applied to a 64-token recent window rather than across the entire context window. Penalizing the entire context will cause the model to produce broken or low-quality output.
Currently, llama.cpp delivers the most reliable and consistent output quality by default. Both llama.cpp and EXL2 support this windowed sampling approach, while Transformers does not.
To address this limitation, I've implemented a windowed repetition penalty for the Hugging Face Transformers backend in the OuteTTS library, which significantly improves output quality, resolves sampling issues, and produces results comparable to llama.cpp.
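If you prefer to pin these values explicitly rather than rely on backend defaults, here is a minimal sketch. Only `temperature` appears in the usage example further below; the remaining field names are assumptions about `outetts.SamplerConfig`, so verify them against the version of the library you have installed.

```python
import outetts

# Hedged sketch: set the windowed repetition penalty explicitly.
# Field names other than `temperature` are assumptions; check
# outetts.SamplerConfig in your installed version.
sampler_config = outetts.SamplerConfig(
    temperature=0.4,          # value used in the Basic Usage example below
    repetition_penalty=1.1,   # assumed field: penalty strength
    repetition_range=64,      # assumed field: penalize only the last 64 tokens
)
```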
✨ Features
1. Prompt Revamp & Dependency Removal
- Automatic Word Alignment: The model now performs word alignment internally. Simply input raw text—no pre-processing required—and the model handles the rest, streamlining your workflow. For optimal results, use normalized, readable text without newlines (light normalization is applied automatically in the outetts library).
- Native Multilingual Text Support: Direct support for native text across multiple languages eliminates the need for romanization.
- Enhanced Metadata Integration: The updated prompt system incorporates additional metadata (time, energy, spectral centroid, pitch) at both global and word levels, improving speaker flow and synthesis quality.
- Special Tokens for Audio Codebooks: New tokens for c1 (codebook 1) and c2 (codebook 2).
2. New Audio Encoder Model
- DAC Encoder: Integrates the DAC audio encoder from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0), utilizing two codebooks for high-quality audio reconstruction.
- Performance Trade-off: Improved audio fidelity increases the token generation rate from 75 to 150 tokens per second. This trade-off prioritizes quality, especially for multilingual applications.
3. Voice Cloning
- One-Shot Voice Cloning: The model typically requires only around 10 seconds of reference audio to produce an accurate voice representation (see the sketch after this list).
- Improved Accuracy: Enhanced by the new encoder and additional training metadata, voice cloning is now more natural and precise.
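A minimal cloning sketch using the speaker API shown in the Usage Examples section below; the audio path is illustrative, and `interface` is the object constructed in Basic Usage:

```python
# Create a speaker profile from roughly 10 seconds of clean reference audio
# (path is illustrative), then save it so encoding happens only once.
speaker = interface.create_speaker("reference_about_10s.wav")
interface.save_speaker(speaker, "cloned_speaker.json")

# Later sessions can reload the profile instantly:
speaker = interface.load_speaker("cloned_speaker.json")
```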
4. Auto Text Alignment & Numerical Support
- Automatic Text Alignment: Aligns raw text at the word level, even for languages without clear boundaries (e.g., Japanese, Chinese), using insights from pre - processed training data.
- Direct Numerical Input: Built-in multilingual numerical support allows direct use of numbers in prompts—no textual conversion needed. The model typically chooses the dominant language present; mixing languages in a single prompt may lead to mistakes (see the example after this list).
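For instance, numbers can be passed straight into the prompt. A sketch reusing the `interface` and `speaker` objects from the Basic Usage example below; the prompt text is illustrative:

```python
# Digits are verbalized in the dominant language of the prompt (here, English);
# no textual conversion of "1,247.50" or the date is needed.
output = interface.generate(
    config=outetts.GenerationConfig(
        text="The total is 1,247.50 euros, due on 15 March 2025.",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
    )
)
output.save("numbers.wav")
```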
5. Multilingual Capabilities
- Supported Languages: OuteTTS offers varying proficiency levels across languages, based on training data exposure.
- High Training Data Languages: These languages feature extensive training: English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Lithuanian, Russian, Spanish
- Moderate Training Data Languages: These languages received moderate training, offering good performance with occasional limitations: Portuguese, Belarusian, Bengali, Georgian, Hungarian, Latvian, Persian/Farsi, Polish, Swahili, Tamil, Ukrainian
- Beyond Supported Languages: The model can generate speech in untrained languages with varying success. Experiment with unlisted languages, though results may not be optimal.
📦 Installation
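The library is published on PyPI as `outetts`. A minimal install is shown below; backend-specific extras (such as a llama.cpp binding for the LLAMACPP backend) may also be required, so see the GitHub repository linked above for full instructions.

```bash
pip install outetts
```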
💻 Usage Examples
Basic Usage
```python
import outetts

# Initialize the interface
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        # For llama.cpp backend
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
        # For transformers backend, use instead:
        # backend=outetts.Backend.HF,
    )
)

# Load the default speaker profile
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or create your own speaker profiles in seconds and reuse them instantly
# speaker = interface.create_speaker("path/to/audio.wav")
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Generate speech
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4
        ),
    )
)

# Save to file
output.save("output.wav")
```
Advanced Usage
For advanced settings and customization, visit the official repository:
👉 interface_usage.md
💡 Usage Tip
Speaker Reference: The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, often leading to lower-quality outputs. The model inherits the referenced speaker's emotion, style, and accent, so when generating speech in other languages with the same speaker, you may observe the model retaining the original accent.
Multilingual Application: It is recommended to create a speaker profile in the language you intend to use. This helps achieve the best results in that specific language, including tone, accent, and linguistic features. While the model supports cross-lingual speech, it still relies on the reference speaker: if the speaker has a distinct accent, such as British English, other languages may carry that accent as well.
Optimal Audio Length:
- Best Performance: Generate audio around 42 seconds long in a single run (approximately 8,192 tokens). Avoid pushing against the limits of this window when generating; the best results usually come from staying at or below about 7,000 tokens (see the sketch below).
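As a rough budget check, here is a sketch of the arithmetic behind these numbers, assuming the 150 tokens-per-second rate from the encoder section. Text and metadata tokens add overhead on top of the audio tokens, which is why the practical sweet spot sits nearer 42 seconds than the raw ceiling below.

```python
# Rough single-run budget: audio tokens at 150 tok/s plus text/metadata
# must fit the ~8,192-token window; ~7,000 tokens is the safe ceiling.
AUDIO_TOKENS_PER_SEC = 150
SAFE_TOKEN_BUDGET = 7000

max_audio_seconds = SAFE_TOKEN_BUDGET / AUDIO_TOKENS_PER_SEC
print(f"~{max_audio_seconds:.0f} s of audio before text/metadata overhead")  # ~47 s
```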
📚 Documentation
Video Showcase