🚀 Kokoro-82M Text-to-Speech Model
Kokoro is a cutting-edge TTS model with 82 million parameters that takes text as input and outputs high-quality audio. It offers a remarkable balance between performance and resource utilization, ranking highly in evaluations despite relatively few parameters and little training data.
🚀 Quick Start
You can find a hosted demo at hf.co/spaces/hexgrad/Kokoro-TTS.
✨ Features
- High Performance with Fewer Resources: Kokoro v0.19 achieved a high Elo rating in the TTS Spaces Arena, using fewer parameters and less data compared to other models.
- Multiple Voicepacks: As of 2 Jan 2025, 10 unique Voicepacks have been released.
- ONNX Support: An ONNX version of v0.19 is available.
📦 Installation
The following can be run in a single cell on Google Colab.
```python
# 1️⃣ Install dependencies silently
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch

# 2️⃣ Build the model and load the default voicepack
from models import build_model
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICE_NAME = [
    'af',  # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

# 3️⃣ Call generate, which returns 24 kHz audio and the phonemes used
from kokoro import generate
text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
# Language is determined by the first letter of the VOICE_NAME:
# 🇺🇸 'a' => American English => en-us
# 🇬🇧 'b' => British English => en-gb

# 4️⃣ Display the 24 kHz audio and print the output phonemes
from IPython.display import display, Audio
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)
```
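If you want to save the generated audio to disk instead of only playing it inline, you can write it as a 24 kHz WAV file with scipy, which is already installed above. A minimal sketch, assuming `audio` is the float array returned by `generate` (it is the same object passed to `IPython.display.Audio` above):

```python
# Write the 24 kHz audio returned by generate to a WAV file
from scipy.io.wavfile import write
write(f'{VOICE_NAME}_output.wav', 24000, audio)
```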
If you have trouble with `espeak-ng`, see this GitHub issue. Mac users should also see this, and Windows users should see this.
For ONNX usage, see #14.
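If you want to experiment with the ONNX export via onnxruntime, the sketch below only loads the session and inspects its input signature; the filename is an assumption, and #14 remains the authoritative reference for actual inference:

```python
# Load the ONNX export and list its inputs (hypothetical filename)
import onnxruntime as ort

sess = ort.InferenceSession('kokoro-v0_19.onnx')
print([(i.name, i.shape) for i in sess.get_inputs()])
```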
💻 Usage Examples
Basic Usage
Basic usage is identical to the Quick Start above.
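One common variation is switching voices. A minimal sketch, reusing `MODEL`, `device`, `text`, and `generate` from the Quick Start; the language passed to `generate` is derived from the first letter of the voice name:

```python
# Switch to a British English voice ('b' => en-gb)
VOICE_NAME = 'bf_emma'
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])
```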
Advanced Usage
You can create new voices by blending existing voicepacks; the default `af` voice is simply the mean of Bella and Sarah:

```python
import torch

# Load two single-speaker American female voicepacks
bella = torch.load('voices/af_bella.pt', weights_only=True)
sarah = torch.load('voices/af_sarah.pt', weights_only=True)

# Averaging them reproduces the default 'af' voicepack exactly
af = torch.mean(torch.stack([bella, sarah]), dim=0)
assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
```
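The same trick generalizes to weighted blends. This is a hypothetical example, not an officially released voice:

```python
import torch

# Hypothetical 75/25 blend of Bella and Sarah (not an official voicepack)
bella = torch.load('voices/af_bella.pt', weights_only=True)
sarah = torch.load('voices/af_sarah.pt', weights_only=True)
af_custom = 0.75 * bella + 0.25 * sarah
torch.save(af_custom, 'voices/af_custom.pt')
```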
📚 Documentation
Model Facts
Property | Details
---|---
Architecture | StyleTTS 2: https://arxiv.org/abs/2306.07691; ISTFTNet: https://arxiv.org/abs/2203.02395; decoder only: no diffusion, no encoder release
Architected by | Li et al @ https://github.com/yl4579/StyleTTS2
Trained by | @rzvzn on Discord
Supported Languages | American English, British English
Model SHA256 Hash | 3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a
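Since the weight hash is published above, you can sanity-check your local download; a minimal sketch, assuming the filename used in the Quick Start:

```python
# Verify kokoro-v0_19.pth against the published SHA256 hash
import hashlib

with open('kokoro-v0_19.pth', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == '3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a'
```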
Releases
- 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
- 26 Dec 2024: `am_adam`, `am_michael`
- 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`
- 30 Dec 2024: `af_nicole`
- 31 Dec 2024: `af_sky`
- 2 Jan 2025: ONNX v0.19 `ebef4245`
Licenses
- Apache 2.0 weights in this repository
- MIT inference code in spaces/hexgrad/Kokoro-TTS adapted from yl4579/StyleTTS2
- GPLv3 dependency in espeak-ng
The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro. Original models published by the paper author can be found at hf.co/yl4579.
Evaluation
Metric: Elo rating
Leaderboard: hf.co/spaces/Pendrokar/TTS-Spaces-Arena
The voice ranked in the Arena is a 50-50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as `af.pt`; you can trivially reproduce it with the voicepack-averaging snippet in Advanced Usage above.
Training Details
Compute: Kokoro was trained on A100 80GB vRAM instances rented from Vast.ai (referral link). Vast was chosen over other compute providers due to its competitive on-demand hourly rates. The average hourly cost for the A100 80GB vRAM instances used for training was below $1/hr per GPU, which was around half the quoted rates from other providers at the time.
Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio[1] generated by closed[2] TTS models from large providers
[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"
Epochs: Less than 20 epochs
Total Dataset Size: Less than 100 hours of audio
Limitations
Kokoro v0.19 is limited in some specific ways, due to its training set and/or architecture:
- [Data] Lacks voice cloning capability, likely due to small <100h training set
- [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
- [Data] Training dataset is mostly long-form reading and narration, not conversation
- [Arch] At 82M params, Kokoro almost certainly falls to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
- [Data] Multilingual capability is architecturally feasible, but training data is mostly English
Refer to the Philosophy discussion to better understand these limitations.
Will the other voicepacks be released?
There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at hf.co/spaces/hexgrad/Kokoro-TTS.
Acknowledgements
- @yl4579 for architecting StyleTTS 2
- @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena
Model Card Contact
@rzvzn on Discord. Server invite: https://discord.gg/QuGxSWBfQy

https://terminator.fandom.com/wiki/Kokoro


