Kokoro is an open-weight text-to-speech (TTS) model with 82 million parameters, known for its lightweight architecture and high audio quality while remaining fast and cost-effective.
Its Apache-licensed weights allow it to be deployed in scenarios ranging from production environments to personal projects.
Model Features
Lightweight architecture
Despite its smaller parameter size, it delivers audio quality comparable to larger models.
Cost efficiency
Less than $1 per million characters of text input, or equivalently under $0.06 per hour of audio output (a back-of-envelope check follows this list).
Multilingual support
Supports 8 languages and 54 voices, suitable for diverse application scenarios.
Open-source license
Licensed under Apache, allowing free deployment in commercial and personal projects.
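For a quick sanity check of how the two cost figures above relate (both numbers come from the cost item in this list; no new pricing is assumed), dividing one by the other gives the implied synthesis throughput:
# Back-of-envelope check relating the two cost figures quoted above (illustrative only).
cost_per_million_chars = 1.00   # < $1 per million characters of text input
cost_per_hour_of_audio = 0.06   # < $0.06 per hour of audio output
chars_per_hour = cost_per_hour_of_audio * 1_000_000 / cost_per_million_chars
print(chars_per_hour)           # ~60,000 characters of text per hour of generated audio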
Model Capabilities
Text-to-speech
Multilingual speech synthesis (see the sketch after this list)
Efficient audio generation
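As a sketch of the multilingual capability, KPipeline takes a lang_code when it is constructed; the quickstart further down this page uses 'a' (American English). The other language codes and voice names here are assumptions based on the kokoro/misaki documentation and should be checked against the installed version, and non-English pipelines assume espeak-ng is installed (see the setup commands below).
from kokoro import KPipeline

# 'a' / 'af_heart' match the quickstart below; the Spanish and French
# codes and voices are assumed from the kokoro docs and may vary by version.
examples = [
    ('a', 'af_heart', 'Hello, world!'),       # American English
    ('e', 'ef_dora',  'Hola, mundo!'),        # Spanish (assumed code/voice)
    ('f', 'ff_siwis', 'Bonjour le monde !'),  # French (assumed code/voice)
]
for lang_code, voice, text in examples:
    pipeline = KPipeline(lang_code=lang_code)
    for _, _, audio in pipeline(text, voice=voice):
        print(lang_code, voice, audio.shape)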
Use Cases
Commercial applications
Voice assistants
Provides high-quality speech output for commercial applications.
Efficient and low-cost speech synthesis solution.
Audiobooks
Generates natural and fluent audiobook content.
High-quality multilingual speech output.
Personal projects
Personal voice assistants
Offers customized speech output for personal projects.
Lightweight and easy-to-deploy solution.
🚀 Kokoro
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
# gs: graphemes (the text chunk), ps: phonemes, audio: 24 kHz waveform
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)
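If a single output file is preferred over per-chunk WAVs, the chunks can be concatenated. A minimal sketch continuing from the snippet above (it reuses pipeline and text, and assumes numpy is available):
import numpy as np

# Collect every generated chunk and write one continuous 24 kHz file.
chunks = [np.asarray(audio) for _, _, audio in pipeline(text, voice='af_heart')]
sf.write('full.wav', np.concatenate(chunks), 24000)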
Under the hood, kokoro uses misaki, a G2P library at https://github.com/hexgrad/misaki.
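The bracketed markup in the sample text above, e.g. [Kokoro](/kˈOkəɹO/), is how misaki lets you pin a word to explicit phonemes instead of relying on automatic G2P; unmarked words are phonemized normally. A minimal sketch of the same idea:
from kokoro import KPipeline

pipeline = KPipeline(lang_code='a')
# Force the pronunciation of "Kokoro" with misaki's [word](/phonemes/) markup.
marked = "[Kokoro](/kˈOkəɹO/) is a lightweight TTS model."
for graphemes, phonemes, audio in pipeline(marked, voice='af_heart'):
    print(graphemes)  # the text chunk as written
    print(phonemes)   # the phoneme string actually synthesized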
✨ Features
Lightweight and High-Quality: Despite having a lightweight architecture with 82 million parameters, it offers comparable quality to larger models.
Fast and Cost-Efficient: It is significantly faster and more cost-efficient, making it suitable for various scenarios.
Apache-Licensed: With Apache-licensed weights, it can be freely deployed in production environments and personal projects.
Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
Public domain audio
Audio licensed under Apache, MIT, etc.
Synthetic audio[1] generated by closed[2] TTS models from large providers
[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"
Total Dataset Size: A few hundred hours of audio
Total Training Cost: About $1000 for 1000 hours of A100 80GB vRAM
Creative Commons Attribution
The following CC BY audio was part of the dataset used to train Kokoro v1.0.
This is an Apache-licensed model, and Kokoro has been deployed in numerous projects and commercial APIs. We welcome the deployment of the model in real use cases.
⚠️ Caution
Fake websites like kokorottsai_com (snapshot: https://archive.ph/nRRnk) and kokorotts_net (snapshot: https://archive.ph/60opa) are likely scams masquerading under the banner of a popular model.
Any website containing "kokoro" in its root domain (e.g. kokorottsai_com, kokorotts_net) is NOT owned by and NOT affiliated with this model page or its author, and attempts to imply otherwise are red flags.