Kokoro is an open-weight series of small yet powerful text-to-speech (TTS) models supporting both English and Chinese, now featuring data from 100 Chinese speakers sourced from a professional dataset.
Model Features

- **Multilingual Support**: Supports both English and Chinese, with newly added data from 100 Chinese speakers.
- **Compact yet Powerful**: Only 82 million parameters, yet delivers robust performance.
- **Open Weights**: Apache-2.0 licensed, with open weights available for broad use and modification.
- **Professional Dataset**: Chinese data provided free of charge by the professional dataset company LongMaoData, ensuring high quality.

Model Capabilities

- Text-to-speech
- Multilingual voice synthesis
- Multiple speaker voices

Use Cases

- **Chinese voice synthesis**: Uses data from 100 Chinese speakers in a professional dataset to generate natural, fluent Chinese speech.
- **English voice synthesis**: Supports various English accents and speaker voices, generating natural, fluent English speech.
🚀 Kokoro - An Open-Weight Series of TTS Models
Kokoro is an open-weight series of small but powerful TTS models. It aims to provide high-quality text-to-speech capabilities with relatively small model sizes.
🐈 GitHub: https://github.com/hexgrad/kokoro
This model is the result of a short training run that added 100 Chinese speakers from a professional dataset. The Chinese data was freely and permissively granted to us by LongMaoData, a professional dataset company. Thank you for making this model possible.
Separately, some crowdsourced synthetic English data also entered the training mix:
- 1 hour of Maple, an American female.
- 1 hour of Sol, another American female.
- 1 hour of Vale, an older British female.
This model is not a strict upgrade over its predecessor since it drops many voices, but it is released early to gather feedback on new voices and tokenization. Aside from the Chinese dataset and the 3 hours of English, the rest of the data was left behind for this training run. The goal is to push the model series forward and ultimately restore some of the voices that were left behind.
Current guidance from the U.S. Copyright Office indicates that synthetic data generally does not qualify for copyright protection. Since this synthetic data is crowdsourced, the model trainer is not bound by any Terms of Service. This Apache-licensed model also aligns with OpenAI's stated mission of broadly distributing the benefits of AI. If you would like to help further that mission, consider contributing permissive audio data to the cause.
[1] LongMaoData had no involvement in the crowdsourced synthetic English data. [2] The following Chinese text is machine-translated.
TODO: Improve usage. Similar to https://hf.co/hexgrad/Kokoro-82M#usage, but pass repo_id='hexgrad/Kokoro-82M-v1.1-zh' when constructing a KModel or KPipeline. See make_en.py and make_zh.py.
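Until the usage section is written, the following is a minimal sketch based on the Kokoro-82M usage pattern referenced above, with `repo_id` pointing at this checkpoint as the TODO describes. It assumes the `kokoro` package is installed (`pip install kokoro soundfile`); the `lang_code` and voice name below are assumptions and may differ for this release.

```python
# Minimal usage sketch, assuming the `kokoro` package's KPipeline API
# as documented for Kokoro-82M. The voice name 'zf_001' is an assumption;
# check the repo's voices/ directory for the actual identifiers.
from kokoro import KPipeline
import soundfile as sf

# lang_code 'z' selects Mandarin Chinese; pass this model's repo explicitly.
pipeline = KPipeline(lang_code='z', repo_id='hexgrad/Kokoro-82M-v1.1-zh')

text = '你好，世界。'
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='zf_001')):
    sf.write(f'{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```

For English synthesis with this checkpoint, the same constructor call with an English `lang_code` and one of the English voices (e.g. Maple, Sol, or Vale) should apply; again, consult the repo for exact voice names.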
📚 Documentation
Releases
| Property | v0.19 | v1.0 | v1.1-zh | Total |
|----------|-------|------|---------|-------|
| Published | 2024 Dec 25 | 2025 Jan 27 | 2025 Feb 26 | - |
| Training Data | <100 hrs | Few hundred hrs | >100 hrs | - |
| Languages & Voices | 1 & 10 | 8 & 54 | 2 & 103 | - |
| SHA256 | 3b0c392f | 496dba11 | b1d8410f | - |
| Training (A100 80GB GPU hours) | 500 | 500 | 120 | 1120 |
| Average Hourly Rate | $0.80/h | $1.20/h | $0.90/h | - |
| Cost in USD | $400 | $600 | $110 | $1110 |
Model Facts
| Property | Details |
|----------|---------|
| Architecture | StyleTTS 2: https://arxiv.org/abs/2306.07691; ISTFTNet: https://arxiv.org/abs/2203.02395; Decoder only: no diffusion, no encoder release; 82 million parameters, same as https://hf.co/hexgrad/Kokoro-82M |

Model Architecture: The model is based on the StyleTTS 2 and ISTFTNet architectures. StyleTTS 2 provides a framework for generating speech with different styles, and ISTFTNet performs the inverse short-time Fourier transform in the speech synthesis process.

Training Data: The training data includes Chinese data from LongMaoData and crowdsourced synthetic English data. The combination of these sources helps the model learn different language patterns and voice characteristics.
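The inverse short-time Fourier transform (ISTFT) step mentioned above can be illustrated with a toy round-trip in SciPy: ISTFTNet-style vocoders have a network predict a spectrogram, then apply an ISTFT to obtain the waveform, which is cheap and (for a valid spectrogram) nearly lossless. This is only an illustration of the signal-processing operation, not the model's actual vocoder code.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 24000                         # Kokoro outputs 24 kHz audio
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220.0 * t)  # 1 s synthetic 220 Hz tone

# Forward STFT produces the kind of complex spectrogram a vocoder
# network would predict; the ISTFT maps it back to a waveform.
f, frames, Z = stft(x, fs=fs, nperseg=512)
_, x_rec = istft(Z, fs=fs, nperseg=512)

# With a COLA-satisfying window (default Hann, 50% overlap) the
# round-trip reconstruction error is near machine precision.
err = np.max(np.abs(x - x_rec[:len(x)]))
```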
📄 License
This project is licensed under the Apache-2.0 license.
Acknowledgements
TODO: Write acknowledgements. Similar to https://hf.co/hexgrad/Kokoro-82M#acknowledgements