KokoroTTS: Open-Source Text-to-Speech Model with a Lightweight Architecture, Good Audio Quality, High Speed, and Lower Cost

Kokorotts

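Developed by Daemontatox

Kokoro is an open-source text-to-speech model with 82 million parameters. Its lightweight architecture delivers audio quality comparable to larger models while being significantly faster and more cost-efficient.

Speech synthesis | English | License: Apache-2.0 | #LightweightTTS #MultilingualVoices #CostEffectiveSynthesis

Downloads: 78

Released: 2/27/2025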

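Model overview

Kokoro is a multilingual text-to-speech model based on the StyleTTS 2 architecture. It supports 8 languages and 54 voices and suits deployments ranging from production environments to personal projects.

Model features

Lightweight and efficient

A lightweight architecture of only 82 million parameters that still delivers audio quality comparable to larger models

Multilingual support

Supports 8 languages and 54 voices to cover a range of needs

Open-source license

Released under Apache-2.0, so it can be deployed freely in commercial and personal projects

Low training cost

Training cost of only about $1,000 (1,000 A100 GPU hours)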

Model capabilities

High-quality text-to-speech

Multilingual speech synthesis

Voice switching

Speech rate adjustment

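Use cases

Content creation

Audiobook generation

Converts written content into natural-sounding speech

Supports a choice of languages and voices

Assistive technology

Voice assistance applications

Provides speech output for visually impaired users

The lightweight model is suited to mobile deployment

Education

Language learning tools

Generates pronunciation demonstrations in multiple languages

Supports accurate pronunciation in 8 languages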

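🚀 Kokoro - A Lightweight, Efficient Text-to-Speech Model

Kokoro is an open-weight text-to-speech (TTS) model with 82 million parameters. Despite its lightweight architecture, it delivers quality comparable to larger models while being faster and more cost-efficient. Its weights are Apache-licensed, so it can be used widely in production environments and personal projects.

⬆️ Kokoro has been upgraded to v1.0! See Releases.

🚀 No-code demo: https://hf.co/spaces/hexgrad/Kokoro-TTS

✨ You can now install it with pip install kokoro! See Usage.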

Releases
Usage
Evaluation docs ↗️
Sample audio ↗️
Voice list ↗️
Model information
Training details
Creative Commons attribution
Acknowledgements

🚀 Quick Start

Try the no-code demo at https://hf.co/spaces/hexgrad/Kokoro-TTS, or install the package with pip: pip install kokoro.

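✨ Key Features

Lightweight and efficient: only 82 million parameters, yet it delivers speech quality comparable to larger models while being faster and cheaper to run.
Broadly usable: Apache-licensed, so it can be used in production environments and personal projects.
Multilingual support: supports multiple languages and a wide range of voices.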
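
As a back-of-the-envelope check on the "lightweight" claim, the sketch below estimates the memory needed for the weights alone from the stated 82 million parameters. The bytes-per-parameter values are the standard sizes of the named numeric formats, not figures taken from the Kokoro release, and real memory use will be higher once activations and runtime overhead are included.

```python
# Rough weights-only memory estimate for an 82M-parameter model.
# Assumption: every parameter is stored in one numeric format.
PARAMS = 82_000_000  # parameter count stated in this document

# Standard storage sizes in bytes; which format a given checkpoint
# actually uses is not stated here.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_megabytes(params: int, bytes_per_param: int) -> float:
    """Return the weights-only size in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_param / 1_000_000


for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weights_megabytes(PARAMS, nbytes):.0f} MB")
```

Even at full 32-bit precision the weights stay in the low hundreds of megabytes, which is consistent with the model being described as suitable for small deployments.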

📦 Installation

Install the kokoro inference library with the following command:

pip install kokoro

💻 Usage Examples

Basic usage

You can run the following code in Google Colab:

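# 1️⃣ Install kokoro (the quotes keep the shell from treating >= as a redirect)
!pip install -q "kokoro>=0.8.2" soundfile
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇧🇷 'p' => Brazilian Portuguese pt-br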

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches the voice

# This text is for demonstration purposes only and was not seen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.

[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# text = '「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。'
# text = '中國人民不信邪也不怕邪，不惹事也不怕事，任何外國不要指望我們會拿自己的核心利益做交易，不要指望我們會吞下損害我國主權、安全、發展利益的苦果！'
# text = 'Los partidos políticos tradicionales compiten con los populismos y los movimientos asamblearios.'
# text = 'Le dromadaire resplendissant déambulait tranquillement dans les méandres en mastiquant de petites feuilles vernissées.'
# text = 'ट्रांसपोर्टरों की हड़ताल लगातार पांचवें दिन जारी, दिसंबर से इलेक्ट्रॉनिक टोल कलेक्शनल सिस्टम'
# text = "Allora cominciava l'insonnia, o un dormiveglia peggiore dell'insonnia, che talvolta assumeva i caratteri dell'incubo."
# text = 'Elabora relatórios de acompanhamento cronológico para as diferentes unidades do Departamento que propõem contratos.'

# 4️⃣ Generate, display, and save audio files in a loop
generator = pipeline(
    text, voice='af_heart', # <= change the voice here
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
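
The example above asks that lang_code match the voice, splits the input with split_pattern=r'\n+', and writes audio at 24000 Hz. The sketch below shows how those three details could be checked in plain Python before or after calling the pipeline. It does not use the kokoro library, and all three helper names are hypothetical. It also assumes that the first letter of a voice name is its language code, which is inferred from 'af_heart' being paired with lang_code='a'. The real pipeline may also break long segments further, so the preview is only an approximation.

```python
import re

# Language codes listed in the comments of the example above.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def voice_matches_lang(voice: str, lang_code: str) -> bool:
    """Assumption: a voice name starts with its language code ('af_heart' -> 'a')."""
    return lang_code in LANG_CODES and voice[:1] == lang_code


def preview_segments(text: str, split_pattern: str = r"\n+") -> list[str]:
    """Split text on split_pattern and drop empty pieces."""
    return [s.strip() for s in re.split(split_pattern, text) if s.strip()]


def duration_seconds(num_samples: int, sample_rate: int = 24000) -> float:
    """Length of a clip in seconds at the sample rate used in the example."""
    return num_samples / sample_rate


if __name__ == "__main__":
    demo = "First paragraph.\n\nSecond paragraph.\n"
    print(voice_matches_lang("af_heart", "a"))
    print(preview_segments(demo))
    print(duration_seconds(48000))
```

Checking the pairing up front makes it easier to catch a voice loaded into the wrong language pipeline before any audio is generated.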

Advanced usage

Under the hood, kokoro uses misaki, a G2P (grapheme-to-phoneme) library; explore its advanced features as needed.
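
The demo text in the basic example writes [Kokoro](/kˈOkəɹO/), which appears to attach an explicit phoneme string to a word. The sketch below only illustrates that markup with a regular expression. It is not misaki's implementation, and the function names are hypothetical.

```python
import re

# Matches [word](/phonemes/) as it appears in the demo text.
OVERRIDE = re.compile(r"\[([^\]]+)\]\(/([^/]+)/\)")


def find_overrides(text: str) -> list[tuple[str, str]]:
    """Return (word, phonemes) pairs for every inline override in text."""
    return OVERRIDE.findall(text)


def strip_overrides(text: str) -> str:
    """Replace each override with its plain word, e.g. for display or logging."""
    return OVERRIDE.sub(r"\1", text)


if __name__ == "__main__":
    sample = "[Kokoro](/kˈOkəɹO/) is an open-weight TTS model."
    print(find_overrides(sample))
    print(strip_overrides(sample))
```

Listing the overrides in a script makes it easy to review which words carry hand-written pronunciations before synthesis.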

📚 Documentation

For more information, see the evaluation docs, sample audio, and voice list linked in the contents above.

🔧 Technical Details

Model information

Property	Details
Model architecture	StyleTTS 2 (https://arxiv.org/abs/2306.07691), ISTFTNet (https://arxiv.org/abs/2203.02395); decoder only, no diffusion, no encoder release
Architecture design	Li et al. (https://github.com/yl4579/StyleTTS2)
Trained by	`@rzvzn` (Discord)
Supported languages	Multiple
Model SHA256 hash	`496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`
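
The SHA256 hash in the table can be used to confirm that a downloaded weights file is intact. The sketch below uses only Python's standard hashlib module. The helper names are hypothetical, and the document does not say which file the hash belongs to, so the path passed in is the caller's assumption.

```python
import hashlib
from pathlib import Path

# Hash listed in the model information table.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()
```

Calling verify_model(Path("path/to/weights-file")) with the location of the downloaded weights (a placeholder path here) returns False if the file is truncated, corrupted, or a different release.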

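Training details

Training data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio[1] generated by closed TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  Note: synthetic audio from open TTS models or "custom voice clones" is excluded.
Total dataset size: a few hundred hours of audio.
Total training cost: about $1,000 for 1,000 hours of A100 80GB vRAM.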
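
The two training-cost figures above imply an hourly GPU rate, which the short sketch below works out. The function names are hypothetical, and the second helper simply scales the implied rate to a different number of hours. It is an illustration, not a pricing claim.

```python
# Figures stated in the training details above (both approximate).
TOTAL_COST_USD = 1000  # total training cost
GPU_HOURS = 1000       # hours of A100 80GB time


def cost_per_gpu_hour(total_cost: float, gpu_hours: float) -> float:
    """Implied price of one GPU hour."""
    return total_cost / gpu_hours


def estimated_cost(gpu_hours: float, rate: float) -> float:
    """Cost of a run of gpu_hours at the given hourly rate."""
    return gpu_hours * rate


if __name__ == "__main__":
    rate = cost_per_gpu_hour(TOTAL_COST_USD, GPU_HOURS)
    print(f"Implied rate: ${rate:.2f} per A100 hour")
    print(f"A 250-hour run at that rate: ${estimated_cost(250, rate):.2f}")
```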