KokoroTTS Open-Source Text-to-Speech Model - Lightweight Architecture with Good Audio Quality, High Speed, and Lower Cost

Kokorotts

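Developed by Daemontatox

Kokoro is an open-source text-to-speech model with 82 million parameters. Its lightweight architecture delivers audio quality comparable to larger models while being significantly faster and more cost-efficient.

Speech Synthesis | English | License: Apache-2.0 | #LightweightTTS #MultilingualVoices #CostEffectiveSynthesis

Downloads: 78

Released: 2/27/2025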

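Model Overview

Kokoro is a multilingual text-to-speech model based on the StyleTTS 2 architecture. It supports 8 languages and 54 voices and suits deployments ranging from production environments to personal projects.

Model Features

Lightweight and efficient

A lightweight architecture with only 82 million parameters that still delivers audio quality comparable to larger models

Multilingual support

Supports 8 languages and 54 voices to cover a range of needs

Open-source license

Released under Apache-2.0, so it can be deployed freely in commercial and personal projects

Low-cost training

Training cost only about $1000 (1000 A100 GPU hours)

Model Capabilities

High-quality text-to-speech

Multilingual speech synthesis

Voice switching

Speed adjustment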

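Use Cases

Content creation

Audiobook generation

Converts written content into natural-sounding speech

Supports a choice of languages and voices

Assistive technology

Voice assistance applications

Provides speech output for visually impaired users

The lightweight model suits mobile deployment

Education

Language learning tools

Generates multilingual pronunciation examples

Supports accurate pronunciation in 8 languages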

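🚀 Kokoro - Lightweight, Efficient Text-to-Speech Model

Kokoro is an open-weight text-to-speech (TTS) model with 82 million parameters. Despite its lightweight architecture, it delivers quality comparable to larger models while being faster and more cost-efficient. Its weights are Apache-licensed, so it can be deployed anywhere from production environments to personal projects.

⬆️ Kokoro has been upgraded to v1.0! See Releases.

🚀 No-code demo: https://hf.co/spaces/hexgrad/Kokoro-TTS

✨ You can now install it with pip install kokoro! See Usage.

Releases
Usage
Evaluation docs ↗️
Sample audio ↗️
Voice list ↗️
Model Information
Training Details
Creative Commons Attribution Notice
Acknowledgements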

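🚀 Quick Start

Try the no-code demo at https://hf.co/spaces/hexgrad/Kokoro-TTS, or install with pip: pip install kokoro.

✨ Key Features

Lightweight and efficient: only 82 million parameters, yet speech quality comparable to larger models, with higher speed and lower cost.
Broadly applicable: Apache-licensed, so it can be used in production environments and personal projects.
Multilingual support: supports multiple languages and a wide range of voices.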
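
As a rough illustration of how language and voice selection fit together: the usage example later in this document picks a language with a one-letter lang_code and a voice such as af_heart, and notes that the two must match. The sketch below collects the codes listed in that example and guesses the code from a voice name. It assumes the first letter of a voice name is its language code, which the pairing of af_heart with lang_code='a' suggests but this document does not state. lang_code_for_voice is a hypothetical helper, not part of kokoro.

```python
# Language codes as listed in the comments of this document's usage example.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def lang_code_for_voice(voice: str) -> str:
    """Guess the lang_code for a voice name (hypothetical helper).

    ASSUMPTION: the first character of the voice name is the language code,
    as in 'af_heart' -> 'a'. Check the voice list before relying on this.
    """
    code = voice[:1].lower()
    if code not in LANG_CODES:
        raise ValueError(f"cannot infer a language code from voice {voice!r}")
    return code


if __name__ == "__main__":
    voice = "af_heart"
    code = lang_code_for_voice(voice)
    print(f"{voice} -> lang_code={code!r} ({LANG_CODES[code]})")
```

Failing early on an unknown prefix is a deliberate choice here: a mismatched language and voice is easier to diagnose before synthesis than after.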

📦 Installation

Install the kokoro inference library with the following command:

pip install kokoro
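
After installing, it can be useful to confirm which version ended up in the environment, since the usage example in this document asks for kokoro 0.8.2 or newer. The following is a minimal sketch using only the Python standard library. parse_version, installed_version, and meets_minimum are hypothetical helper names, and the version comparison is deliberately simplified (numeric dotted parts only).

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.8.2' into (0, 8, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed: str, minimum: str) -> bool:
    """Simplified check: compare the numeric parts as tuples."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    found = installed_version("kokoro")
    if found is None:
        print("kokoro is not installed")
    else:
        print(f"kokoro {found}, new enough: {meets_minimum(found, '0.8.2')}")
```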

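💻 Usage Examples

Basic Usage

You can run the following code in Google Colab:

# 1️⃣ Install kokoro (the quotes stop the shell from treating >= as a redirect)
!pip install -q "kokoro>=0.8.2" soundfile
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇧🇷 'p' => Brazilian Portuguese pt-br

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice

# This text is for demonstration purposes only, unseen during training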
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.

[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# text = '「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。'
# text = '中國人民不信邪也不怕邪，不惹事也不怕事，任何外國不要指望我們會拿自己的核心利益做交易，不要指望我們會吞下損害我國主權、安全、發展利益的苦果！'
# text = 'Los partidos políticos tradicionales compiten con los populismos y los movimientos asamblearios.'
# text = 'Le dromadaire resplendissant déambulait tranquillement dans les méandres en mastiquant de petites feuilles vernissées.'
# text = 'ट्रांसपोर्टरों की हड़ताल लगातार पांचवें दिन जारी, दिसंबर से इलेक्ट्रॉनिक टोल कलेक्शनल सिस्टम'
# text = "Allora cominciava l'insonnia, o un dormiveglia peggiore dell'insonnia, che talvolta assumeva i caratteri dell'incubo."
# text = 'Elabora relatórios de acompanhamento cronológico para as diferentes unidades do Departamento que propõem contratos.'

# 4️⃣ Generate, display, and save audio files in a loop
generator = pipeline(
    text, voice='af_heart', # <= change voice here
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
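
The example above passes split_pattern=r'\n+' and writes each result at 24000 Hz. The sketch below shows, with the standard library only, how such a pattern breaks text into segments and how a sample count maps to a duration. It assumes the splitting behaves like re.split with empty pieces dropped. The real pipeline may split long segments further, so treat this as an approximation. split_segments and duration_seconds are hypothetical helpers.

```python
import re

SAMPLE_RATE = 24000  # Hz, the rate passed to sf.write in the example above


def split_segments(text: str, split_pattern: str = r"\n+") -> list:
    """Split text on the pattern and drop empty pieces.

    ASSUMPTION: this approximates, and does not reproduce, KPipeline's
    own splitting logic.
    """
    return [seg.strip() for seg in re.split(split_pattern, text) if seg.strip()]


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of an audio clip in seconds, given its number of samples."""
    return num_samples / sample_rate


if __name__ == "__main__":
    demo = "\nFirst paragraph.\n\nSecond paragraph.\nThird line.\n"
    for i, seg in enumerate(split_segments(demo)):
        print(i, seg)
    print(duration_seconds(48000), "seconds")
```

With the r'\n+' pattern, runs of blank lines count as a single separator, so each paragraph of the demo text would become its own numbered .wav file.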

Advanced Usage

Under the hood, kokoro uses misaki, a G2P (grapheme-to-phoneme) library, whose advanced features you can explore as needed.
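
The demo text in the basic example contains [Kokoro](/kˈOkəɹO/), which appears to attach a phoneme string to a word. The sketch below finds such spans with a regular expression and shows the text with the markup removed. It assumes the bracket-and-slash form marks a pronunciation override handled by the G2P step. This document does not describe the syntax, and the code is an illustration, not misaki's parser. find_overrides and strip_overrides are hypothetical helpers.

```python
import re

# Matches "[word](/phonemes/)" as it appears in the demo text.
# ASSUMPTION: this pattern is inferred from that one example only.
OVERRIDE = re.compile(r"\[([^\]]+)\]\(/([^/]+)/\)")


def find_overrides(text: str) -> list:
    """Return (word, phonemes) pairs for every override-style span."""
    return OVERRIDE.findall(text)


def strip_overrides(text: str) -> str:
    """Replace each override-style span with just its word."""
    return OVERRIDE.sub(r"\1", text)


if __name__ == "__main__":
    sample = "[Kokoro](/kˈOkəɹO/) is an open-weight TTS model."
    for word, phonemes in find_overrides(sample):
        print(f"{word} -> {phonemes}")
    print(strip_overrides(sample))
```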

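📚 Documentation

For more information, see the evaluation docs, sample audio, and voice list linked in the table of contents above.

🔧 Technical Details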

Model Information

Property	Details
Architecture	StyleTTS 2 (https://arxiv.org/abs/2306.07691), ISTFTNet (https://arxiv.org/abs/2203.02395); decoder only: no diffusion, no encoder release
Architected by	Li et al. (https://github.com/yl4579/StyleTTS2)
Trained by	`@rzvzn` (Discord)
Supported languages	Multiple
Model SHA256 hash	`496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`

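Training Details

Training data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio^[1] generated by closed TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  Note: synthetic audio from open TTS models or "custom voice clones" was not used.
Total dataset size: a few hundred hours of audio.
Total training cost: about $1000 for 1000 hours on A100 80GB vRAM GPUs.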
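
The cost figure above implies a simple hourly rate. The sketch below works through that arithmetic and shows how the total would scale at a different rate. The $2.50 rate in the demo is a made-up value for illustration, not a quoted price, and both function names are hypothetical.

```python
def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Average cost per GPU hour implied by a total cost and hour count."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return total_cost_usd / gpu_hours


def estimated_cost(gpu_hours: float, hourly_rate_usd: float) -> float:
    """Total cost for a number of GPU hours at a given hourly rate."""
    return gpu_hours * hourly_rate_usd


if __name__ == "__main__":
    # Figures from the training details: about $1000 for 1000 A100 hours.
    rate = implied_hourly_rate(1000, 1000)
    print(f"implied rate: ${rate:.2f} per A100 hour")
    # Hypothetical what-if rate, for illustration only.
    print(f"at $2.50/hour: ${estimated_cost(1000, 2.5):.0f}")
```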