Kokoro-82M open-source TTS model: audio quality comparable to larger models, fast, low-cost, and free to use

Kokoro 82M

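Developed by prince-canuma

Kokoro is an open-source TTS model with 82 million parameters. Its audio quality is comparable to that of larger models, while it is significantly faster and more cost-efficient.

Speech synthesis | English | Open source | License: Apache-2.0 | #lightweight-TTS #multilingual-speech-synthesis #cost-effective-speech-generation

Downloads: 376

Release date: 2/26/2025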

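Model overview

Kokoro is a lightweight text-to-speech model based on the StyleTTS 2 architecture. It supports multiple languages and voices and is suited to both production environments and personal projects.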

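Model features

Lightweight and efficient

An 82-million-parameter architecture that delivers fast inference while maintaining high audio quality.

Multilingual support

Supports 8 languages and 54 voices for a range of use cases.

Open-source license

Released under the Apache-2.0 license, so it can be used freely in commercial and personal projects.

Low-cost training

Trained on A100 GPUs for a total cost of about $1000.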

Model capabilities

High-quality speech synthesis

Multilingual speech generation

Voice switching

Speaking-rate adjustment

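Use cases

Content creation

Audiobook generation

Convert text content into natural-sounding speech

Produces high-quality, expressive audio

Video dubbing

Add multilingual audio to video content

Audio output in multiple languages and voices

Assistive technology

Voice assistance applications

Provide text-to-speech for visually impaired users

Produces clear, natural-sounding audio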

🚀 Kokoro

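Kokoro is an open-weight text-to-speech (TTS) model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

⬆️ Kokoro has been upgraded to v1.0! See Releases.

✨ You can now pip install kokoro! See Usage.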

Releases
Usage
SAMPLES.md ↗️
VOICES.md ↗️
Model information
Training details
Creative Commons Attribution
Acknowledgements

🚀 Quick start

To get started with Kokoro, first install it with pip.

pip install kokoro

You can then synthesize speech by following the code in the usage examples further down this page.

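✨ Key features

Lightweight architecture that delivers quality comparable to larger models.
Fast and cost-efficient.
Apache-licensed weights allow deployment anywhere.
Supports multiple languages and voices.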

📦 Installation

Kokoro can be installed with pip.

pip install kokoro
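
Before running the examples, it can help to confirm that the packages are importable in the current environment. The following is a small sketch that uses only the Python standard library; the helper name is_installed is made up for illustration and is not part of Kokoro.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # kokoro and soundfile are the two packages the usage example imports.
    for name in ("kokoro", "soundfile"):
        if is_installed(name):
            print(f"{name}: found")
        else:
            print(f"{name}: missing (try: pip install {name})")
```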

💻 Usage examples

Basic usage

# 1️⃣ Install kokoro
!pip install -q "kokoro>=0.3.4" soundfile  # quoted so the shell does not treat >= as a redirect
# 2️⃣ Install espeak, used for English OOD fallback and some non-English languages
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# 🇪🇸 'e' => Spanish es
# 🇫🇷 'f' => French fr-fr
# 🇮🇳 'h' => Hindi hi
# 🇮🇹 'i' => Italian it
# 🇧🇷 'p' => Brazilian Portuguese pt-br

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English, 🇬🇧 'b' => British English
# 🇯🇵 'j' => Japanese: pip install misaki[ja]
# 🇨🇳 'z' => Mandarin Chinese: pip install misaki[zh]
pipeline = KPipeline(lang_code='a') # <= make sure lang_code matches voice

# This text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.

[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
# text = '「もしおれがただ偶然、そしてこうしようというつもりでなくここに立っているのなら、ちょっとばかり絶望するところだな」と、そんなことが彼の頭に思い浮かんだ。'
# text = '中國人民不信邪也不怕邪，不惹事也不怕事，任何外國不要指望我們會拿自己的核心利益做交易，不要指望我們會吞下損害我國主權、安全、發展利益的苦果！'
# text = 'Los partidos políticos tradicionales compiten con los populismos y los movimientos asamblearios.'
# text = 'Le dromadaire resplendissant déambulait tranquillement dans les méandres en mastiquant de petites feuilles vernissées.'
# text = 'ट्रांसपोर्टरों की हड़ताल लगातार पांचवें दिन जारी, दिसंबर से इलेक्ट्रॉनिक टोल कलेक्शनल सिस्टम'
# text = "Allora cominciava l'insonnia, o un dormiveglia peggiore dell'insonnia, che talvolta assumeva i caratteri dell'incubo."
# text = 'Elabora relatórios de acompanhamento cronológico para as diferentes unidades do Departamento que propõem contratos.'

# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(
    text, voice='af_heart', # <= change voice here
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
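
The comment in step 3 asks that lang_code match the voice. In the example, the voice af_heart is paired with lang_code='a', which suggests that the first letter of a voice name is its language code. The sketch below treats that naming convention as an assumption (VOICES.md is the authoritative list). LANG_CODES is copied from the comments in the example, and lang_code_for_voice is a hypothetical helper, not part of the kokoro package.

```python
# Language codes as listed in the comments of the basic example.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def lang_code_for_voice(voice: str) -> str:
    """Guess the lang_code for a voice name such as 'af_heart'.

    Assumption: the first character of the voice name is the language code.
    """
    code = voice[:1].lower()
    if code not in LANG_CODES:
        raise ValueError(f"Cannot infer a language code from voice {voice!r}")
    return code


voice = "af_heart"
code = lang_code_for_voice(voice)
print(f"{voice}: lang_code={code!r} ({LANG_CODES[code]})")
# With kokoro installed, the pipeline could then be created as:
#     pipeline = KPipeline(lang_code=code)
```

Deriving the code from the voice name keeps the two settings from drifting apart when you switch voices.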

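Advanced usage

# In more advanced scenarios you can adjust parameters such as the voice, speed, and split pattern.
# This example reuses `pipeline` and `text` from the basic example with a different voice and speed.
generator = pipeline(
    text, voice='af_bella', # a different voice (see VOICES.md); it must match the pipeline's lang_code
    speed=1.5, split_pattern=r'\.\s+' # faster speech, split after sentence-ending periods
)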
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
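
The two examples differ mainly in split_pattern: r'\n+' breaks the text at line breaks, while r'\.\s+' breaks it after sentence-ending periods. The sketch below previews how each pattern would divide a text, using only the standard re module. It assumes the pipeline splits its input in a way comparable to re.split, which this page does not confirm; preview_chunks is a hypothetical helper.

```python
import re


def preview_chunks(text: str, split_pattern: str) -> list[str]:
    """Split `text` on `split_pattern` and drop empty pieces."""
    return [chunk.strip() for chunk in re.split(split_pattern, text) if chunk.strip()]


sample = "First sentence. Second sentence.\n\nA new paragraph."

by_lines = preview_chunks(sample, r"\n+")
by_sentences = preview_chunks(sample, r"\.\s+")

print(by_lines)      # two chunks, one per paragraph
print(by_sentences)  # three chunks, one per sentence
```

In this preview a pattern such as r'\.\s+' consumes the period itself, so every chunk except the last loses its final punctuation. Smaller chunks produce more, shorter audio files in the loop above.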

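📚 Documentation

Releases

Model	Published	Training data	Languages and voices	SHA256
v0.19	2024 Dec 25	<100 hours	1 language, 10 voices	`3b0c392f`
v1.0	2025 Jan 27	A few hundred hours	8 languages, 54 voices	`496dba11`

Training costs	v0.19	v1.0	Total
A100 80GB GPU hours	500 hours	500 hours	1000 hours
Average hourly rate	$0.80/hour	$1.20/hour	$1/hour
Total in USD	$400	$600	$1000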
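
The totals in the table follow from multiplying the GPU hours by the average hourly rate for each version. A quick check of that arithmetic, using only the figures in the table:

```python
# GPU hours and average hourly rate (USD) per version, taken from the table above.
runs = {
    "v0.19": {"gpu_hours": 500, "usd_per_hour": 0.80},
    "v1.0": {"gpu_hours": 500, "usd_per_hour": 1.20},
}

# Cost per version = GPU hours x hourly rate.
costs = {name: r["gpu_hours"] * r["usd_per_hour"] for name, r in runs.items()}
total_hours = sum(r["gpu_hours"] for r in runs.values())
total_cost = sum(costs.values())
average_rate = total_cost / total_hours  # blended rate across both runs

for name, cost in costs.items():
    print(f"{name}: ${cost:.0f}")
print(f"Total: {total_hours} GPU hours, ${total_cost:.0f}, ${average_rate:.2f}/hour on average")
```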

Model information

Attribute	Details
Model type	StyleTTS 2 (https://arxiv.org/abs/2306.07691) and ISTFTNet (https://arxiv.org/abs/2203.02395); decoder only: no diffusion, no encoder
Architected by	Li et al @ https://github.com/yl4579/StyleTTS2
Trained by	`@rzvzn` on Discord
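Supported languages	American English, British English, Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, Mandarin Chinese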
Model SHA256 hash	`496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`
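
The SHA256 hash in the table can be used to check that a downloaded weights file is intact. Below is a minimal sketch using Python's hashlib. The path in the commented usage line is a placeholder (this page does not name the weights file), and sha256_of_file and matches_expected are hypothetical helper names.

```python
import hashlib

# Full hash from the table above; the release table lists its first 8 characters.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the expected hash (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()


# Usage (placeholder path, replace with the actual weights file):
#     print(matches_expected("path/to/weights-file"))
```

Reading in chunks keeps memory use flat regardless of file size.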

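Training details

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio[1] generated by closed[2] TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  [2] No synthetic audio from open TTS models or "custom voice clones" was used.
Total dataset size: a few hundred hours of audio
Total training cost: about $1000 for 1000 hours of A100 80GB vRAM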