Kokoro 82M Light
Developed by ctranslate2-4you
A clone based on StyleTTS2-LJSpeech, optimized for English text-to-speech, with some dependencies removed to simplify deployment.
Downloads: 21
Released: 1/28/2025
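Model overview
This is a text-to-speech (TTS) model focused on producing high-quality English speech. Compared with the original, this repository removes several dependencies, which simplifies installation and use.
Model features
Fewer dependencies
Removes the munch and phonemizer dependencies and calls espeak directly, substantially reducing the number of dependencies.
English pronunciation improvements
Adds an expand_acronym() function to improve the pronunciation of certain words (such as NASA).
Lightweight deployment
Needs roughly 80 fewer dependencies than v1.0 while, by the author's own estimate, retaining about 98% of its quality.
Model capabilities
English text-to-speech
British English speech synthesis
Improved acronym pronunciation
Use cases
Speech synthesis
Audiobook generation
Converts English text into natural-sounding speech for audiobook production.
Produces speech output close to human pronunciation.
Voice assistants
Provides speech synthesis for English voice assistants.
Produces fluent, natural English voice responses.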
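🚀 Modified Kokoro v0.19 repository
This repository is a clone of the original Kokoro v0.19 repository with the following changes:
- Removed the munch dependency.
- Removed the phonemizer dependency.
  - espeak is called directly to perform the same phonemization. The espeak files must be on the system path or in the same directory as kokoro.py:
    - This repository includes the files Windows users need (espeak-ng.exe, libespeak-ng.dll, and espeak-ng-data); similar files for other platforms are available here.
- Added an expand_acronym() function to kokoro.py to improve pronunciation (for example, "NASA" → "N. A. S. A.").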
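As an illustration of the acronym expansion described in the last item, here is a minimal sketch. The function name `expand_acronyms_sketch` and the regular expression are assumptions made for this example; they are not taken from kokoro.py, whose actual rules may differ.

```python
import re

# Assumption: an "acronym" is any standalone run of two or more capital letters.
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def expand_acronyms_sketch(text: str) -> str:
    """Spell out all-caps words letter by letter, e.g. "NASA" -> "N. A. S. A."."""

    def spell_out(match: re.Match) -> str:
        # Turn each letter into "X." and join the letters with spaces.
        return " ".join(f"{letter}." for letter in match.group(0))

    return _ACRONYM.sub(spell_out, text)


print(expand_acronyms_sketch("NASA launched a probe"))
# prints: N. A. S. A. launched a probe
```

A rule this simple also spells out capitalized words that are normally read as words, so a real implementation would likely need an exception list.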
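🚀 Quick start
Reducing dependencies
The original v0.19 repository requires roughly 10 or more dependencies.
Kokoro version 1.0 additionally requires its custom misaki dependency, which pulls in roughly 80 more dependencies.
- Clearly, they are working toward refined phonemization and support for multiple languages, both of which are great goals.
- However, in my personal opinion, if the v1.0 model is taken as the 100% "gold standard" for quality, the v0.19 model still reaches 98%.
A 2% difference does not justify 80+ extra dependencies, which is why this repository exists.
Version | Additional dependencies |
---|---|
This repository (based on Kokoro v0.19) | - |
Original Kokoro v0.19 | roughly 10+ |
Kokoro v1.0 | roughly 80 |
One side effect is that this repository supports only American English and British English, but if that is all you need, avoiding roughly 80 extra dependencies is worth it.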
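To check dependency counts like these in your own environment, one option is to walk the requirement metadata of an installed package. The sketch below uses only the standard library. The helper name `transitive_requirements` is hypothetical, and the parsing is deliberately rough: it skips optional extras and ignores version and platform markers, so treat the result as an estimate.

```python
import re
from importlib import metadata


def transitive_requirements(package, seen=None):
    """Collect the names of packages that `package` requires, recursively.

    Only packages installed in the current environment can be inspected;
    anything that is not installed contributes no further requirements.
    """
    seen = set() if seen is None else seen
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return seen
    for requirement in requirements:
        if "extra ==" in requirement:
            continue  # optional extras are not installed by default
        match = re.match(r"[A-Za-z0-9._-]+", requirement)
        if match is None:
            continue
        name = match.group(0).lower()
        if name not in seen:
            seen.add(name)
            transitive_requirements(name, seen)
    return seen


# Example: count what an installed package pulls in (empty if not installed).
print(len(transitive_requirements("transformers")))
```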
Installation guide
- Install PyTorch, for example:
pip install https://download.pytorch.org/whl/cpu/torch-2.5.1%2Bcpu-cp311-cp311-win_amd64.whl#sha256=81531d4d5ca74163dc9574b87396531e546a60cceb6253303c7db6a21e867fdf
- Run pip install scipy numpy==1.26.4 transformers fsspec==2024.9.0
- Run pip install sounddevice (if you plan to use the sample script below; otherwise, install a similar library).
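Because phonemization depends on an espeak executable being reachable, it can help to verify that before running anything else. The sketch below looks next to the script and then on the system path, and shows one way to ask espeak-ng for IPA phonemes through a subprocess. The helper names are hypothetical and this is not necessarily how kokoro.py invokes espeak; the `-q`, `--ipa`, and `-v` options are standard espeak-ng command-line flags.

```python
import shutil
import subprocess
from pathlib import Path


def find_espeak(script_dir="."):
    """Return a path to an espeak executable, or None if none is found."""
    # Look next to the script first (the layout described for Windows users).
    for name in ("espeak-ng.exe", "espeak-ng"):
        candidate = Path(script_dir) / name
        if candidate.is_file():
            return str(candidate)
    # Otherwise fall back to the system PATH.
    return shutil.which("espeak-ng") or shutil.which("espeak")


def build_phonemize_command(executable, text, voice="en-us"):
    # -q: produce no audio, --ipa: print IPA phonemes, -v: select the voice.
    return [executable, "-q", "--ipa", "-v", voice, text]


def phonemize_with_espeak(text, voice="en-us", script_dir="."):
    """Return the IPA string espeak prints for `text` (assumed invocation)."""
    executable = find_espeak(script_dir)
    if executable is None:
        raise FileNotFoundError("espeak-ng was not found next to the script or on PATH")
    result = subprocess.run(
        build_phonemize_command(executable, text, voice),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()


print("espeak found at:", find_espeak())
```

For British voices, a value such as `en-gb` would be passed as `voice`; the exact voice names available depend on the installed espeak-ng data.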
💻 Usage examples
Basic usage
The following is a sample script that runs on the CPU:
import sys
import os
from pathlib import Path
import queue
import threading
import re
import logging

REPO_PATH = r"D:\Scripts\bench_tts\hexgrad--Kokoro-82M_original"
sys.path.append(REPO_PATH)

import torch
import warnings
from models import build_model
from kokoro import generate, generate_full, phonemize
import sounddevice as sd

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

VOICES = [
    'af',           # Default voice (50-50 mix of Bella & Sarah)
    'af_bella',     # Female voice "Bella"
    'af_sarah',     # Female voice "Sarah"
    'am_adam',      # Male voice "Adam"
    'am_michael',   # Male voice "Michael"
    'bf_emma',      # British Female "Emma"
    'bf_isabella',  # British Female "Isabella"
    'bm_george',    # British Male "George"
    'bm_lewis',     # British Male "Lewis"
    'af_nicole',    # Female voice "Nicole"
    'af_sky'        # Female voice "Sky"
]


class KokoroProcessor:
    def __init__(self):
        self.sentence_queue = queue.Queue()
        self.audio_queue = queue.Queue()
        self.stop_event = threading.Event()
        self.model = None
        self.voicepack = None
        self.voice_name = None

    def setup_kokoro(self, selected_voice):
        device = 'cpu'
        # device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Using device: {device}")
        model_path = os.path.join(REPO_PATH, 'kokoro-v0_19.pth')
        voices_path = os.path.join(REPO_PATH, 'voices')
        try:
            if not os.path.exists(model_path):
                raise FileNotFoundError(f"Model file not found at {model_path}")
            if not os.path.exists(voices_path):
                raise FileNotFoundError(f"Voices directory not found at {voices_path}")
            self.model = build_model(model_path, device)
            voicepack_path = os.path.join(voices_path, f'{selected_voice}.pt')
            self.voicepack = torch.load(voicepack_path, weights_only=True).to(device)
            self.voice_name = selected_voice
            print(f'Loaded voice: {selected_voice}')
            return True
        except Exception as e:
            print(f"Error during setup: {str(e)}")
            return False

    def generate_speech_for_sentence(self, sentence):
        try:
            # Basic generation (default settings)
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0])
            # Speed modifications (uncomment to test)
            # Slower speech
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.8)
            # Faster speech
            audio, phonemes = generate_full(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.3)
            # Very slow speech
            #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.5)
            # Very fast speech
            #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.8)
            # Force American accent
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='a', speed=1.0)
            # Force British accent
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='b', speed=1.0)
            return audio
        except Exception as e:
            print(f"Error generating speech for sentence: {str(e)}")
            print(f"Error type: {type(e)}")
            import traceback
            traceback.print_exc()
            return None

    def process_sentences(self):
        while not self.stop_event.is_set():
            try:
                sentence = self.sentence_queue.get(timeout=1)
                if sentence is None:
                    self.audio_queue.put(None)
                    break
                print(f"Processing sentence: {sentence}")
                audio = self.generate_speech_for_sentence(sentence)
                if audio is not None:
                    self.audio_queue.put(audio)
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Error in process_sentences: {str(e)}")
                continue

    def play_audio(self):
        while not self.stop_event.is_set():
            try:
                audio = self.audio_queue.get(timeout=1)
                if audio is None:
                    break
                sd.play(audio, 24000)
                sd.wait()
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Error in play_audio: {str(e)}")
                continue

    def process_and_play(self, text):
        sentences = [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]
        process_thread = threading.Thread(target=self.process_sentences)
        playback_thread = threading.Thread(target=self.play_audio)
        process_thread.daemon = True
        playback_thread.daemon = True
        process_thread.start()
        playback_thread.start()
        for sentence in sentences:
            self.sentence_queue.put(sentence)
        self.sentence_queue.put(None)
        process_thread.join()
        playback_thread.join()
        self.stop_event.set()


def main():
    # Default voice selection
    VOICE_NAME = VOICES[0] # 'af' - Default voice (Bella & Sarah mix)
    # Alternative voice selections (uncomment to test)
    #VOICE_NAME = VOICES[1] # 'af_bella' - Female American
    #VOICE_NAME = VOICES[2] # 'af_sarah' - Female American
    #VOICE_NAME = VOICES[3] # 'am_adam' - Male American
    #VOICE_NAME = VOICES[4] # 'am_michael' - Male American
    #VOICE_NAME = VOICES[5] # 'bf_emma' - Female British
    #VOICE_NAME = VOICES[6] # 'bf_isabella' - Female British
    VOICE_NAME = VOICES[7] # 'bm_george' - Male British
    # VOICE_NAME = VOICES[8] # 'bm_lewis' - Male British
    #VOICE_NAME = VOICES[9] # 'af_nicole' - Female American
    #VOICE_NAME = VOICES[10] # 'af_sky' - Female American
    processor = KokoroProcessor()
    if not processor.setup_kokoro(VOICE_NAME):
        return
    # test_text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
    # test_text = "This 2022 Edition of Georgia Juvenile Practice and Procedure is a complete guide to handling cases in the juvenile courts of Georgia. This handy, yet thorough, manual incorporates the revised Juvenile Code and makes all Georgia statutes and major cases regarding juvenile proceedings quickly accessible. Since last year's edition, new material has been added and/or existing material updated on the following subjects, among others:"
    # test_text = "See Ga. Code § 3925 (1863), now O.C.G.A. § 9-14-2; Ga. Code § 1744 (1863), now O.C.G.A. § 19-7-1; Ga. Code § 1745 (1863), now O.C.G.A. § 19-9-2; Ga. Code § 1746 (1863), now O.C.G.A. § 19-7-4; and Ga. Code § 3024 (1863), now O.C.G.A. § 19-7-4. For a full discussion of these provisions, see 27 Emory L. J. 195, 225–230, 232–233, 236–238 (1978). Note, however, that the journal article refers to the section numbers of the Code of 1910."
    # test_text = "It is impossible to understand modern juvenile procedure law without an appreciation of some fundamentals of historical development. The beginning point for study is around the beginning of the seventeenth century, when the pater patriae concept first appeared in English jurisprudence. As "father of the country," the Crown undertook the duty of caring for those citizens who were unable to care for themselves—lunatics, idiots, and, ultimately, infants. This concept, which evolved into the parens patriae doctrine, presupposed the Crown's power to intervene in the parent-child relationship in custody disputes in order to protect the child's welfare1 and, ultimately, to deflect a delinquent child from a life of crime. The earliest statutes premised upon the parens patriae doctrine concerned child custody matters. In 1863, when the first comprehensive Code of Georgia was enacted, two courts exercised some jurisdiction over questions of child custody: the superior court and the court of the ordinary (now probate court). In essence, the draftsmen of the Code simply compiled what was then the law as a result of judicial decisions and statutes. The Code of 1863 contained five provisions concerning the parentchild relationship: Two concerned the jurisdiction of the superior court and courts of ordinary in habeas corpus and forfeiture of parental rights actions, and the remaining three concerned the guardianship jurisdiction of the court of the ordinary"
    # test_text = "You are a helpful British butler who clearly and directly answers questions in a succinct fashion based on contexts provided to you. If you cannot find the answer within the contexts simply tell me that the contexts do not provide an answer. However, if the contexts partially address a question you answer based on what the contexts say and then briefly summarize the parts of the question that the contexts didn't provide an answer to. Also, you should be very respectful to the person asking the question and frequently offer traditional butler services like various fancy drinks, snacks, various butler services like shining of shoes, pressing of suites, and stuff like that. Also, if you can't answer the question at all based on the provided contexts, you should apologize profusely and beg to keep your job. Lastly, it is essential that if there are no contexts actually provided it means that a user's question wasn't relevant and you should state that you can't answer based off of the contexts because there are none. And it goes without saying you should refuse to answer any questions that are not directly answerable by the provided contexts. Moreover, some of the contexts might not have relevant information and you shoud simply ignore them and focus on only answering a user's question. I cannot emphasize enought that you must gear your answer towards using this program and based your response off of the contexts you receive."
    test_text = "According to OCGA § 15-11-145(a), the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care. However, if the 72-hour time frame expires on a weekend or legal holiday, the hearing should be held on the next business day that is not a weekend or holiday."
    processor.process_and_play(test_text)


if __name__ == "__main__":
    main()
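The script splits its input into sentences with the regular expression `[.!?;]+\s*` before queueing them for synthesis. The short sketch below isolates that one step so its behavior is easy to inspect; the function name `split_sentences` is only for illustration. Periods inside abbreviations also act as split points, which may matter for legal citations like the ones in the test texts.

```python
import re


def split_sentences(text):
    # Same expression as process_and_play(): split on runs of . ! ? ;
    # followed by optional whitespace, then drop empty fragments.
    return [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]


print(split_sentences("Hello there. How are you? Fine; thanks!"))
# prints: ['Hello there', 'How are you', 'Fine', 'thanks']

print(split_sentences("See O.C.G.A. now"))
# prints: ['See O', 'C', 'G', 'A', 'now']
```

If such fragments cause audible pauses, one option would be to expand or protect abbreviations before splitting.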
Advanced usage
The following code can be run in a single Google Colab cell:
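# 1️⃣ Install kokoro
!pip install -q kokoro soundfile
# 2️⃣ Install espeak, used as a fallback for out-of-dictionary words
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip the espeak installation, but without a fallback, out-of-dictionary words will be skipped
# 3️⃣ Initialize the pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # make sure lang_code matches the voice
# The text below is for demonstration purposes only and was not seen during training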
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''
# 4️⃣ Generate, display, and save audio files in a loop
generator = pipeline(
    text, voice='af_bella',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
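The loop above writes one WAV file per text segment. If a single output file is preferable, the segments can be concatenated. The sketch below does this with only the Python standard library, assuming each segment is a sequence of float samples in the range [-1.0, 1.0] at 24000 Hz (the rate used above). The helper name `write_wav` is hypothetical and is not part of kokoro or soundfile.

```python
import struct
import wave

SAMPLE_RATE = 24000  # same rate as passed to Audio(...) and sf.write(...)


def write_wav(path, segments, sample_rate=SAMPLE_RATE):
    """Concatenate float segments in [-1.0, 1.0] into one 16-bit mono WAV."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        for segment in segments:
            # Clip to the valid range, then scale to signed 16-bit integers.
            clipped = [max(-1.0, min(1.0, float(s))) for s in segment]
            frames = struct.pack(
                f"<{len(clipped)}h", *(int(s * 32767) for s in clipped)
            )
            wav_file.writeframes(frames)


# Usage with two tiny made-up segments (real segments would come from the
# generator loop, for example by collecting each `audio` into a list).
write_wav("combined.wav", [[0.0, 0.5, -0.5], [1.0, -1.0]])
```

For long outputs, converting sample by sample in pure Python is slow; with NumPy available, concatenating the arrays and writing them once with soundfile would be the more practical route.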
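📚 Detailed documentation
Original model card
🚨 This repository is under maintenance.
✨ Model v1.0 is being released! Although it is not yet finalized, you can start using v1.0 now.
✨ You can now install a dedicated inference library with pip install kokoro: https://github.com/hexgrad/kokoro
✨ You can also install a G2P library designed for Kokoro with pip install misaki: https://github.com/hexgrad/misaki
♻️ The legacy v0.19 files are available at https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19
❤️ Kokoro Discord server: https://discord.gg/QuGxSWBfQy
Kokoro is getting an upgrade!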
Model | Date | Training data | A100 80GB vRAM | GPU cost | Released voices | Released languages |
---|---|---|---|---|---|---|
v0.19 | December 25, 2024 | <100 hours | 500 hours | $400 | 10 | 1 |
v1.0 | January 27, 2025 | A few hundred hours | 1000 hours | $1000 | 26+ | ? |
Usage
The code in the Advanced usage section above can be run in Google Colab.
Model information
Property | Details |
---|---|
Architecture | StyleTTS 2: https://arxiv.org/abs/2306.07691; ISTFTNet: https://arxiv.org/abs/2203.02395; decoder only: no diffusion, no encoder release |
Architecture design | Li et al. @ https://github.com/yl4579/StyleTTS2 |
Trained by | @rzvzn (Discord) |
Supported languages | American English, British English |
Model SHA256 hash | 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4 |
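A published SHA256 hash lets you confirm that a downloaded weights file is intact. The sketch below computes a file's hash in chunks and compares it with the value from the table. The table does not say which file the hash belongs to, so the path in the commented usage line is a placeholder assumption, and the helper names are hypothetical.

```python
import hashlib

# Value copied from the "Model SHA256 hash" row of the table above.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large model files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    return sha256_of_file(path) == expected.lower()


# Placeholder path; replace it with the weights file you actually downloaded.
# print(matches_published_hash("path/to/model.pth"))
```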
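Training details
- Compute: roughly 1000 hours of A100 80GB vRAM, at a cost of roughly $1000.
- Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
  - Public domain audio
  - Audio licensed under Apache, MIT, etc.
  - Synthetic audio[1] generated by closed[2] TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  [2] No synthetic audio from open TTS models or "custom voice clones"
- Total dataset size: a few hundred hours of audio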
Creative Commons Attribution
The following CC BY audio was part of the dataset used to train Kokoro v1.0.
Audio data | Duration used | License | Added to training set |
---|---|---|---|
Koniwa tnc | <1 hour | CC BY 3.0 | v0.19 / November 22, 2024 |
SIWIS | <11 hours | CC BY 4.0 | v0.19 / November 22, 2024 |

📄 License
This project is licensed under the Apache 2.0 license.