Kokoro 82M Light
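Developed by ctranslate2-4you
A clone based on StyleTTS2-LJSpeech, optimized for English text-to-speech, with several dependencies removed to simplify deployment.
Downloads: 21
Released: 1/28/2025
Model overview
This is a text-to-speech (TTS) model focused on producing high-quality English speech. Compared with the original version, this repository removes several dependencies, which simplifies installation and use.
Model features
Fewer dependencies
The munch and phonemizer dependencies are removed and espeak is called directly, which significantly reduces the dependency count
English pronunciation improvements
An expand_acronym() function was added to improve the pronunciation of certain words (such as NASA)
Lightweight deployment
Roughly 80 fewer dependencies than v1.0, while retaining an estimated 98% of its quality (the author's own assessment)
Model capabilities
English text-to-speech
British English speech synthesis
Improved acronym pronunciation
Use cases
Speech synthesis
Audiobook generation
Converts English text into natural-sounding speech for audiobook production
Produces speech output close to human pronunciation
Voice assistants
Provides speech synthesis for English-language voice assistants
Fluent, natural English spoken responses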
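🚀 Modified Kokoro v0.19 repository
This repository is a clone of the original Kokoro v0.19 repository with the following modifications:
- Removed the munch dependency.
- Removed the phonemizer dependency; espeak is called directly instead.
  - espeak is used directly to perform the same phonemization.
  - The espeak files must be on the system path or in the same directory as kokoro.py:
    - This repository includes the files Windows users need (espeak-ng.exe, libespeak-ng.dll, and espeak-ng-data); users on other platforms can obtain the equivalent files separately.
- Added an expand_acronym() function to kokoro.py to improve pronunciation (for example, "NASA" → "N. A. S. A.").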
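The acronym handling described above can be illustrated with a short sketch. This is not the code from kokoro.py; the function name expand_acronym_sketch and the rule "any standalone run of two or more capital letters" are assumptions made for illustration only.

```python
import re

# Assumed rule: a standalone run of two or more capital letters is an acronym.
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def expand_acronym_sketch(text: str) -> str:
    """Spell out all-caps words letter by letter ("NASA" -> "N. A. S. A.")."""

    def spell_out(match: re.Match) -> str:
        # Put a period after every letter and join the letters with spaces.
        return " ".join(f"{letter}." for letter in match.group(0))

    return _ACRONYM.sub(spell_out, text)


print(expand_acronym_sketch("NASA launched a probe"))
# N. A. S. A. launched a probe
```

A rule this simple also rewrites all-caps words that are not acronyms, and an acronym at the end of a sentence ends up with a doubled period, so a real implementation would likely need extra checks.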
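🚀 Quick start
Fewer dependencies
The original v0.19 repository requires a little over 10 dependencies.
Kokoro version 1.0 now additionally requires its custom misaki dependency, which pulls in roughly 80 more dependencies.
- Clearly, the Kokoro developers are working toward better phonemization and support for multiple languages, both of which are great goals.
- However, in my personal opinion, if the v1.0 model is taken as the 100% "gold standard" for quality, the v0.19 model reaches about 98%.
A 2% difference does not justify more than 80 additional dependencies, which is why this repository exists.
Version | Additional dependencies |
---|---|
This repository (based on Kokoro v0.19) | - |
Original Kokoro v0.19 | A little over 10 |
Kokoro v1.0 | About 80 |
One side effect is that this repository supports only American and British English, but if that is all you need, avoiding about 80 extra dependencies is worth it.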
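To make the "call espeak directly" idea and the two supported accents concrete, here is a minimal sketch of invoking the espeak-ng command line for IPA output. The function names and the mapping from 'a'/'b' to the en-us/en-gb voices are assumptions for illustration; the actual code in kokoro.py may build its command differently.

```python
import shutil
import subprocess

# Assumed mapping from the language letter to an espeak-ng voice name.
ESPEAK_VOICES = {"a": "en-us", "b": "en-gb"}


def build_espeak_command(text, lang, exe="espeak-ng"):
    """Return the argument list for a quiet espeak-ng run that prints IPA."""
    if lang not in ESPEAK_VOICES:
        raise ValueError(f"unsupported language code: {lang!r}")
    # -q: produce no audio, --ipa: print IPA phonemes, -v: select the voice
    return [exe, "-q", "--ipa", "-v", ESPEAK_VOICES[lang], text]


def phonemize_sketch(text, lang="a"):
    """Run espeak-ng if it is on the system path; return None otherwise."""
    exe = shutil.which("espeak-ng")
    if exe is None:
        return None
    result = subprocess.run(
        build_espeak_command(text, lang, exe),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()


print(build_espeak_command("hello world", "b"))
```

Keeping command construction separate from execution makes the command easy to inspect on machines where espeak-ng is not installed.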
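Installation guide
- Install PyTorch. Example (CPU build for Python 3.11 on 64-bit Windows):
pip install https://download.pytorch.org/whl/cpu/torch-2.5.1%2Bcpu-cp311-cp311-win_amd64.whl#sha256=81531d4d5ca74163dc9574b87396531e546a60cceb6253303c7db6a21e867fdf
- Run
pip install scipy numpy==1.26.4 transformers fsspec==2024.9.0
- Run
pip install sounddevice
(if you plan to use the example script below; otherwise, install a similar library).
The total number of dependencies comes out roughly as follows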
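To check the total in your own environment after installing, one option is to count the installed distributions with the standard library. This is only a sketch; the helper name installed_distribution_names is made up for illustration, and the count covers everything in the active environment, not just the packages this repository needs.

```python
from importlib import metadata


def installed_distribution_names():
    """Return the sorted, lower-cased names of all installed distributions."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            names.add(name.lower())
    return sorted(names)


names = installed_distribution_names()
print(f"{len(names)} distributions installed")
```

Running it in a fresh virtual environment before and after the install steps gives a rough before/after comparison.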
💻 Usage examples
Basic usage
The following example script runs on CPU:
import sys
import os
from pathlib import Path
import queue
import threading
import re
import logging

REPO_PATH = r"D:\Scripts\bench_tts\hexgrad--Kokoro-82M_original"
sys.path.append(REPO_PATH)

import torch
import warnings
from models import build_model
from kokoro import generate, generate_full, phonemize
import sounddevice as sd

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

VOICES = [
    'af',          # Default voice (50-50 mix of Bella & Sarah)
    'af_bella',    # Female voice "Bella"
    'af_sarah',    # Female voice "Sarah"
    'am_adam',     # Male voice "Adam"
    'am_michael',  # Male voice "Michael"
    'bf_emma',     # British Female "Emma"
    'bf_isabella', # British Female "Isabella"
    'bm_george',   # British Male "George"
    'bm_lewis',    # British Male "Lewis"
    'af_nicole',   # Female voice "Nicole"
    'af_sky'       # Female voice "Sky"
]


class KokoroProcessor:
    def __init__(self):
        self.sentence_queue = queue.Queue()
        self.audio_queue = queue.Queue()
        self.stop_event = threading.Event()
        self.model = None
        self.voicepack = None
        self.voice_name = None

    def setup_kokoro(self, selected_voice):
        device = 'cpu'
        # device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f"Using device: {device}")
        model_path = os.path.join(REPO_PATH, 'kokoro-v0_19.pth')
        voices_path = os.path.join(REPO_PATH, 'voices')
        try:
            if not os.path.exists(model_path):
                raise FileNotFoundError(f"Model file not found at {model_path}")
            if not os.path.exists(voices_path):
                raise FileNotFoundError(f"Voices directory not found at {voices_path}")
            self.model = build_model(model_path, device)
            voicepack_path = os.path.join(voices_path, f'{selected_voice}.pt')
            self.voicepack = torch.load(voicepack_path, weights_only=True).to(device)
            self.voice_name = selected_voice
            print(f'Loaded voice: {selected_voice}')
            return True
        except Exception as e:
            print(f"Error during setup: {str(e)}")
            return False

    def generate_speech_for_sentence(self, sentence):
        try:
            # Basic generation (default settings)
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0])
            # Speed modifications (uncomment to test)
            # Slower speech
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.8)
            # Faster speech
            audio, phonemes = generate_full(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.3)
            # Very slow speech
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.5)
            # Very fast speech
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.8)
            # Force American accent
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='a', speed=1.0)
            # Force British accent
            # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='b', speed=1.0)
            return audio
        except Exception as e:
            print(f"Error generating speech for sentence: {str(e)}")
            print(f"Error type: {type(e)}")
            import traceback
            traceback.print_exc()
            return None

    def process_sentences(self):
        # Producer: turn queued sentences into audio until the None sentinel arrives
        while not self.stop_event.is_set():
            try:
                sentence = self.sentence_queue.get(timeout=1)
                if sentence is None:
                    self.audio_queue.put(None)
                    break
                print(f"Processing sentence: {sentence}")
                audio = self.generate_speech_for_sentence(sentence)
                if audio is not None:
                    self.audio_queue.put(audio)
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Error in process_sentences: {str(e)}")
                continue

    def play_audio(self):
        # Consumer: play each audio chunk at 24 kHz until the None sentinel arrives
        while not self.stop_event.is_set():
            try:
                audio = self.audio_queue.get(timeout=1)
                if audio is None:
                    break
                sd.play(audio, 24000)
                sd.wait()
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Error in play_audio: {str(e)}")
                continue

    def process_and_play(self, text):
        sentences = [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]
        process_thread = threading.Thread(target=self.process_sentences)
        playback_thread = threading.Thread(target=self.play_audio)
        process_thread.daemon = True
        playback_thread.daemon = True
        process_thread.start()
        playback_thread.start()
        for sentence in sentences:
            self.sentence_queue.put(sentence)
        self.sentence_queue.put(None)
        process_thread.join()
        playback_thread.join()
        self.stop_event.set()


def main():
    # Default voice selection
    VOICE_NAME = VOICES[0] # 'af' - Default voice (Bella & Sarah mix)
    # Alternative voice selections (uncomment to test)
    # VOICE_NAME = VOICES[1] # 'af_bella' - Female American
    # VOICE_NAME = VOICES[2] # 'af_sarah' - Female American
    # VOICE_NAME = VOICES[3] # 'am_adam' - Male American
    # VOICE_NAME = VOICES[4] # 'am_michael' - Male American
    # VOICE_NAME = VOICES[5] # 'bf_emma' - Female British
    # VOICE_NAME = VOICES[6] # 'bf_isabella' - Female British
    VOICE_NAME = VOICES[7] # 'bm_george' - Male British (overrides the default above)
    # VOICE_NAME = VOICES[8] # 'bm_lewis' - Male British
    # VOICE_NAME = VOICES[9] # 'af_nicole' - Female American
    # VOICE_NAME = VOICES[10] # 'af_sky' - Female American
    processor = KokoroProcessor()
    if not processor.setup_kokoro(VOICE_NAME):
        return
    # test_text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
    # test_text = "This 2022 Edition of Georgia Juvenile Practice and Procedure is a complete guide to handling cases in the juvenile courts of Georgia. This handy, yet thorough, manual incorporates the revised Juvenile Code and makes all Georgia statutes and major cases regarding juvenile proceedings quickly accessible. Since last year's edition, new material has been added and/or existing material updated on the following subjects, among others:"
    # test_text = "See Ga. Code § 3925 (1863), now O.C.G.A. § 9-14-2; Ga. Code § 1744 (1863), now O.C.G.A. § 19-7-1; Ga. Code § 1745 (1863), now O.C.G.A. § 19-9-2; Ga. Code § 1746 (1863), now O.C.G.A. § 19-7-4; and Ga. Code § 3024 (1863), now O.C.G.A. § 19-7-4. For a full discussion of these provisions, see 27 Emory L. J. 195, 225–230, 232–233, 236–238 (1978). Note, however, that the journal article refers to the section numbers of the Code of 1910."
    # test_text = "It is impossible to understand modern juvenile procedure law without an appreciation of some fundamentals of historical development. The beginning point for study is around the beginning of the seventeenth century, when the pater patriae concept first appeared in English jurisprudence. As "father of the country," the Crown undertook the duty of caring for those citizens who were unable to care for themselves—lunatics, idiots, and, ultimately, infants. This concept, which evolved into the parens patriae doctrine, presupposed the Crown's power to intervene in the parent-child relationship in custody disputes in order to protect the child's welfare1 and, ultimately, to deflect a delinquent child from a life of crime. The earliest statutes premised upon the parens patriae doctrine concerned child custody matters. In 1863, when the first comprehensive Code of Georgia was enacted, two courts exercised some jurisdiction over questions of child custody: the superior court and the court of the ordinary (now probate court). In essence, the draftsmen of the Code simply compiled what was then the law as a result of judicial decisions and statutes. The Code of 1863 contained five provisions concerning the parentchild relationship: Two concerned the jurisdiction of the superior court and courts of ordinary in habeas corpus and forfeiture of parental rights actions, and the remaining three concerned the guardianship jurisdiction of the court of the ordinary"
    # test_text = "You are a helpful British butler who clearly and directly answers questions in a succinct fashion based on contexts provided to you. If you cannot find the answer within the contexts simply tell me that the contexts do not provide an answer. However, if the contexts partially address a question you answer based on what the contexts say and then briefly summarize the parts of the question that the contexts didn't provide an answer to. Also, you should be very respectful to the person asking the question and frequently offer traditional butler services like various fancy drinks, snacks, various butler services like shining of shoes, pressing of suites, and stuff like that. Also, if you can't answer the question at all based on the provided contexts, you should apologize profusely and beg to keep your job. Lastly, it is essential that if there are no contexts actually provided it means that a user's question wasn't relevant and you should state that you can't answer based off of the contexts because there are none. And it goes without saying you should refuse to answer any questions that are not directly answerable by the provided contexts. Moreover, some of the contexts might not have relevant information and you shoud simply ignore them and focus on only answering a user's question. I cannot emphasize enought that you must gear your answer towards using this program and based your response off of the contexts you receive."
    test_text = "According to OCGA § 15-11-145(a), the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care. However, if the 72-hour time frame expires on a weekend or legal holiday, the hearing should be held on the next business day that is not a weekend or holiday."
    processor.process_and_play(test_text)


if __name__ == "__main__":
    main()
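The script splits its input into sentences with a single regular expression before queueing them. The sketch below isolates that one step so its behavior can be inspected without loading the model; the function name split_sentences is introduced here for illustration, while the pattern is the one used in process_and_play.

```python
import re


def split_sentences(text):
    """Split on runs of . ! ? ; and drop empty pieces, as process_and_play does."""
    return [s.strip() for s in re.split(r"[.!?;]+\s*", text) if s.strip()]


print(split_sentences("Hello there. How are you? Fine; thanks!"))
# ['Hello there', 'How are you', 'Fine', 'thanks']

print(split_sentences("See O.C.G.A. § 9-14-2."))
# ['See O', 'C', 'G', 'A', '§ 9-14-2']
```

The second call shows a limitation worth knowing about: dotted abbreviations are broken into one-letter fragments, so text full of legal citations may be synthesized in very short chunks.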
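Advanced usage
The following code can be run in a single Google Colab cell:
# 1️⃣ Install kokoro
!pip install -q kokoro soundfile
# 2️⃣ Install espeak, used as a fallback for out-of-dictionary words
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip the espeak installation, but out-of-dictionary words will be skipped unless you provide a fallback
# 3️⃣ Initialize the pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # Make sure lang_code matches the voice
# The following text is for demonstration purposes only and was not seen during training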
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''
# 4️⃣ Generate, display, and save the audio files in a loop
generator = pipeline(
    text, voice='af_bella',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # Save each audio file
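The loop above writes one file per segment. If a single file is preferred, the segments can be concatenated before writing. The sketch below does this with only the standard library, assuming each segment is an iterable of float samples in the range [-1, 1] at 24000 Hz; the function name write_wav is hypothetical, and for long audio a NumPy-based approach would be much faster.

```python
import io
import struct
import wave

SAMPLE_RATE = 24000  # sample rate used in the snippets above


def write_wav(target, segments, sample_rate=SAMPLE_RATE):
    """Concatenate float segments and write them as 16-bit mono PCM.

    target may be a file path or a writable binary file object.
    Returns the number of samples written.
    """
    frames = bytearray()
    for segment in segments:
        for sample in segment:
            # Clip to [-1, 1], then scale to the signed 16-bit range.
            clipped = max(-1.0, min(1.0, float(sample)))
            frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


buffer = io.BytesIO()
print(write_wav(buffer, [[0.0, 0.5], [-1.0, 2.0]]))
# 4
```

In the Colab loop, the audio values could be collected into a list and passed to write_wav once the generator is exhausted.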
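📚 Detailed documentation
Original model card
🚨 This repository is undergoing maintenance.
✨ Model v1.0 is being released! Although it is not yet finalized, you can start using v1.0 now.
✨ You can now install a dedicated inference library with pip install kokoro: https://github.com/hexgrad/kokoro
✨ You can also install a G2P library designed for Kokoro with pip install misaki: https://github.com/hexgrad/misaki
♻️ The legacy v0.19 files are available at https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19
❤️ Kokoro Discord server: https://discord.gg/QuGxSWBfQy
Kokoro is getting an upgrade!
Model | Date | Training data | A100 80GB vRAM | GPU cost | Released voices | Released languages |
---|---|---|---|---|---|---|
v0.19 | December 25, 2024 | <100 hours | 500 hours | $400 | 10 | 1 |
v1.0 | January 27, 2025 | A few hundred hours | 1000 hours | $1000 | 26+ | ? |
Usage
The code in the Advanced usage section above can be run in Google Colab.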
Model information
Attribute | Details |
---|---|
Model architecture | StyleTTS 2 (https://arxiv.org/abs/2306.07691) + ISTFTNet (https://arxiv.org/abs/2203.02395); decoder only: no diffusion, no encoder release |
Architecture design | Li et al. @ https://github.com/yl4579/StyleTTS2 |
Trained by | @rzvzn (Discord) |
Supported languages | American English, British English |
Model SHA256 hash | 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4 |
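The SHA256 hash in the table can be used to check a downloaded model file. The following is a minimal sketch using hashlib; the helper names are made up for illustration, and which file the published hash belongs to should be confirmed against the original model card before relying on the comparison.

```python
import hashlib

# Hash as published in the table above.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large model files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

A call such as matches_expected("path/to/model.pth"), where the path is a placeholder, returns False for a corrupted or different file.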
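Training details
- Compute: about 1000 hours of A100 80GB vRAM, at a cost of about $1000.
- Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
  - Public domain audio
  - Audio licensed under Apache, MIT, etc.
  - Synthetic audio[1] generated by closed[2] TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  [2] No synthetic audio from open TTS models or "custom voice clones"
- Total dataset size: a few hundred hours of audio
Creative Commons Attribution license
The following CC BY licensed audio was part of the dataset used to train Kokoro v1.0.
Audio data | Duration used | License | Added to training set |
---|---|---|---|
Koniwa tnc | <1 hour | CC BY 3.0 | v0.19 / November 22, 2024 |
SIWIS | <11 hours | CC BY 4.0 | v0.19 / November 22, 2024 |

📄 License
This project is licensed under Apache 2.0.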