Kokoro-82M-light open-source model: free to deploy, fast English text-to-speech

Kokoro 82M Light

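Developed by ctranslate2-4you

A clone based on StyleTTS2-LJSpeech, optimized for English text-to-speech, with some dependencies removed to simplify deployment.

Speech synthesis · English · License: Apache-2.0 · #LightweightTTS #EnglishSpeechSynthesis #ReducedDependencies

Downloads: 21

Released: 1/28/2025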

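Model overview

This is a text-to-speech (TTS) model focused on generating high-quality English speech. Compared with the original version, this repository removes some dependencies, which simplifies installation and use.

Model features

Reduced dependencies

Removes the munch and phonemizer dependencies and calls espeak directly, which significantly cuts the number of dependencies.

English pronunciation improvements

Adds an expand_acronym() function to improve the pronunciation of certain words (such as NASA).

Lightweight deployment

Needs roughly 80 fewer dependencies than v1.0 while, by the maintainer's own estimate, keeping about 98% of its quality.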

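Model capabilities

English text-to-speech

American and British English speech synthesis

Improved acronym pronunciation

Use cases

Speech synthesis

Audiobook generation

Converts English text into natural-sounding speech for audiobook production

Produces speech output close to human pronunciation

Voice assistants

Provides speech synthesis for English-language voice assistants

Fluent, natural English spoken responses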

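🚀 Modified Kokoro v0.19 repository

This repository is a clone of the original Kokoro v0.19 repository with the following modifications:

- Removed the munch dependency.
- Removed the phonemizer dependency in favor of calling espeak directly.
  - espeak is used directly to provide the same phonemization functionality.
  - The espeak files must be on the system path or in the same directory as kokoro.py.
  - This repository includes the files Windows users need (espeak-ng.exe, libespeak-ng.dll, and espeak-ng-data); similar files for other platforms are available via the link in the original README.
- Added an expand_acronym() function to kokoro.py to improve pronunciation (for example, "NASA" → "N. A. S. A.").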
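
The last two modifications can be illustrated with a minimal Python sketch. This is not the code from this repository's kokoro.py: the regular expression, the helper names find_espeak and espeak_ipa, and the choice of espeak-ng flags (-q, --ipa, -v) are assumptions made for illustration, and the real expand_acronym() may handle edge cases differently.

```python
import os
import re
import shutil
import subprocess
from pathlib import Path

# Assumption: an "acronym" is any standalone run of two or more capital letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def expand_acronym(text: str) -> str:
    """Spell out all-caps words letter by letter ("NASA" -> "N. A. S. A.")."""
    return ACRONYM_RE.sub(lambda m: " ".join(f"{ch}." for ch in m.group()), text)


def find_espeak(script_dir: Path) -> str:
    """Look for espeak-ng next to the script first, then on the system path."""
    name = "espeak-ng.exe" if os.name == "nt" else "espeak-ng"
    local = script_dir / name
    if local.is_file():
        return str(local)
    found = shutil.which("espeak-ng")
    if found is None:
        raise FileNotFoundError("espeak-ng not found next to the script or on PATH")
    return found


def espeak_ipa(text: str, voice: str = "en-us", exe: str = "espeak-ng") -> str:
    """Ask espeak-ng for IPA phonemes without playing any audio."""
    result = subprocess.run(
        [exe, "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(expand_acronym("NASA launched a probe."))
```

With espeak-ng available, espeak_ipa(expand_acronym(text), exe=find_espeak(Path.cwd())) would chain the two steps; passing voice="en-gb" selects British English.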

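🚀 Quick start

Reduced dependencies

The original v0.19 repository requires a little over 10 dependencies.
Kokoro v1.0 now additionally requires its custom misaki dependency, which in turn pulls in roughly 80 more dependencies.

Clearly, the Kokoro authors are working toward better phonemization and support for multiple languages, which are great goals.
However, in my personal opinion, if the v1.0 model is taken as the 100% "gold standard" for quality, the v0.19 model reaches about 98%.

A 2% difference does not justify 80+ additional dependencies, hence this repository.

Version	Additional dependencies
This repository (based on Kokoro v0.19)	-
Original Kokoro v0.19	A little over 10
Kokoro v1.0	About 80

One side effect is that this repository supports only American and British English, but if that is all you need, avoiding roughly 80 extra dependencies is worth it.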

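Installation guide

Download this repository.
Create and activate a virtual environment, then install the torch build for either CPU or CUDA.

Example (CPU build of torch 2.5.1 for Python 3.11 on 64-bit Windows):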

pip install https://download.pytorch.org/whl/cpu/torch-2.5.1%2Bcpu-cp311-cp311-win_amd64.whl#sha256=81531d4d5ca74163dc9574b87396531e546a60cceb6253303c7db6a21e867fdf

Run pip install scipy numpy==1.26.4 transformers fsspec==2024.9.0.
Run pip install sounddevice (if you plan to use the example script below; otherwise, install a similar audio library).
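
After these steps, a short script can confirm which of the packages named above are installed in the active environment. This is an optional sketch using only the standard library; the helper name installed_versions is made up for illustration.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = ["torch", "scipy", "numpy", "transformers", "fsspec", "sounddevice"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```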

The total number of dependencies is roughly as follows:

[Image: total dependency count]

💻 Usage examples

Basic usage

The following example script runs on the CPU:

import sys
import os
from pathlib import Path
import queue
import threading
import re
import logging

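# Local repository folder containing kokoro.py, models.py, kokoro-v0_19.pth, and voices/
# (adjust this path to your own location)
REPO_PATH = r"D:\Scripts\bench_tts\hexgrad--Kokoro-82M_original"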

sys.path.append(REPO_PATH)

import torch
import warnings
from models import build_model
from kokoro import generate, generate_full, phonemize
import sounddevice as sd

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

VOICES = [
   'af',        # Default voice (50-50 mix of Bella & Sarah)
   'af_bella',  # Female voice "Bella"
   'af_sarah',  # Female voice "Sarah"
   'am_adam',   # Male voice "Adam"
   'am_michael',# Male voice "Michael"
   'bf_emma',   # British Female "Emma"
   'bf_isabella',# British Female "Isabella"
   'bm_george', # British Male "George"
   'bm_lewis',  # British Male "Lewis"
   'af_nicole', # Female voice "Nicole"
   'af_sky'     # Female voice "Sky"
]

class KokoroProcessor:
   def __init__(self):
       self.sentence_queue = queue.Queue()
       self.audio_queue = queue.Queue()
       self.stop_event = threading.Event()
       self.model = None
       self.voicepack = None
       self.voice_name = None

   def setup_kokoro(self, selected_voice):
       device = 'cpu'
       # device = 'cuda' if torch.cuda.is_available() else 'cpu'
       print(f"Using device: {device}")

       model_path = os.path.join(REPO_PATH, 'kokoro-v0_19.pth')
       voices_path = os.path.join(REPO_PATH, 'voices')

       try:
           if not os.path.exists(model_path):
               raise FileNotFoundError(f"Model file not found at {model_path}")
           if not os.path.exists(voices_path):
               raise FileNotFoundError(f"Voices directory not found at {voices_path}")
           
           self.model = build_model(model_path, device)
           
           voicepack_path = os.path.join(voices_path, f'{selected_voice}.pt')
           self.voicepack = torch.load(voicepack_path, weights_only=True).to(device)
           self.voice_name = selected_voice
           print(f'Loaded voice: {selected_voice}')
           
           return True
           
       except Exception as e:
           print(f"Error during setup: {str(e)}")
           return False

   def generate_speech_for_sentence(self, sentence):
       try:
           # Basic generation (default settings)
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0])

           # Speed modifications (uncomment to test)
           # Slower speech
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.8)

           # Faster speech
           audio, phonemes = generate_full(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.3)

           # Very slow speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.5)

           # Very fast speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.8)

           # Force American accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='a', speed=1.0)

           # Force British accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='b', speed=1.0)

           return audio

       except Exception as e:
           print(f"Error generating speech for sentence: {str(e)}")
           print(f"Error type: {type(e)}")
           import traceback
           traceback.print_exc()
           return None

   def process_sentences(self):
       while not self.stop_event.is_set():
           try:
               sentence = self.sentence_queue.get(timeout=1)
               if sentence is None:
                   self.audio_queue.put(None)
                   break

               print(f"Processing sentence: {sentence}")
               audio = self.generate_speech_for_sentence(sentence)
               if audio is not None:
                   self.audio_queue.put(audio)

           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in process_sentences: {str(e)}")
               continue

   def play_audio(self):
       while not self.stop_event.is_set():
           try:
               audio = self.audio_queue.get(timeout=1)
               if audio is None:
                   break
                   
               sd.play(audio, 24000)
               sd.wait()
               
           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in play_audio: {str(e)}")
               continue

   def process_and_play(self, text):
       sentences = [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]

       process_thread = threading.Thread(target=self.process_sentences)
       playback_thread = threading.Thread(target=self.play_audio)
       
       process_thread.daemon = True
       playback_thread.daemon = True
       
       process_thread.start()
       playback_thread.start()

       for sentence in sentences:
           self.sentence_queue.put(sentence)

       self.sentence_queue.put(None)

       process_thread.join()
       playback_thread.join()

       self.stop_event.set()

def main():
   # Default voice selection (overridden by the uncommented alternative below)
   VOICE_NAME = VOICES[0]  # 'af' - Default voice (Bella & Sarah mix)
   
   # Alternative voice selections (uncomment to test)
   #VOICE_NAME = VOICES[1]  # 'af_bella' - Female American
   #VOICE_NAME = VOICES[2]  # 'af_sarah' - Female American
   #VOICE_NAME = VOICES[3]  # 'am_adam' - Male American
   #VOICE_NAME = VOICES[4]  # 'am_michael' - Male American
   #VOICE_NAME = VOICES[5]  # 'bf_emma' - Female British
   #VOICE_NAME = VOICES[6]  # 'bf_isabella' - Female British
   VOICE_NAME = VOICES[7]  # 'bm_george' - Male British
   # VOICE_NAME = VOICES[8]  # 'bm_lewis' - Male British
   #VOICE_NAME = VOICES[9]  # 'af_nicole' - Female American
   #VOICE_NAME = VOICES[10] # 'af_sky' - Female American

   processor = KokoroProcessor()
   if not processor.setup_kokoro(VOICE_NAME):
       return
   
   # test_text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
   # test_text = "This 2022 Edition of Georgia Juvenile Practice and Procedure is a complete guide to handling cases in the juvenile courts of Georgia. This handy, yet thorough, manual incorporates the revised Juvenile Code and makes all Georgia statutes and major cases regarding juvenile proceedings quickly accessible. Since last year's edition, new material has been added and/or existing material updated on the following subjects, among others:"
   # test_text = "See Ga. Code § 3925 (1863), now O.C.G.A. § 9-14-2; Ga. Code § 1744 (1863), now O.C.G.A. § 19-7-1; Ga. Code § 1745 (1863), now O.C.G.A. § 19-9-2; Ga. Code § 1746 (1863), now O.C.G.A. § 19-7-4; and Ga. Code § 3024 (1863), now O.C.G.A. § 19-7-4. For a full discussion of these provisions, see 27 Emory L. J. 195, 225–230, 232–233, 236–238 (1978). Note, however, that the journal article refers to the section numbers of the Code of 1910."

   # test_text = "It is impossible to understand modern juvenile procedure law without an appreciation of some fundamentals of historical development. The beginning point for study is around the beginning of the seventeenth century, when the pater patriae concept first appeared in English jurisprudence. As "father of the country," the Crown undertook the duty of caring for those citizens who were unable to care for themselves—lunatics, idiots, and, ultimately, infants. This concept, which evolved into the parens patriae doctrine, presupposed the Crown's power to intervene in the parent-child relationship in custody disputes in order to protect the child's welfare1 and, ultimately, to deflect a delinquent child from a life of crime. The earliest statutes premised upon the parens patriae doctrine concerned child custody matters. In 1863, when the first comprehensive Code of Georgia was enacted, two courts exercised some jurisdiction over questions of child custody: the superior court and the court of the ordinary (now probate court). In essence, the draftsmen of the Code simply compiled what was then the law as a result of judicial decisions and statutes. The Code of 1863 contained five provisions concerning the parentchild relationship: Two concerned the jurisdiction of the superior court and courts of ordinary in habeas corpus and forfeiture of parental rights actions, and the remaining three concerned the guardianship jurisdiction of the court of the ordinary"

   # test_text = "You are a helpful British butler who clearly and directly answers questions in a succinct fashion based on contexts provided to you. If you cannot find the answer within the contexts simply tell me that the contexts do not provide an answer. However, if the contexts partially address a question you answer based on what the contexts say and then briefly summarize the parts of the question that the contexts didn't provide an answer to.  Also, you should be very respectful to the person asking the question and frequently offer traditional butler services like various fancy drinks, snacks, various butler services like shining of shoes, pressing of suites, and stuff like that. Also, if you can't answer the question at all based on the provided contexts, you should apologize profusely and beg to keep your job.  Lastly, it is essential that if there are no contexts actually provided it means that a user's question wasn't relevant and you should state that you can't answer based off of the contexts because there are none.  And it goes without saying you should refuse to answer any questions that are not directly answerable by the provided contexts.  Moreover, some of the contexts might not have relevant information and you shoud simply ignore them and focus on only answering a user's question.  I cannot emphasize enought that you must gear your answer towards using this program and based your response off of the contexts you receive."
   test_text = "According to OCGA § 15-11-145(a), the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care. However, if the 72-hour time frame expires on a weekend or legal holiday, the hearing should be held on the next business day that is not a weekend or holiday."

   processor.process_and_play(test_text)

if __name__ == "__main__":
   main()
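
The script above splits its input with the regular expression in process_and_play() before queuing each piece for synthesis. The standalone sketch below reproduces just that step so its behavior is easy to inspect; split_sentences is a hypothetical helper name, not a function in this repository.

```python
import re

# Same pattern as in process_and_play(): split on runs of . ! ? ;
# together with any whitespace that follows them.
SPLIT_RE = re.compile(r'[.!?;]+\s*')


def split_sentences(text: str) -> list[str]:
    """Return the non-empty pieces of text between sentence-ending punctuation."""
    return [s.strip() for s in SPLIT_RE.split(text) if s.strip()]


if __name__ == "__main__":
    print(split_sentences("How could I know? It's an unanswerable question."))
    # Abbreviations are split too, because every period counts as a boundary.
    print(split_sentences("See Ga. Code for details."))
```

Note that the punctuation itself is discarded and that abbreviations such as "Ga." are treated as sentence ends, so text full of legal citations may be synthesized in short fragments.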

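Advanced usage

The following code can be run in a single cell in Google Colab. It uses the pip-installable kokoro inference library (KPipeline) rather than the kokoro.py module in this repository: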

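# 1️⃣ Install kokoro
!pip install -q kokoro soundfile
# 2️⃣ Install espeak, used as a fallback for out-of-vocabulary words
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip the espeak installation, but out-of-vocabulary words will be skipped unless you provide a fallback

# 3️⃣ Initialize the pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # Make sure lang_code matches the voice

# The following text is for demonstration purposes only and was not seen during training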
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''

# 4️⃣ Generate, display, and save the audio files in a loop
generator = pipeline(
    text, voice='af_bella',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # Save each audio file

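📚 Detailed documentation

Original model card

🚨 This repository is undergoing maintenance.

✨ Model v1.0 is being released! Although it is not yet finalized, you can start using v1.0 now.

✨ You can now install a dedicated inference library with pip install kokoro: https://github.com/hexgrad/kokoro

✨ You can also install a G2P library designed for Kokoro with pip install misaki: https://github.com/hexgrad/misaki

♻️ Legacy v0.19 files: https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19

❤️ Kokoro Discord server: https://discord.gg/QuGxSWBfQy

Kokoro is getting an upgrade!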

Model	Date	Training data	A100 80GB vRAM	GPU cost	Released voices	Released languages
v0.19	December 25, 2024	<100 hours	500 hours	$400	10	1
v1.0	January 27, 2025	A few hundred hours	1000 hours	$1000	26+	?

Usage

The code in the Advanced usage section above can be run in Google Colab.

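Model information

Property	Details
Architecture	StyleTTS 2 (https://arxiv.org/abs/2306.07691) with ISTFTNet (https://arxiv.org/abs/2203.02395); decoder only: no diffusion, no encoder release
Architecture designed by	Li et al. @ https://github.com/yl4579/StyleTTS2
Trained by	`@rzvzn` (Discord)
Supported languages	American English, British English
Model SHA256 hash	`496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`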
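
A downloaded model file can be compared against the hash in the table using Python's standard hashlib module. This is a sketch: the file name below is a placeholder, since the table does not state which file the hash belongs to, and sha256_of_file is a helper written for this example.

```python
import hashlib
from pathlib import Path

# Hash copied from the model information table above.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    model_file = Path("model.pth")  # placeholder path; point this at your download
    if model_file.is_file():
        actual = sha256_of_file(model_file)
        print("match" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
    else:
        print(f"{model_file} not found")
```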

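Training details

Compute: about 1000 hours of A100 80GB vRAM, at a cost of about $1000.
Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio^[1] generated by closed^[2] TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  [2] No synthetic audio from open TTS models or "custom voice clones"
Total dataset size: a few hundred hours of audio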