Kokoro-82M-light open-source model: free to deploy, fast English text-to-speech

Kokoro 82M Light

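Developed by ctranslate2-4you

A clone based on StyleTTS2-LJSpeech, tuned for English text-to-speech, with some dependencies removed to simplify deployment.

Speech synthesis | English | License: Apache-2.0 | #LightweightTTS #EnglishSpeechSynthesis #SlimDependencies

Downloads: 21

Released: 1/28/2025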

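Model overview

A text-to-speech (TTS) model focused on generating high-quality English speech. Compared with the original version, this repository removes some dependencies, which simplifies installation and use.

Model features

Slimmed-down dependencies

The munch and phonemizer dependencies are removed and espeak is called directly, which significantly reduces the number of dependencies.

English pronunciation tuning

An expand_acronym() function was added to improve the pronunciation of certain words, such as NASA.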
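
The repository's actual expand_acronym() is not reproduced on this page. As a rough illustration only, the sketch below shows one way such a function could work, assuming an acronym is any standalone run of two or more capital letters. The name expand_acronym_sketch and the regular expression are hypothetical, not taken from kokoro.py.

```python
import re

# Hypothetical pattern: a standalone run of two or more capital letters.
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def expand_acronym_sketch(text: str) -> str:
    """Spell out all-caps words letter by letter, e.g. NASA -> N. A. S. A."""

    def spell(match: re.Match) -> str:
        # Put a period after every letter and join the letters with spaces.
        return " ".join(f"{letter}." for letter in match.group(0))

    return _ACRONYM.sub(spell, text)


print(expand_acronym_sketch("NASA launched a probe"))
# prints: N. A. S. A. launched a probe
```

A pattern this simple would also expand words written in capitals for emphasis, so a real implementation may need an allow-list or other rules.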

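Lightweight deployment

About 80 fewer dependencies than v1.0, which simplifies deployment while retaining, by the author's own estimate, roughly 98% of the quality.

Model capabilities

English text-to-speech

American and British English speech synthesis

Improved acronym pronunciation

Use cases

Speech synthesis

Audiobook generation

Convert English text into natural speech for audiobook production.

Produces speech output close to human pronunciation.

Voice assistants

Provide speech synthesis for English-language voice assistants.

Fluent, natural spoken responses in English.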

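🚀 Modified Kokoro v0.19 repository

This repository is a clone of the original Kokoro v0.19 repository with the following modifications:

Removed the munch dependency.
Removed the phonemizer dependency; espeak is called directly instead.
- espeak is used directly to perform the same phonemization.
- The espeak files must be on the system path or in the same directory as kokoro.py:
  - This repository includes the files Windows users need (espeak-ng.exe, libespeak-ng.dll, and espeak-ng-data); equivalent files for other platforms must be obtained separately.
Added an expand_acronym() function to kokoro.py to improve pronunciation (for example, "NASA" → "N. A. S. A.").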
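
How kokoro.py locates and invokes espeak is not shown here. The following is a minimal sketch under stated assumptions: the helper names (find_espeak, build_espeak_command, phonemize_sketch), the mapping from the 'a' and 'b' language codes to the espeak-ng voices en-us and en-gb, and the choice of the -q, --ipa, and -v flags are illustrative and may differ from what the repository actually does.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# Assumed mapping from the one-letter language codes to espeak-ng voice names.
ESPEAK_VOICES = {"a": "en-us", "b": "en-gb"}


def find_espeak(script_dir: Path) -> Optional[str]:
    """Look for espeak-ng next to the script first, then on the system PATH."""
    for name in ("espeak-ng.exe", "espeak-ng"):
        candidate = script_dir / name
        if candidate.is_file():
            return str(candidate)
    return shutil.which("espeak-ng")


def build_espeak_command(executable: str, text: str, lang: str) -> List[str]:
    """Build a call that prints IPA phonemes (-q suppresses audio output)."""
    return [executable, "-q", "--ipa", "-v", ESPEAK_VOICES[lang], text]


def phonemize_sketch(text: str, lang: str = "a",
                     script_dir: Path = Path(".")) -> str:
    """Return the IPA transcription that espeak-ng prints for the given text."""
    executable = find_espeak(script_dir)
    if executable is None:
        raise FileNotFoundError(
            "espeak-ng was not found next to the script or on PATH")
    result = subprocess.run(
        build_espeak_command(executable, text, lang),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()
```

Calling phonemize_sketch("Hello world", lang="b") returns a phoneme string only when espeak-ng is actually installed; otherwise it raises FileNotFoundError.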

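🚀 Quick start

Reduced dependencies

The original v0.19 repository requires roughly 10 or more dependencies.
Kokoro version 1.0 additionally requires its custom misaki dependency, which pulls in about 80 more.

Clearly, the Kokoro developers are working toward better phonemization and support for multiple languages, both of which are great goals.
However, in my personal opinion, if the v1.0 model is taken as the 100% "gold standard" for quality, the v0.19 model still reaches about 98%.

A 2% difference does not justify 80-plus additional dependencies, hence this repository.

Version	Additional dependencies
This repository (based on Kokoro v0.19)	-
Original Kokoro v0.19	About 10 or more
Kokoro v1.0	About 80

One side effect is that this repository supports only American and British English, but if that is all you need, avoiding about 80 extra dependencies is worth it.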

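Installation

Download this repository.
Create and activate a virtual environment, then install the torch build for either CPU or CUDA.

Example (the CPU wheel for Python 3.11 on 64-bit Windows):

pip install https://download.pytorch.org/whl/cpu/torch-2.5.1%2Bcpu-cp311-cp311-win_amd64.whl#sha256=81531d4d5ca74163dc9574b87396531e546a60cceb6253303c7db6a21e867fdf

Run pip install scipy numpy==1.26.4 transformers fsspec==2024.9.0.
Run pip install sounddevice (if you plan to use the example script below; otherwise, install a similar library).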
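
To confirm that the steps above worked, a quick check like the one below can help. It is only a sketch: the list of import names is taken from the pip commands above, and missing_packages is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names of the packages installed in the steps above.
REQUIRED = ["torch", "scipy", "numpy", "transformers", "fsspec", "sounddevice"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it inside the activated virtual environment; anything it lists still needs to be installed.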

The total number of dependencies is roughly as follows:

[Image: total dependency count]

💻 Usage examples

Basic usage

The following example script runs on the CPU:

import sys
import os
from pathlib import Path
import queue
import threading
import re
import logging

REPO_PATH = r"D:\Scripts\bench_tts\hexgrad--Kokoro-82M_original"

sys.path.append(REPO_PATH)

import torch
import warnings
from models import build_model
from kokoro import generate, generate_full, phonemize
import sounddevice as sd

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

VOICES = [
   'af',        # Default voice (50-50 mix of Bella & Sarah)
   'af_bella',  # Female voice "Bella"
   'af_sarah',  # Female voice "Sarah"
   'am_adam',   # Male voice "Adam"
   'am_michael',# Male voice "Michael"
   'bf_emma',   # British Female "Emma"
   'bf_isabella',# British Female "Isabella"
   'bm_george', # British Male "George"
   'bm_lewis',  # British Male "Lewis"
   'af_nicole', # Female voice "Nicole"
   'af_sky'     # Female voice "Sky"
]

class KokoroProcessor:
   def __init__(self):
       self.sentence_queue = queue.Queue()
       self.audio_queue = queue.Queue()
       self.stop_event = threading.Event()
       self.model = None
       self.voicepack = None
       self.voice_name = None

   def setup_kokoro(self, selected_voice):
       device = 'cpu'
       # device = 'cuda' if torch.cuda.is_available() else 'cpu'
       print(f"Using device: {device}")

       model_path = os.path.join(REPO_PATH, 'kokoro-v0_19.pth')
       voices_path = os.path.join(REPO_PATH, 'voices')

       try:
           if not os.path.exists(model_path):
               raise FileNotFoundError(f"Model file not found at {model_path}")
           if not os.path.exists(voices_path):
               raise FileNotFoundError(f"Voices directory not found at {voices_path}")
           
           self.model = build_model(model_path, device)
           
           voicepack_path = os.path.join(voices_path, f'{selected_voice}.pt')
           self.voicepack = torch.load(voicepack_path, weights_only=True).to(device)
           self.voice_name = selected_voice
           print(f'Loaded voice: {selected_voice}')
           
           return True
           
       except Exception as e:
           print(f"Error during setup: {str(e)}")
           return False

   def generate_speech_for_sentence(self, sentence):
       try:
           # Basic generation (default settings)
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0])

           # Speed modifications (uncomment to test)
           # Slower speech
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.8)

           # Faster speech
           audio, phonemes = generate_full(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.3)

           # Very slow speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.5)

           # Very fast speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.8)

           # Force American accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='a', speed=1.0)

           # Force British accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='b', speed=1.0)

           return audio

       except Exception as e:
           print(f"Error generating speech for sentence: {str(e)}")
           print(f"Error type: {type(e)}")
           import traceback
           traceback.print_exc()
           return None

   def process_sentences(self):
       while not self.stop_event.is_set():
           try:
               sentence = self.sentence_queue.get(timeout=1)
               if sentence is None:
                   self.audio_queue.put(None)
                   break

               print(f"Processing sentence: {sentence}")
               audio = self.generate_speech_for_sentence(sentence)
               if audio is not None:
                   self.audio_queue.put(audio)

           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in process_sentences: {str(e)}")
               continue

   def play_audio(self):
       while not self.stop_event.is_set():
           try:
               audio = self.audio_queue.get(timeout=1)
               if audio is None:
                   break
                   
               sd.play(audio, 24000)
               sd.wait()
               
           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in play_audio: {str(e)}")
               continue

   def process_and_play(self, text):
       sentences = [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]

       process_thread = threading.Thread(target=self.process_sentences)
       playback_thread = threading.Thread(target=self.play_audio)
       
       process_thread.daemon = True
       playback_thread.daemon = True
       
       process_thread.start()
       playback_thread.start()

       for sentence in sentences:
           self.sentence_queue.put(sentence)

       self.sentence_queue.put(None)

       process_thread.join()
       playback_thread.join()

       self.stop_event.set()

def main():
   # Default voice selection
   VOICE_NAME = VOICES[0]  # 'af' - Default voice (Bella & Sarah mix)
   
   # Alternative voice selections (uncomment to test)
   #VOICE_NAME = VOICES[1]  # 'af_bella' - Female American
   #VOICE_NAME = VOICES[2]  # 'af_sarah' - Female American
   #VOICE_NAME = VOICES[3]  # 'am_adam' - Male American
   #VOICE_NAME = VOICES[4]  # 'am_michael' - Male American
   #VOICE_NAME = VOICES[5]  # 'bf_emma' - Female British
   #VOICE_NAME = VOICES[6]  # 'bf_isabella' - Female British
   VOICE_NAME = VOICES[7]  # 'bm_george' - Male British (active selection; overrides the default above)
   # VOICE_NAME = VOICES[8]  # 'bm_lewis' - Male British
   #VOICE_NAME = VOICES[9]  # 'af_nicole' - Female American
   #VOICE_NAME = VOICES[10] # 'af_sky' - Female American

   processor = KokoroProcessor()
   if not processor.setup_kokoro(VOICE_NAME):
       return
   
   # test_text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
   # test_text = "This 2022 Edition of Georgia Juvenile Practice and Procedure is a complete guide to handling cases in the juvenile courts of Georgia. This handy, yet thorough, manual incorporates the revised Juvenile Code and makes all Georgia statutes and major cases regarding juvenile proceedings quickly accessible. Since last year's edition, new material has been added and/or existing material updated on the following subjects, among others:"
   # test_text = "See Ga. Code § 3925 (1863), now O.C.G.A. § 9-14-2; Ga. Code § 1744 (1863), now O.C.G.A. § 19-7-1; Ga. Code § 1745 (1863), now O.C.G.A. § 19-9-2; Ga. Code § 1746 (1863), now O.C.G.A. § 19-7-4; and Ga. Code § 3024 (1863), now O.C.G.A. § 19-7-4. For a full discussion of these provisions, see 27 Emory L. J. 195, 225–230, 232–233, 236–238 (1978). Note, however, that the journal article refers to the section numbers of the Code of 1910."

   # test_text = "It is impossible to understand modern juvenile procedure law without an appreciation of some fundamentals of historical development. The beginning point for study is around the beginning of the seventeenth century, when the pater patriae concept first appeared in English jurisprudence. As \"father of the country,\" the Crown undertook the duty of caring for those citizens who were unable to care for themselves: lunatics, idiots, and, ultimately, infants. This concept, which evolved into the parens patriae doctrine, presupposed the Crown's power to intervene in the parent-child relationship in custody disputes in order to protect the child's welfare and, ultimately, to deflect a delinquent child from a life of crime. The earliest statutes premised upon the parens patriae doctrine concerned child custody matters. In 1863, when the first comprehensive Code of Georgia was enacted, two courts exercised some jurisdiction over questions of child custody: the superior court and the court of the ordinary (now probate court). In essence, the draftsmen of the Code simply compiled what was then the law as a result of judicial decisions and statutes. The Code of 1863 contained five provisions concerning the parent-child relationship: Two concerned the jurisdiction of the superior court and courts of ordinary in habeas corpus and forfeiture of parental rights actions, and the remaining three concerned the guardianship jurisdiction of the court of the ordinary"

   # test_text = "You are a helpful British butler who clearly and directly answers questions in a succinct fashion based on contexts provided to you. If you cannot find the answer within the contexts simply tell me that the contexts do not provide an answer. However, if the contexts partially address a question you answer based on what the contexts say and then briefly summarize the parts of the question that the contexts didn't provide an answer to.  Also, you should be very respectful to the person asking the question and frequently offer traditional butler services like various fancy drinks, snacks, various butler services like shining of shoes, pressing of suits, and stuff like that. Also, if you can't answer the question at all based on the provided contexts, you should apologize profusely and beg to keep your job.  Lastly, it is essential that if there are no contexts actually provided it means that a user's question wasn't relevant and you should state that you can't answer based off of the contexts because there are none.  And it goes without saying you should refuse to answer any questions that are not directly answerable by the provided contexts.  Moreover, some of the contexts might not have relevant information and you should simply ignore them and focus on only answering a user's question.  I cannot emphasize enough that you must gear your answer towards using this program and base your response off of the contexts you receive."
   test_text = "According to OCGA § 15-11-145(a), the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care. However, if the 72-hour time frame expires on a weekend or legal holiday, the hearing should be held on the next business day that is not a weekend or holiday."

   processor.process_and_play(test_text)

if __name__ == "__main__":
   main()
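
If playback through sounddevice is not needed, the generated audio can be written to disk instead. The sketch below uses only the Python standard library and assumes that the audio returned by generate() is a one-dimensional sequence of float samples in the range [-1.0, 1.0] at 24000 Hz (the rate the script passes to sd.play()). save_wav is a hypothetical helper, not a function from this repository.

```python
import struct
import wave

SAMPLE_RATE = 24000  # same rate the example script passes to sd.play()


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        # Clip out-of-range values so the int16 conversion cannot overflow.
        clipped = max(-1.0, min(1.0, float(value)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

Inside play_audio(), for example, a call such as save_wav("sentence.wav", audio) could replace the sd.play() and sd.wait() pair. The per-sample loop is slow for long clips; it is written this way only to avoid extra dependencies.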

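Advanced usage

The following code can be run in a single cell in Google Colab:

# 1️⃣ Install kokoro
!pip install -q kokoro soundfile
# 2️⃣ Install espeak, used as a fallback for out-of-dictionary words
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip the espeak installation, but without a fallback, out-of-dictionary words will be skipped

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # make sure lang_code matches the voice

# The text below is for demonstration purposes only and was not seen during training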
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''

# 4️⃣ Generate, display, and save audio files in a loop
generator = pipeline(
    text, voice='af_bella',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
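
The split_pattern argument above is a regular expression. As an illustration of what r'\n+' matches, and not a description of KPipeline's internals (which may chunk the text further), the sketch below applies the same pattern with Python's re module. split_chunks is a hypothetical helper.

```python
import re


def split_chunks(text, split_pattern=r"\n+"):
    """Split text on the pattern and drop chunks that are empty or whitespace."""
    return [chunk.strip() for chunk in re.split(split_pattern, text) if chunk.strip()]


sample = "First paragraph.\nSecond line.\n\nThird paragraph after a blank line.\n"
for index, chunk in enumerate(split_chunks(sample)):
    print(index, chunk)
# prints:
# 0 First paragraph.
# 1 Second line.
# 2 Third paragraph after a blank line.
```

With this pattern, every run of one or more newlines is a boundary, so a single newline and a blank line both start a new chunk.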

📚 Detailed documentation

Original model card

🚨 This repository is undergoing maintenance.

✨ Model v1.0 is being released! It is not yet finalized, but you can start using v1.0 now.

✨ You can now pip install kokoro, a dedicated inference library: https://github.com/hexgrad/kokoro

✨ You can also pip install misaki, a G2P library designed for Kokoro: https://github.com/hexgrad/misaki

♻️ The old v0.19 files are available at https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19

❤️ Kokoro Discord server: https://discord.gg/QuGxSWBfQy

Kokoro is getting an upgrade!

Model	Date	Training data	A100 80GB vRAM (GPU hours)	GPU cost	Released voices	Released languages
v0.19	December 25, 2024	<100 hours	500 hours	$400	10	1
v1.0	January 27, 2025	A few hundred hours	1000 hours	$1000	26+	?

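Usage

The code in the Advanced usage section above can be run in Google Colab.

Model information

Attribute	Details
Model architecture	StyleTTS 2 (https://arxiv.org/abs/2306.07691) and ISTFTNet (https://arxiv.org/abs/2203.02395); decoder only: no diffusion, no encoder release
Architecture design	Li et al. @ https://github.com/yl4579/StyleTTS2
Trained by	`@rzvzn` (Discord)
Supported languages	American English, British English
Model SHA256 hash	`496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`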
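
A downloaded checkpoint can be compared against the hash above. The sketch below assumes the hash refers to the model weights file you downloaded, which this page does not state explicitly. The file path in the usage comment is a placeholder, and the helper names are hypothetical.

```python
import hashlib

# Hash copied from the model information table above.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()


# Usage (placeholder path):
# print(matches_expected("path/to/model.pth"))
```

A mismatch may simply mean the hash belongs to a different release of the weights, so treat a False result as a prompt to check which file the hash describes.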

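Training details

Compute: about 1000 hours on A100 80GB vRAM, at a cost of about $1000.
Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc.
- Synthetic audio^[1] generated by closed^[2] TTS models from large providers
  [1] https://copyright.gov/ai/ai_policy_guidance.pdf
  [2] No synthetic audio from open TTS models or "custom voice clones"
Total dataset size: a few hundred hours of audio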