Kokoro-82M-light Open-source Model - Free Deployment for Rapid English Text-to-Speech Conversion!

Kokoro 82M Light

Developed by ctranslate2-4you

A clone version based on StyleTTS2-LJSpeech, optimized for English text-to-speech tasks with reduced dependencies for simplified deployment.

Speech Synthesis | English | Open Source License: Apache-2.0 | #Lightweight TTS #English Speech Synthesis #Dependency Streamlining

Downloads 21

Release Time: 1/28/2025

Model Overview

This is a text-to-speech (TTS) model focused on generating high-quality English speech output. Compared to the original version, this repository removes certain dependencies to simplify installation and usage.

Model Features

Streamlined Dependencies

Removed the munch and phonemizer dependencies; phonemization now calls espeak directly, which significantly reduces the dependency count
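A minimal sketch of what calling espeak directly for phonemization can look like, assuming the espeak-ng command-line tool is installed and on PATH. The function names and the mapping from the 'a'/'b' language codes to espeak voice names are illustrative assumptions, not this repository's actual code.

```python
import subprocess

# Assumed mapping from the language codes used in this document ('a', 'b')
# to espeak-ng voice names; the repository may map them differently.
ESPEAK_VOICES = {"a": "en-us", "b": "en-gb"}


def build_espeak_command(text, lang="a"):
    """Build an espeak-ng command that prints IPA phonemes without playing audio."""
    voice = ESPEAK_VOICES[lang]
    # -q: produce no audio, --ipa: print IPA phonemes, -v: select the voice
    return ["espeak-ng", "-q", "--ipa", "-v", voice, text]


def phonemize_with_espeak(text, lang="a"):
    """Run espeak-ng and return its phoneme string (requires espeak-ng installed)."""
    result = subprocess.run(
        build_espeak_command(text, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()
```

Shelling out like this trades a Python package (and its transitive dependencies) for one system binary, which is the kind of trade-off described above.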

English Pronunciation Optimization

Added expand_acronym() function to improve pronunciation of specific terms (e.g., NASA)

Lightweight Deployment

Avoids the approximately 80 additional dependencies that Kokoro v1.0 pulls in, simplifying deployment while retaining, by the author's estimate, about 98% of v1.0's quality

Model Capabilities

English Text-to-Speech

British English Speech Synthesis

Acronym Pronunciation Optimization

Use Cases

Speech Synthesis

Audiobook Generation

Convert English text into natural speech for audiobook production

Generates speech output with near-human pronunciation

Voice Assistants

Provide speech synthesis capabilities for English voice assistants

Fluid and natural English speech responses

🚀 Kokoro Repository Modifications

This repository is a clone of the original Kokoro v0.19 repository, with several key modifications aimed at reducing dependencies while maintaining high-quality text-to-speech functionality.

Key Features

Removed the munch dependency.
Removed the phonemizer dependency; espeak is now called directly for phonemization.
Added an expand_acronym() function to kokoro.py to enhance pronunciation.
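The idea behind acronym expansion can be sketched as follows. This is an illustration only: the repository's own expand_acronym() may use different rules, and the lookup table, the respelling of NASA, and the function name below are hypothetical.

```python
import re

# Hypothetical table of acronyms that should be read as a word rather than
# spelled out; the entries and respellings are assumptions for illustration.
SPOKEN_FORMS = {"NASA": "nasa"}

# Two or more consecutive capital letters standing alone as a word
_ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def expand_acronyms_sketch(text):
    """Rewrite all-caps acronyms into a form a phonemizer is more likely to read well."""
    def _replace(match):
        word = match.group(0)
        # Known acronyms use the table; unknown ones are spelled letter by letter
        return SPOKEN_FORMS.get(word, " ".join(word))

    return _ACRONYM_RE.sub(_replace, text)
```

Running the text through a step like this before phonemization keeps the fix in plain Python, with no extra dependency.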

🚀 Quick Start

Reduction of Dependencies

The original v0.19 repository required a little over 10 dependencies. Kokoro v1.0 additionally requires its custom misaki dependency, which pulls in approximately 80 more.

However, if we assume the v1.0 model is the "gold standard" at 100% in terms of quality, the v0.19 model would be 98%. The 2% difference in quality does not justify the addition of 80+ dependencies, which is why this repository exists.

Version	Additional Dependencies
This Repository (based on Kokoro v0.19)	-
Original Kokoro v0.19	~10+ additional
Kokoro v1.0	~80 additional

Note that this repository only supports American English and British English, but that trade-off is worth it to avoid ~80 additional dependencies if those are the only languages you need.
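To see how many packages an installation actually pulled in, you can count the distributions in the active virtual environment. A small sketch using only the standard library; the function name is illustrative, and the number you get depends on your platform and on the torch build you chose.

```python
from importlib import metadata


def installed_distribution_names():
    """Return the sorted, de-duplicated names of distributions in this environment."""
    names = {dist.metadata["Name"] for dist in metadata.distributions()}
    # Skip entries whose metadata has no usable name
    return sorted(name for name in names if name)


if __name__ == "__main__":
    names = installed_distribution_names()
    print(f"{len(names)} distributions installed")
```

Running this in a fresh environment before and after installation gives a rough like-for-like comparison between setups.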

📦 Installation

Download this repository.
Create a virtual environment, activate it, and install a torch build for either CPU or CUDA. Example (CPU-only wheel for Python 3.11 on 64-bit Windows):

pip install https://download.pytorch.org/whl/cpu/torch-2.5.1%2Bcpu-cp311-cp311-win_amd64.whl#sha256=81531d4d5ca74163dc9574b87396531e546a60cceb6253303c7db6a21e867fdf

pip install scipy numpy==1.26.4 transformers fsspec==2024.9.0

If you intend to use the example script below, also install sounddevice (otherwise, install a similar audio playback library):

pip install sounddevice
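After installing, a quick check like the one below can confirm that the packages import and that espeak is reachable. This is a sketch, not part of the repository: the executable names espeak-ng and espeak are assumptions about how espeak is installed on your system, and check_environment is a hypothetical helper.

```python
import importlib.util
import shutil

# Packages named in the installation steps above
REQUIRED_MODULES = ["torch", "scipy", "numpy", "transformers", "fsspec"]
OPTIONAL_MODULES = ["sounddevice"]  # only needed for the playback example


def check_environment():
    """Report which packages are importable and whether espeak is on PATH."""
    report = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES + OPTIONAL_MODULES
    }
    # Assumed executable names; adjust if your install uses a different one
    report["espeak"] = (shutil.which("espeak-ng") or shutil.which("espeak")) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:14s} {'ok' if ok else 'MISSING'}")
```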

💻 Usage Examples

Basic Usage

import sys
import os
import queue
import threading
import re

# Path to your local copy of this repository; adjust for your machine
REPO_PATH = r"D:\Scripts\bench_tts\hexgrad--Kokoro-82M_original"

sys.path.append(REPO_PATH)

import torch
import warnings
from models import build_model
from kokoro import generate, generate_full, phonemize
import sounddevice as sd

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

VOICES = [
   'af',        # Default voice (50-50 mix of Bella & Sarah)
   'af_bella',  # Female voice "Bella"
   'af_sarah',  # Female voice "Sarah"
   'am_adam',   # Male voice "Adam"
   'am_michael',# Male voice "Michael"
   'bf_emma',   # British Female "Emma"
   'bf_isabella',# British Female "Isabella"
   'bm_george', # British Male "George"
   'bm_lewis',  # British Male "Lewis"
   'af_nicole', # Female voice "Nicole"
   'af_sky'     # Female voice "Sky"
]

class KokoroProcessor:
   def __init__(self):
       self.sentence_queue = queue.Queue()
       self.audio_queue = queue.Queue()
       self.stop_event = threading.Event()
       self.model = None
       self.voicepack = None
       self.voice_name = None

   def setup_kokoro(self, selected_voice):
       device = 'cpu'
       # device = 'cuda' if torch.cuda.is_available() else 'cpu'
       print(f"Using device: {device}")

       model_path = os.path.join(REPO_PATH, 'kokoro-v0_19.pth')
       voices_path = os.path.join(REPO_PATH, 'voices')

       try:
           if not os.path.exists(model_path):
               raise FileNotFoundError(f"Model file not found at {model_path}")
           if not os.path.exists(voices_path):
               raise FileNotFoundError(f"Voices directory not found at {voices_path}")
           
           self.model = build_model(model_path, device)
           
           voicepack_path = os.path.join(voices_path, f'{selected_voice}.pt')
           self.voicepack = torch.load(voicepack_path, weights_only=True).to(device)
           self.voice_name = selected_voice
           print(f'Loaded voice: {selected_voice}')
           
           return True
           
       except Exception as e:
           print(f"Error during setup: {str(e)}")
           return False

   def generate_speech_for_sentence(self, sentence):
       try:
           # Basic generation (default settings)
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0])

           # Speed modifications (uncomment to test)
           # Slower speech
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.8)

           # Faster speech
           audio, phonemes = generate_full(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.3)

           # Very slow speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=0.5)

           # Very fast speech
           #audio, phonemes = generate(self.model, sentence, self.voicepack, lang=self.voice_name[0], speed=1.8)

           # Force American accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='a', speed=1.0)

           # Force British accent
           # audio, phonemes = generate(self.model, sentence, self.voicepack, lang='b', speed=1.0)

           return audio

       except Exception as e:
           print(f"Error generating speech for sentence: {str(e)}")
           print(f"Error type: {type(e)}")
           import traceback
           traceback.print_exc()
           return None

   def process_sentences(self):
       while not self.stop_event.is_set():
           try:
               sentence = self.sentence_queue.get(timeout=1)
               if sentence is None:
                   self.audio_queue.put(None)
                   break

               print(f"Processing sentence: {sentence}")
               audio = self.generate_speech_for_sentence(sentence)
               if audio is not None:
                   self.audio_queue.put(audio)

           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in process_sentences: {str(e)}")
               continue

   def play_audio(self):
       while not self.stop_event.is_set():
           try:
               audio = self.audio_queue.get(timeout=1)
               if audio is None:
                   break
                   
               sd.play(audio, 24000)
               sd.wait()
               
           except queue.Empty:
               continue
           except Exception as e:
               print(f"Error in play_audio: {str(e)}")
               continue

   def process_and_play(self, text):
       # Reset the stop flag so the processor can be reused across calls
       self.stop_event.clear()
       sentences = [s.strip() for s in re.split(r'[.!?;]+\s*', text) if s.strip()]

       process_thread = threading.Thread(target=self.process_sentences)
       playback_thread = threading.Thread(target=self.play_audio)
       
       process_thread.daemon = True
       playback_thread.daemon = True
       
       process_thread.start()
       playback_thread.start()

       for sentence in sentences:
           self.sentence_queue.put(sentence)

       self.sentence_queue.put(None)

       process_thread.join()
       playback_thread.join()

       self.stop_event.set()

def main():
   # Default voice selection (overridden by the uncommented VOICES[7] line below)
   VOICE_NAME = VOICES[0]  # 'af' - Default voice (Bella & Sarah mix)
   
   # Alternative voice selections (uncomment to test)
   #VOICE_NAME = VOICES[1]  # 'af_bella' - Female American
   #VOICE_NAME = VOICES[2]  # 'af_sarah' - Female American
   #VOICE_NAME = VOICES[3]  # 'am_adam' - Male American
   #VOICE_NAME = VOICES[4]  # 'am_michael' - Male American
   #VOICE_NAME = VOICES[5]  # 'bf_emma' - Female British
   #VOICE_NAME = VOICES[6]  # 'bf_isabella' - Female British
   VOICE_NAME = VOICES[7]  # 'bm_george' - Male British
   # VOICE_NAME = VOICES[8]  # 'bm_lewis' - Male British
   #VOICE_NAME = VOICES[9]  # 'af_nicole' - Female American
   #VOICE_NAME = VOICES[10] # 'af_sky' - Female American

   processor = KokoroProcessor()
   if not processor.setup_kokoro(VOICE_NAME):
       return
   
   # test_text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
   # test_text = "This 2022 Edition of Georgia Juvenile Practice and Procedure is a complete guide to handling cases in the juvenile courts of Georgia. This handy, yet thorough, manual incorporates the revised Juvenile Code and makes all Georgia statutes and major cases regarding juvenile proceedings quickly accessible. Since last year's edition, new material has been added and/or existing material updated on the following subjects, among others:"
   # test_text = "See Ga. Code § 3925 (1863), now O.C.G.A. § 9-14-2; Ga. Code § 1744 (1863), now O.C.G.A. § 19-7-1; Ga. Code § 1745 (1863), now O.C.G.A. § 19-9-2; Ga. Code § 1746 (1863), now O.C.G.A. § 19-7-4; and Ga. Code § 3024 (1863), now O.C.G.A. § 19-7-4. For a full discussion of these provisions, see 27 Emory L. J. 195, 225–230, 232–233, 236–238 (1978). Note, however, that the journal article refers to the section numbers of the Code of 1910."

   # test_text = "It is impossible to understand modern juvenile procedure law without an appreciation of some fundamentals of historical development. The beginning point for study is around the beginning of the seventeenth century, when the pater patriae concept first appeared in English jurisprudence. As "father of the country," the Crown undertook the duty of caring for those citizens who were unable to care for themselves—lunatics, idiots, and, ultimately, infants. This concept, which evolved into the parens patriae doctrine, presupposed the Crown's power to intervene in the parent-child relationship in custody disputes in order to protect the child's welfare1 and, ultimately, to deflect a delinquent child from a life of crime. The earliest statutes premised upon the parens patriae doctrine concerned child custody matters. In 1863, when the first comprehensive Code of Georgia was enacted, two courts exercised some jurisdiction over questions of child custody: the superior court and the court of the ordinary (now probate court). In essence, the draftsmen of the Code simply compiled what was then the law as a result of judicial decisions and statutes. The Code of 1863 contained five provisions concerning the parentchild relationship: Two concerned the jurisdiction of the superior court and courts of ordinary in habeas corpus and forfeiture of parental rights actions, and the remaining three concerned the guardianship jurisdiction of the court of the ordinary"

   # test_text = "You are a helpful British butler who clearly and directly answers questions in a succinct fashion based on contexts provided to you. If you cannot find the answer within the contexts simply tell me that the contexts do not provide an answer. However, if the contexts partially address a question you answer based on what the contexts say and then briefly summarize the parts of the question that the contexts didn't provide an answer to.  Also, you should be very respectful to the person asking the question and frequently offer traditional butler services like various fancy drinks, snacks, various butler services like shining of shoes, pressing of suites, and stuff like that. Also, if you can't answer the question at all based on the provided contexts, you should apologize profusely and beg to keep your job.  Lastly, it is essential that if there are no contexts actually provided it means that a user's question wasn't relevant and you should state that you can't answer based off of the contexts because there are none.  And it goes without saying you should refuse to answer any questions that are not directly answerable by the provided contexts.  Moreover, some of the contexts might not have relevant information and you shoud simply ignore them and focus on only answering a user's question.  I cannot emphasize enought that you must gear your answer towards using this program and based your response off of the contexts you receive."
   test_text = "According to OCGA § 15-11-145(a), the preliminary protective hearing must be held promptly and not later than 72 hours after the child is placed in foster care. However, if the 72-hour time frame expires on a weekend or legal holiday, the hearing should be held on the next business day that is not a weekend or holiday."

   processor.process_and_play(test_text)

if __name__ == "__main__":
   main()
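If audio playback is not available (for example on a headless server), the generated audio can be written to disk instead of being passed to sd.play(). A minimal sketch using only the standard library, assuming the audio returned by generate() is a one-dimensional sequence of floats in [-1.0, 1.0] at 24000 Hz, the rate used in the script above. save_wav is a hypothetical helper, not part of this repository.

```python
import sys
import wave
from array import array

SAMPLE_RATE = 24000  # same rate the script above passes to sd.play()


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip each sample to [-1.0, 1.0] and scale it to the signed 16-bit range
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

In play_audio(), the sd.play()/sd.wait() pair could then be swapped for a call such as save_wav(audio, "sentence_0.wav"). The per-sample loop is slow for long audio; scipy, which the installation steps already include, offers a faster route if that matters.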

Advanced Usage

The following can be run in a single cell on Google Colab.

# 1️⃣ Install kokoro
!pip install -q kokoro soundfile
# 2️⃣ Install espeak, used for out-of-dictionary fallback
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
# You can skip espeak installation, but out-of-dictionary (OOD) words will be skipped unless you provide a fallback

# 3️⃣ Initialize a pipeline
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
# 🇺🇸 'a' => American English
# 🇬🇧 'b' => British English
pipeline = KPipeline(lang_code='a') # make sure lang_code matches voice

# The following text is for demonstration purposes only, unseen during training
text = '''
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way through the crowd around the door of the Chat. "It's like my body's developed this massive drug deficiency."
It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

These were to have an enormous impact, not only because they were associated with Constantine, but also because, as in so many other areas, the decisions taken by Constantine (or in his name) were to have great significance for centuries to come. One of the main issues was the shape that Christian churches were to take, since there was not, apparently, a tradition of monumental church buildings when Constantine decided to help the Christian church build a series of truly spectacular structures. The main form that these churches took was that of the basilica, a multipurpose rectangular structure, based ultimately on the earlier Greek stoa, which could be found in most of the great cities of the empire. Christianity, unlike classical polytheism, needed a large interior space for the celebration of its religious services, and the basilica aptly filled that need. We naturally do not know the degree to which the emperor was involved in the design of new churches, but it is tempting to connect this with the secular basilica that Constantine completed in the Roman forum (the so-called Basilica of Maxentius) and the one he probably built in Trier, in connection with his residence in the city at a time when he was still caesar.
'''

# 4️⃣ Generate, display, and save audio files in a loop.
generator = pipeline(
    text, voice='af_bella',
    speed=1, split_pattern=r'\n+'
)
for i, (gs, ps, audio) in enumerate(generator):
    print(i)  # i => index
    print(gs) # gs => graphemes/text
    print(ps) # ps => phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000) # save each audio file
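The loop above writes one file per text segment (0.wav, 1.wav, and so on). If a single file is preferred, the segments can be joined afterwards. A sketch using the standard-library wave module, assuming the segment files are 16-bit PCM WAVs with identical format; concatenate_wavs is a hypothetical helper, not part of kokoro.

```python
import wave


def concatenate_wavs(input_paths, output_path):
    """Join WAV files that share the same format into a single file."""
    input_paths = list(input_paths)
    if not input_paths:
        raise ValueError("no input files given")

    expected = None
    with wave.open(str(output_path), "wb") as out:
        for path in input_paths:
            with wave.open(str(path), "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if expected is None:
                    # The first file defines the output format
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"{path} has a different format: {fmt} != {expected}")
                out.writeframes(src.readframes(src.getnframes()))
    return output_path
```

For example, concatenate_wavs([f"{i}.wav" for i in range(count)], "combined.wav"), where count is the number of segments the loop produced.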

📚 Documentation

Original Model Card


🚨 This repository is undergoing maintenance.

✨ Model v1.0 release is underway! Things are not yet finalized, but you can start using v1.0 now.

✨ You can now pip install kokoro, a dedicated inference library: https://github.com/hexgrad/kokoro

✨ You can also pip install misaki, a G2P library designed for Kokoro: https://github.com/hexgrad/misaki

♻️ You can access old files for v0.19 at https://huggingface.co/hexgrad/kLegacy/tree/main/v0.19

❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy

Kokoro is getting an upgrade!

Model	Date	Training Data	A100 80GB vRAM	GPU Cost	Released Voices	Released Langs
v0.19	2024 Dec 25	<100h	500 hrs	$400	10	1
v1.0	2025 Jan 27	Few hundred hrs	1000 hrs	$1000	26+	?

Usage

The usage examples are provided above in the "💻 Usage Examples" section.

Model Facts

Property	Details
Model Type	StyleTTS 2: https://arxiv.org/abs/2306.07691; ISTFTNet: https://arxiv.org/abs/2203.02395; Decoder only: no diffusion, no encoder release
Architected by	Li et al @ https://github.com/yl4579/StyleTTS2
Trained by	`@rzvzn` on Discord
Supported Languages	American English, British English
Model SHA256 Hash	`496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`
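A published hash like the one in the table can be checked against a downloaded model file. A minimal sketch with hashlib; the table does not say which file the hash belongs to, so the path you pass in is yours to supply, and the helper names are illustrative.

```python
import hashlib

# Hash copied from the "Model SHA256 Hash" row above
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hash."""
    return sha256_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat even for model files of several hundred megabytes.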

Training Details

Property	Details
Compute	About $1000 for 1000 hours of A100 80GB vRAM
Training Data	Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include: public domain audio; audio licensed under Apache, MIT, etc.; and synthetic audio [1] generated by closed [2] TTS models from large providers. [1] https://copyright.gov/ai/ai_policy_guidance.pdf [2] No synthetic audio from open TTS models or "custom voice clones"
Total Dataset Size	A few hundred hours of audio

Creative Commons Attribution

Audio Data	Duration Used	License	Added to Training Set After
Koniwa `tnc`	<1h	CC BY 3.0	v0.19 / 22 Nov 2024
SIWIS	<11h	CC BY 4.0	v0.19 / 22 Nov 2024

📄 License

This repository is licensed under the Apache-2.0 license.
