🚀 CiSiMi: A Text-to-Speech (TTS) Model
🚀 Quick Start
CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.
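As a minimal sketch of getting started, the quantized weights can be fetched like this (this just condenses the first step of the full example under Usage Examples; install the dependencies listed there first):

```python
from huggingface_hub import hf_hub_download

# Download the Q8_0 GGUF weights (~500M parameters) from the model repo
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)
print(f"Model downloaded to: {model_path}")
# See the Usage Examples section below for the full text + audio inference pipeline.
```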
✨ Features
- Can process text inputs and respond with both text and audio.
- Designed to run efficiently on CPU, suitable for resource-constrained environments.
🔧 Technical Details
Model Specifications
| Property | Details |
|---|---|
| Architecture | Based on OuteTTS-0.3-500M |
| Languages | English |
| Pipeline | Text-to-audio |
| Parameters | 500M |
| Training Dataset Size | ~15k samples |
| Future Goals | Scale to a 200k-500k sample dataset with multi-turn conversation, using both 500M and 1B parameter model variants, and add streaming for real-time use. |
Training Methodology
- Dataset Preparation:
  - Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct).
  - Cleaned by removing code, mathematical expressions, and non-English content.
  - Filtered to keep only entries with input + output texts of 256 tokens or less.
- Audio Generation:
  - Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M).
  - Verified each audio generation using OpenAI Whisper (a rough sketch of the filtering and verification steps follows this list).
  - Published the resulting dataset as KandirResearch/Speech2Speech.
- Model Training:
  - Preprocessed dataset using modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md)).
  - Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT.
  - Trained for 6 epochs, reaching a loss of 2.27 as a proof of concept.
  - ~~Trained for 3 epochs, reaching a loss of 2.42 as a proof of concept.~~
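For illustration, the length filtering and Whisper-based verification steps could look roughly like the sketch below. The dataset column names, tokenizer choice, and similarity threshold are assumptions made for the example (the Kokoro-82M synthesis step is omitted); this is not the exact pipeline used to build KandirResearch/Speech2Speech.

```python
from difflib import SequenceMatcher

import whisper  # pip install openai-whisper
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer used only to measure sequence lengths; assumed to match the base model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")

# 1) Load the source instruction dataset ("input"/"output" column names are assumptions)
dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def short_enough(example, max_tokens=256):
    # Keep only entries whose combined input + output text fits in 256 tokens
    text = example["input"] + " " + example["output"]
    return len(tokenizer.encode(text)) <= max_tokens

dataset = dataset.filter(short_enough)

# 2) Verify a generated clip by transcribing it with Whisper and comparing
#    the transcript against the text the TTS model was asked to speak
asr = whisper.load_model("base")

def clip_matches_text(wav_path, target_text, threshold=0.8):
    transcript = asr.transcribe(wav_path)["text"]
    similarity = SequenceMatcher(None, transcript.lower().strip(),
                                 target_text.lower().strip()).ratio()
    return similarity >= threshold  # threshold is an illustrative choice
```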
💻 Usage Examples
Basic Usage
Example prompt: *"Explain to me how gravity works!"*
Advanced Usage
```bash
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
```
```python
import sys
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the model
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper function to extract audio from tokens
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split('\n'):
        if '<|audio_end|>' in line or '<|im_end|>' in line:
            continue
        if '<|' in line:
            word = line.split('<|')[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate a text + audio response for a given instruction
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt,
        temperature=0.6,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=None,
    )
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)

    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"

        extracted_text = extract_text_from_tts_output(first_part)
        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")

        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                return extracted_text, (audio_output.sr, audio_numpy)

    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:  # Running inside a notebook
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:  # Running as a plain script
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
```
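If you would rather write the reply to disk than play it back immediately, the returned NumPy array can be saved with the `soundfile` package (an addition to the example above, not part of it; install with `pip install soundfile`):

```python
import soundfile as sf

# Persist the generated reply as a WAV file; the sample rate comes from the model output
if response_audio is not None:
    sample_rate, audio = response_audio
    sf.write("cisimi_reply.wav", audio, sample_rate)
```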
📚 Documentation
Limitations
This early prototype has several areas for improvement:
- Limited training data (~15k samples).
- Basic prompt/chat template structure.
- Opportunity to optimize training hyperparameters.
- Potential for multi-turn conversation capabilities.
Potential limitation: because each spoken reply is emitted as a long sequence of audio tokens, this type of model fills up the context window quickly, which generally makes smaller models more practical for this kind of implementation.
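To put this in perspective: assuming on the order of 75 audio tokens per second of generated speech (an assumption based on common neural codec rates, not a measured figure for this model), a 20-second spoken answer alone would occupy roughly 1,500 of the 4,096 tokens used as `max_length` in the example above, leaving little room for additional turns.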
Acknowledgments & Citations
This model builds on the following open-source projects:
- [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model
- [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset
- [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation
- OpenAI Whisper - Speech verification
- Unsloth - Training optimization
📄 License
This project is licensed under the CC BY-SA 4.0 license.


