🚀 CiSiMi: A Text-to-Speech (TTS) Model
🚀 Quick Start
CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.
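As a minimal sketch of getting started, the quantized weights can be fetched like this (this just condenses the first step of the full example under Usage Examples; install the dependencies listed there first):

```python
from huggingface_hub import hf_hub_download

# Download the Q8_0 GGUF weights (~500M parameters) from the model repo
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)
print(f"Model downloaded to: {model_path}")
# See the Usage Examples section below for the full text + audio inference pipeline.
```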
✨ Features
- Can process text inputs and respond with both text and audio.
- Designed to run efficiently on CPU, suitable for resource-constrained environments.
🔧 Technical Details
Model Specifications
| Property | Details |
|---|---|
| Architecture | Based on OuteTTS-0.3-500M |
| Languages | English |
| Pipeline | Text-to-audio |
| Parameters | 500M |
| Training Dataset Size | ~15k samples |
| Future Goals | Scale to a 200k-500k sample dataset with multi-turn conversation, using both 500M and 1B parameter model variants, and add streaming for real-time use. |
Training Methodology
- Dataset Preparation:
  - Started with [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct).
  - Cleaned by removing code, mathematical expressions, and non-English content.
  - Filtered to keep only entries with input + output texts of 256 tokens or less.
- Audio Generation:
  - Converted text outputs to speech using [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M).
  - Verified each audio generation using OpenAI Whisper (a rough sketch of the filtering and verification steps follows this list).
  - Published the resulting dataset as KandirResearch/Speech2Speech.
- Model Training:
  - Preprocessed dataset using modified OuteTTS methodology ([training details](https://github.com/edwko/OuteTTS/blob/8eb0fa369df6f3c062f7084ddc33d10bc28992be/examples/training/OuteTTS-0.3/train.md)).
  - Fine-tuned [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) using Unsloth SFT.
  - Trained for 6 epochs, reaching a loss of 2.27 as a proof of concept.
  - ~~Trained for 3 epochs, reaching a loss of 2.42 as a proof of concept.~~
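For illustration, the length filtering and Whisper-based verification steps could look roughly like the sketch below. The dataset column names, tokenizer choice, and similarity threshold are assumptions made for the example (the Kokoro-82M synthesis step is omitted); this is not the exact pipeline used to build KandirResearch/Speech2Speech.

```python
from difflib import SequenceMatcher

import whisper  # pip install openai-whisper
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer used only to measure sequence lengths; assumed to match the base model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("OuteAI/OuteTTS-0.3-500M")

# 1) Load the source instruction dataset ("input"/"output" column names are assumptions)
dataset = load_dataset("gruhit-patel/alpaca_speech_instruct", split="train")

def short_enough(example, max_tokens=256):
    # Keep only entries whose combined input + output text fits in 256 tokens
    text = example["input"] + " " + example["output"]
    return len(tokenizer.encode(text)) <= max_tokens

dataset = dataset.filter(short_enough)

# 2) Verify a generated clip by transcribing it with Whisper and comparing
#    the transcript against the text the TTS model was asked to speak
asr = whisper.load_model("base")

def clip_matches_text(wav_path, target_text, threshold=0.8):
    transcript = asr.transcribe(wav_path)["text"]
    similarity = SequenceMatcher(None, transcript.lower().strip(),
                                 target_text.lower().strip()).ratio()
    return similarity >= threshold  # threshold is an illustrative choice
```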
💻 Usage Examples
Basic Usage
Example prompt: *"Explain to me how gravity works!"*
Advanced Usage
```bash
pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice
```
```python
import sys
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the model
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper function to extract audio from tokens
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split('\n'):
        if '<|audio_end|>' in line or '<|im_end|>' in line:
            continue
        if '<|' in line:
            word = line.split('<|')[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate a text + audio response for a given instruction
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt,
        temperature=0.6,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=None,
    )
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)

    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"

        extracted_text = extract_text_from_tts_output(first_part)
        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")

        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                return extracted_text, (audio_output.sr, audio_numpy)

    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:  # Running inside a notebook
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:  # Running as a plain script
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
```
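If you would rather write the reply to disk than play it back immediately, the returned NumPy array can be saved with the `soundfile` package (an addition to the example above, not part of it; install with `pip install soundfile`):

```python
import soundfile as sf

# Persist the generated reply as a WAV file; the sample rate comes from the model output
if response_audio is not None:
    sample_rate, audio = response_audio
    sf.write("cisimi_reply.wav", audio, sample_rate)
```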
📚 Documentation
Limitations
This early prototype has several areas for improvement:
- Limited training data (~15k samples).
- Basic prompt/chat template structure.
- Opportunity to optimize training hyperparameters.
- Potential for multi-turn conversation capabilities.
Potential limitation: because each spoken reply is emitted as a long sequence of audio tokens, this type of model fills up the context window quickly, which generally makes smaller models more practical for this kind of implementation.
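To put this in perspective: assuming on the order of 75 audio tokens per second of generated speech (an assumption based on common neural codec rates, not a measured figure for this model), a 20-second spoken answer alone would occupy roughly 1,500 of the 4,096 tokens used as `max_length` in the example above, leaving little room for additional turns.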
Acknowledgments & Citations
This model builds on the following open-source projects:
- [OuteAI/OuteTTS-0.3-500M](https://huggingface.co/OuteAI/OuteTTS-0.3-500M) - Base model
- [gruhit-patel/alpaca_speech_instruct](https://huggingface.co/datasets/gruhit-patel/alpaca_speech_instruct) - Initial dataset
- [hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) - TTS generation
- OpenAI Whisper - Speech verification
- Unsloth - Training optimization
📄 License
This project is licensed under the CC BY-SA 4.0 license.


