An experimental Japanese speech synthesis model that uses the Parler-TTS prompt architecture and the XCodec2 audio decoder, allowing pitch and background-noise adjustment through control prompts.
Model Features
Prompt Control
Fine-tune voice quality by modifying the control prompt and the reading prompt
Lightweight Design
A 150M-parameter model suitable for deployment in resource-constrained environments
High-Quality Audio Output
Uses the XCodec2 audio decoder to ensure speech quality
Model Capabilities
Japanese Speech Synthesis
Pitch Adjustment
Background Noise Control
Text-to-Speech
Use Cases
Voice Interaction
Virtual Assistant
Provides natural speech output for Japanese virtual assistants, including speech with emotional characteristics
Content Creation
Audio Content Generation
Automatically converts Japanese text to speech, supporting output with different tones and intonations
🚀 Canary-TTS-150M
Canary-TTS-150M is a text-to-speech (TTS) model trained on top of llm-jp/llm-jp-3-150m-instruct3. It adopts the same prompting method as Parler-TTS, allowing fine-grained control of voice quality by changing the control prompt and the reading prompt. It is an experimental model created in the course of training Canary-TTS 0.5B, so using Canary-TTS 0.5B is recommended instead.
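Below is a minimal, untested inference sketch of how such a pipeline might look. It assumes the standard transformers causal-LM interface inherited from the llm-jp base model and the XCodec2 reference decoder; the repository IDs, prompt template, audio-token offset, and 16 kHz sampling rate are placeholders or assumptions rather than this model's documented interface, so consult the official Canary-TTS examples for the exact usage.

```python
# Minimal inference sketch (untested). Repo IDs, prompt template, audio-token
# offset, and sampling rate are ASSUMPTIONS; see the official examples.
import torch
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoTokenizer
from xcodec2.modeling_xcodec2 import XCodec2Model  # decoder import per XCodec2's own examples

model_id = "<canary-tts-150m repo id>"   # placeholder: substitute the actual repository ID
codec_id = "<xcodec2 checkpoint id>"     # placeholder: the public XCodec2 checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
codec = XCodec2Model.from_pretrained(codec_id).eval()

# Parler-TTS-style prompting: a control prompt describing the desired voice
# and a reading prompt with the text to speak (template is illustrative only).
control_prompt = "落ち着いた低めの声で、背景ノイズのないクリアな音声。"
reading_prompt = "こんにちは、今日はいい天気ですね。"
prompt = f"{control_prompt}\n{reading_prompt}"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True)

# Assumed post-processing: keep only newly generated tokens and shift them back
# into the XCodec2 codebook range (the offset depends on the model's vocabulary).
AUDIO_TOKEN_OFFSET = 0  # placeholder value
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
codes = (new_tokens - AUDIO_TOKEN_OFFSET).unsqueeze(0).unsqueeze(0)  # (1, 1, T)

with torch.no_grad():
    waveform = codec.decode_code(codes)  # (1, 1, samples)

sf.write("output.wav", waveform[0, 0].cpu().numpy(), 16000)  # assumed 16 kHz output
```

In this scheme the control prompt describes the desired voice (for example pitch, background noise, speaking style), while the reading prompt carries the Japanese text to be read aloud.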
The creator makes no guarantees regarding the accuracy, legality, or appropriateness of the results obtained from using this model.
When using this model, users must comply with all applicable laws and regulations. All responsibilities arising from the generated content shall be borne by the user.
The creator of this repository and the model shall not be held liable for any copyright infringement or other legal issues.
In the event of a copyright issue, the problematic resources or data will be promptly deleted.