A lightweight model focused on Japanese text-to-speech. Control prompts have been removed to leave room for subsequent fine-tuning, and the core architecture is based on Llama so that large-language-model techniques can be migrated directly.
## Model Features

- **Streamlined parameter design**: Reduces the parameter count by removing the control-prompt layer
- **LLM-compatible architecture**: Built on the Llama architecture, so techniques from large language models carry over easily (see the loading sketch after this list)
- **Audio quality optimization**: Uses OuteAI's efficient audio decoder for speech synthesis
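Because the checkpoint follows a standard Llama layout, it should load through the usual `transformers` auto classes. Below is a minimal loading sketch; the repo id is a placeholder (not given in this card) and should be replaced with the actual Hugging Face repository name.

```python
# Minimal loading sketch -- the repo id below is a placeholder, not confirmed by this card.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

REPO_ID = "canary-tts-0.5b"  # placeholder: replace with the actual Hugging Face repo id

config = AutoConfig.from_pretrained(REPO_ID)
print(config.model_type)  # expected to report "llama", reflecting the Llama-compatible layout

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForCausalLM.from_pretrained(REPO_ID, torch_dtype="auto")
model.eval()
```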
## Model Capabilities

- Japanese speech synthesis (see the generation sketch after this list)
- Random voice generation
- Fine-tuning to a specified voice
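These capabilities all come down to autoregressive generation of audio tokens from a Japanese text prompt. The sketch below shows one plausible shape of that step, reusing `model` and `tokenizer` from the loading sketch above; the prompt format, token layout, and generation settings are assumptions, not documented behaviour.

```python
import torch

# Assumption: the model takes plain Japanese text and continues it with audio tokens.
# The real prompt template may differ; check the repository's own examples.
text = "こんにちは、今日はいい天気ですね。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=1024,  # audio token sequences are long; tune for clip length
        do_sample=True,       # sampling rather than greedy decoding yields a "random voice"
        temperature=0.8,
        top_p=0.95,
    )

# Everything after the text prompt is treated as audio tokens for the audio decoder.
audio_token_ids = generated[0, inputs["input_ids"].shape[1]:]
print(audio_token_ids.shape)
```

Since the control-prompt layer was removed, no speaker conditioning is supplied here: repeated sampled runs produce different voices, and fine-tuning is the mechanism for pinning the output to a specific voice.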
## Use Cases

### Voice interaction
- **Virtual assistant voice**: Provides basic speech synthesis for Japanese virtual assistants. Out-of-the-box audio quality is rough but can be improved through fine-tuning.

### Content creation
- **Audio content generation**: Automatically converts Japanese text into speech. Subsequent fine-tuning is needed for better results (see the fine-tuning sketch after this list).
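Both use cases assume fine-tuning on recordings of the target voice. A minimal sketch of what that could look like, assuming training pairs of Japanese text and pre-extracted audio-token ids; the data format, loop structure, and hyperparameters here are illustrative assumptions, not the author's recipe.

```python
import torch
from torch.optim import AdamW

# Illustrative assumption: each example is the text prompt followed by the target
# speaker's audio tokens, already mapped to ids in the model's vocabulary.
text_ids = tokenizer("おはようございます。", add_special_tokens=False)["input_ids"]
audio_ids = [0, 1, 2]  # stand-in: real audio-token ids produced by the audio tokenizer
examples = [{"text_ids": text_ids, "audio_ids": audio_ids}]

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()

for epoch in range(3):
    for ex in examples:
        input_ids = torch.tensor([ex["text_ids"] + ex["audio_ids"]])
        # Standard causal-LM objective over the whole sequence; masking the text
        # portion of the labels (setting it to -100) is a common variant.
        labels = input_ids.clone()
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```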
# 🚀 Canary-TTS-0.5B
Canary-TTS-0.5B is a text-to-speech (TTS) base model trained on top of llm-jp/llm-jp-3-150m-instruct3. Control prompts have been removed to make room for further training, which also reduces the parameter count.
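The card does not spell out how generated tokens become a waveform, only that OuteAI's audio decoder is used. The sketch below shows the general shape of that final step with a deliberately hypothetical `decode_to_waveform` helper; the actual decoder class, token-to-code mapping, and sample rate must be taken from the repository's own code.

```python
import numpy as np
import soundfile as sf

def decode_to_waveform(audio_token_ids) -> np.ndarray:
    """Hypothetical placeholder for OuteAI's audio decoder.

    In practice this step maps the generated token ids back to codec codes and
    runs them through the pretrained neural decoder shipped with the model.
    """
    raise NotImplementedError("replace with the decoder bundled in the repository")

SAMPLE_RATE = 24_000  # assumption: a typical neural-codec rate; verify against the decoder

waveform = decode_to_waveform(audio_token_ids)  # audio_token_ids from the generation sketch above
sf.write("output.wav", waveform, SAMPLE_RATE)
```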
## Disclaimer

- **No warranty of appropriateness**: The creator makes no warranties regarding the accuracy, legality, or appropriateness of results obtained from using this model.
- **User responsibility**: Comply with all applicable laws and regulations when using this model; all responsibility for generated content rests with the user.
- **Creator's disclaimer**: The creator of this repository and model accepts no liability for copyright infringement or other legal issues.
- **Response to deletion requests**: If a copyright issue arises, the problematic resources or data will be deleted promptly.