Fish-speech-1.5 Open-source Text-to-Speech Model - Multilingual Voice Generation Based on Over a Million Hours of Data

Fish Speech 1.5

Developed by jkeisling

Leading text-to-speech (TTS) model trained on over 1 million hours of multilingual audio data

Speech Synthesis

Safetensors

#Supports Multiple Languages #Multilingual TTS #Million-hour Training #Non-commercial Use

Downloads 194

Release Date: 12/7/2024

Model Overview

Fish Speech V1.5 is a high-performance multilingual text-to-speech model supporting 13 languages. This release repackages the official weights for compatibility with the Rust ecosystem.

Model Features

Multilingual Support

Supports 13 languages including Chinese, English, Japanese, and other major languages

Large-scale Training

Trained on over 1 million hours of multilingual audio data

Rust Ecosystem Compatibility

Weights repackaged to load in the fish-speech.rs framework and Candle.rs

Safe Weight Format

Stores weights in the .safetensors format, which avoids the arbitrary code execution risk of pickle-based checkpoints

Model Capabilities

High-quality text-to-speech

Multilingual speech synthesis

Supports speech synthesis in 13 languages

Use Cases

Speech Synthesis

Multilingual Voice Assistant

Provides natural voice output for multilingual applications

High-quality, natural-sounding speech synthesis

Audiobook Generation

Automatically converts text into audiobooks in multiple languages

Supports multiple languages and pronunciation styles

🚀 Fish Speech V1.5 Reformatted

This is a reformatted version of the official Fish Speech v1.5 weights, designed to work seamlessly with fish-speech.rs.

✨ Features

I've made the following changes, for better compatibility with Candle.rs and the HuggingFace ecosystem:

DualAR transformer weights converted to .safetensors for safety and easier loading
Tokenizer ported from Tiktoken format and custom wrapper to HuggingFace Tokenizers for easier downstream use
VQGAN is unchanged from v1.4, so I copied the weight-norm merged safetensors and FireflyGAN config from my previous conversion
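
To illustrate why .safetensors is easier and safer to load than a pickled checkpoint, here is a minimal sketch of the file layout: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. Nothing in the file is executable. The tensor name `toy.weight` and both helper functions are made up for this illustration; they are not part of fish-speech.rs or the official Fish Speech code, and real projects would normally use the `safetensors` library instead.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_minimal_safetensors(path, name, shape, values):
    """Write one float32 tensor using the safetensors layout (illustration only)."""
    data = struct.pack(f"<{len(values)}f", *values)  # raw little-endian float32 bytes
    header = {
        name: {"dtype": "F32", "shape": shape, "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header size
        f.write(header_bytes)                          # JSON tensor index
        f.write(data)                                  # tensor payload


def read_safetensors_header(path):
    """Read only the JSON header; no tensor data is loaded and no code is run."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def demo():
    # Hypothetical toy tensor, written to a temporary file and inspected.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.safetensors"
        write_minimal_safetensors(path, "toy.weight", [2, 2], [1.0, 2.0, 3.0, 4.0])
        return read_safetensors_header(path)


if __name__ == "__main__":
    print(demo())
```

Because the header is plain JSON, a loader can list tensor names, dtypes, and shapes before reading any weights, which is part of what makes the format convenient for non-Python runtimes.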

⚠️ Important Note

Please respect the original license and do not use this model for commercial purposes. You can support Fish Audio by using the official API at fish.audio.

These weights WILL NOT work with the official Fish Speech inference code!

📚 Documentation

Fish Speech V1.5 is a leading text-to-speech (TTS) model trained on more than 1 million hours of audio data in multiple languages.

Supported Languages

Language	Hours of Training Data
English (en)	>300k hours
Chinese (zh)	>300k hours
Japanese (ja)	>100k hours
German (de)	~20k hours
French (fr)	~20k hours
Spanish (es)	~20k hours
Korean (ko)	~20k hours
Arabic (ar)	~20k hours
Russian (ru)	~20k hours
Dutch (nl)	<10k hours
Italian (it)	<10k hours
Polish (pl)	<10k hours
Portuguese (pt)	<10k hours
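
An application that wraps the model may want to reject unsupported language codes before synthesis. The sketch below simply encodes the table above as a lookup; the names `SUPPORTED_LANGUAGES`, `is_supported`, and `languages_with_tier` are hypothetical helpers for illustration and do not exist in fish-speech.rs or the official inference code.

```python
# Language code -> (name, approximate training hours), copied from the table above.
SUPPORTED_LANGUAGES = {
    "en": ("English", ">300k"),
    "zh": ("Chinese", ">300k"),
    "ja": ("Japanese", ">100k"),
    "de": ("German", "~20k"),
    "fr": ("French", "~20k"),
    "es": ("Spanish", "~20k"),
    "ko": ("Korean", "~20k"),
    "ar": ("Arabic", "~20k"),
    "ru": ("Russian", "~20k"),
    "nl": ("Dutch", "<10k"),
    "it": ("Italian", "<10k"),
    "pl": ("Polish", "<10k"),
    "pt": ("Portuguese", "<10k"),
}


def is_supported(code: str) -> bool:
    """Return True if the language code appears in the table."""
    return code.strip().lower() in SUPPORTED_LANGUAGES


def languages_with_tier(tier: str) -> list[str]:
    """List the codes whose training-data tier matches exactly, in table order."""
    return [code for code, (_, hours) in SUPPORTED_LANGUAGES.items() if hours == tier]


if __name__ == "__main__":
    print(is_supported("ja"))
    print(languages_with_tier("<10k"))
```

Languages in the lowest data tier may be worth extra listening tests before deployment, since the table suggests they saw far less training audio than English or Chinese.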

Please refer to the Fish Speech GitHub repository for more information.
A demo is available at Fish Audio.

📄 License

This model is licensed under the CC BY-NC-SA 4.0 license, which does not permit commercial use.

Citation

If you found this repository useful, please consider citing this work:

@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis}, 
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156}, 
}
