🚀 Orpheus-3b-Chinese-FT-Q8_0
Orpheus-3b-Chinese-FT-Q8_0 is a quantized Text-to-Speech model that converts text into natural-sounding Mandarin speech, with support for multiple voices and emotional expressions. This version is optimized for efficient inference while maintaining high-quality output.
🚀 Quick Start
- Download this quantized model from lex-au's Orpheus-FASTAPI collection.
- Load the model in your preferred inference server and start the server.
- Clone the Orpheus-FastAPI repository:
git clone https://github.com/Lex-au/Orpheus-FastAPI.git
cd Orpheus-FastAPI
- Configure the FastAPI server to connect to your inference server by setting the `ORPHEUS_API_URL` environment variable (see the example sketch after this list).
- Follow the complete installation and setup instructions in the repository README.
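As a rough sketch, pointing the frontend at a locally running inference server might look like this; the host, port, and path below are placeholders, so substitute the completions endpoint your inference server actually exposes:

```bash
# Placeholder values - replace with the host, port, and path of your own inference server.
export ORPHEUS_API_URL="http://127.0.0.1:5006/v1/completions"

# Then start the Orpheus-FastAPI frontend, following the startup
# instructions in the repository README.
```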
✨ Features
- Two voice options with distinct characteristics.
- Support for emotion tags like laughter, sighs, etc.
- Optimized for CUDA acceleration on RTX GPUs.
- Produces high-quality 24kHz mono audio.
- Fine-tuned for conversational naturalness.
📦 Installation
Compatible Inference Servers
This quantized model can be loaded into any of these LLM inference servers:
- GPUStack - GPU-optimized LLM inference server (My pick) - supports LAN/WAN tensor-split parallelization.
- LM Studio - Load the GGUF model and start the local server.
- llama.cpp server - Run with the appropriate model parameters (see the example command after this list).
- Any other OpenAI API-compatible server.
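For the llama.cpp option, a launch command could look roughly like the sketch below; the model path, context size, port, and GPU-offload settings are illustrative and should be tuned for your hardware:

```bash
# Illustrative llama.cpp server launch - adjust paths and parameters for your setup.
./llama-server \
  -m ./Orpheus-3b-Chinese-FT-Q8_0.gguf \
  -c 4096 \
  --port 5006 \
  --n-gpu-layers 99
```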
💻 Usage Examples
Basic Usage
This model is designed to be served by an LLM inference server, with the Orpheus-FastAPI frontend connecting to that server and exposing both a web UI and OpenAI-compatible API endpoints.
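A minimal request sketch is shown below. It assumes the Orpheus-FastAPI frontend is listening on localhost port 5005 and exposes an OpenAI-style `/v1/audio/speech` route; the exact host, port, and path depend on your deployment, so check the repository README.

```bash
# Hypothetical request - host, port, and route depend on your Orpheus-FastAPI deployment.
curl http://127.0.0.1:5005/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "input": "你好，很高兴见到你。",
        "voice": "长乐",
        "response_format": "wav"
      }' \
  --output hello.wav
```

The `voice` field selects one of the two voices listed below, and the returned file is 24kHz mono WAV as noted under Features.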
Available Voices
The model supports two voices:
- 长乐: Female, Mandarin, gentle.
- 白芷: Female, Mandarin, clear.
Emotion Tags
You can add expressiveness to speech by inserting tags into the input text (see the request sketch after this list):
- `<laugh>`, `<chuckle>`: For laughter sounds.
- `<sigh>`: For sighing sounds.
- `<cough>`, `<sniffle>`: For subtle interruptions.
- `<groan>`, `<yawn>`, `<gasp>`: For additional emotional expression.
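Tags are embedded directly in the input text. The sketch below reuses the hypothetical endpoint from the Basic Usage example; as there, the host, port, and route are assumptions rather than fixed values:

```bash
# Emotion tags are placed inline in the text to be spoken.
curl http://127.0.0.1:5005/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "input": "今天真是太有意思了 <laugh> 不过我也有点累了 <sigh>",
        "voice": "白芷",
        "response_format": "wav"
      }' \
  --output expressive.wav
```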
📚 Documentation
Model Description
Orpheus-3b-Chinese-FT-Q8_0 is a 3-billion-parameter Text-to-Speech model that converts text input into natural-sounding speech, with support for multiple voices and emotional expressions. The model has been quantized to 8-bit (Q8_0) format for efficient inference, making it accessible on consumer hardware.
🔧 Technical Details
| Property | Details |
|----------|---------|
| Model Type | Specialised token-to-audio sequence model |
| Parameters | ~3 billion |
| Quantisation | 8-bit (GGUF Q8_0 format) |
| Audio Sample Rate | 24kHz |
| Input | Text with optional voice selection and emotion tags |
| Output | High-quality WAV audio |
| Language | Mandarin |
| Hardware Requirements | CUDA-compatible GPU (recommended: RTX series) |
| Integration Method | External LLM inference server + Orpheus-FastAPI frontend |
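If you want to confirm the 24kHz mono WAV output noted above, a generated file can be inspected with a standard tool such as ffprobe (assuming FFmpeg is installed; `hello.wav` is the hypothetical output from the earlier example):

```bash
# Print the stream details (sample rate, channels, codec) of a generated file.
ffprobe -hide_banner hello.wav
```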
⚠️ Limitations
- Best performance achieved on CUDA-compatible GPUs.
- Generation speed depends on GPU capability.
📄 License
This model is available under the Apache License 2.0.
📚 Citation & Attribution
The original Orpheus model was created by Canopy Labs. This repository contains a quantized version optimized for use with the Orpheus-FastAPI server.
If you use this quantized model in your research or applications, please cite:
@misc{orpheus-tts-2025,
author = {Canopy Labs},
title = {Orpheus-3b-0.1-ft: Text-to-Speech Model},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/canopylabs/orpheus-3b-0.1-ft}}
}
@misc{orpheus-quantised-2025,
author = {Lex-au},
title = {Orpheus-3b-FT-Q8_0: Quantised TTS Model with FastAPI Server},
note = {GGUF quantisation of canopylabs/orpheus-3b-0.1-ft},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/lex-au/Orpheus-3b-FT-Q8_0.gguf}}
}