TGiangVoice's Spark-TTS Open-source Text-to-Speech System - High-accuracy, Natural and Fluent Speech Synthesis

Tgiangvoice

Developed by thinhkosay

Spark-TTS is an advanced text-to-speech system that leverages the powerful capabilities of large language models (LLMs) to achieve highly accurate and naturally fluent speech synthesis.

Speech Synthesis

Safetensors

#Vietnamese speech synthesis #LLM-powered #Zero-shot cloning

Downloads 16

Release Time: 4/19/2025

Model Overview

This system is designed for efficiency, flexibility, and robust performance, suitable for both research and production purposes. The model is trained on the viVoice Vietnamese dataset.

Model Features

High-quality speech synthesis

Utilizes large language models to achieve highly accurate and naturally fluent speech synthesis

Efficient and flexible

Designed for efficiency and flexibility, suitable for both research and production purposes

Vietnamese language support

Speech synthesis model specifically optimized for Vietnamese

Model Capabilities

Vietnamese text-to-speech

Voice cloning

Speech synthesis

Use Cases

Speech applications

Voice assistants: provides natural speech output for Vietnamese voice assistants, producing fluent Vietnamese speech.

Audiobooks: converts Vietnamese text into audiobooks with high-quality speech output.

Voice cloning: clones a specific voice from a reference audio sample, generating output similar to the reference voice.

🚀 Spark TTS Vietnamese

Spark-TTS is an advanced text-to-speech system leveraging large language models (LLMs) for highly accurate and natural-sounding voice synthesis. It's efficient, flexible, and powerful for both research and production.

🚀 Quick Start

This model is trained on the viVoice Vietnamese dataset. To try it, install the required packages and run the inference example under Usage Examples.

📦 Installation

First, install the required packages:

pip install --upgrade transformers accelerate torch soundfile
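The inference example in this card imports transformers, torch, and soundfile, and the install command lists accelerate. As a minimal sketch, the following check reports which of those modules cannot be found in the current environment. The helper name `missing_modules` is hypothetical and is not part of Spark-TTS or Transformers.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec() locates a module without importing it, so the check stays cheap.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Modules used by the install command and the inference example in this card.
    required = ["transformers", "accelerate", "torch", "soundfile"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any module is reported missing, install it with pip before running the inference example.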

💻 Usage Examples

Basic Usage

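We have customized the code so that you can run inference through the Hugging Face Transformers library, without installing any additional model-specific package. The example below also imports torch and soundfile, so both must be installed.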

from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch

# Use the GPU when available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "DragonLineageAI/Vi-SparkTTS-0.5B"

# trust_remote_code=True runs the custom processor/model code shipped in the model repo.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Move the model to the same device as the inputs to avoid a device mismatch.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
processor.model = model

prompt_audio_path = "path_to_audio_path"  # CHANGE TO YOUR ACTUAL PATH
prompt_transcript = "text corresponding to prompt audio"  # Optional
text_input = "xin chào mọi người chúng tôi là Nguyễn Công Tú Anh và Chu Văn An đến từ dragonlineageai"

# Build model inputs from the target text, the reference audio, and its transcript.
inputs = processor(
    text=text_input.lower(),
    prompt_speech_path=prompt_audio_path,
    prompt_text=prompt_transcript,
    return_tensors="pt"
).to(device)
# Remove the global prompt tokens from the generate() inputs; they are needed again when decoding.
global_tokens_prompt = inputs.pop("global_token_ids_prompt", None)

# Sample speech tokens without tracking gradients.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=3000,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Convert the generated token IDs back into a waveform.
output_clone = processor.decode(
    generated_ids=output_ids,
    global_token_ids_prompt=global_tokens_prompt,
    input_ids_len=inputs["input_ids"].shape[-1]
)

# Save the synthesized audio as a WAV file.
sf.write("output_cloned.wav", output_clone["audio"], output_clone["sampling_rate"])
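The example above samples with temperature=0.8, top_k=50, and top_p=0.95. To give a rough intuition for what these settings do, here is a simplified, pure-Python sketch of how a next-token distribution might be narrowed before sampling. It is an illustration only and not the Transformers implementation; the function name `filter_distribution` and the toy logits are hypothetical.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens, then renormalise.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    k_probs = {i: probs[i] / k_total for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens so their probabilities sum to 1.
    kept_total = sum(k_probs[i] for i in kept)
    return {i: k_probs[i] / kept_total for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
    print(filter_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.95))
```

In this sketch, lowering temperature, top_k, or top_p narrows the set of candidate tokens, which generally trades variety for stability in the generated speech. Raising them does the opposite.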

Advanced Usage

You can fine-tune this model on any dataset to improve quality or to train it on a new language, using the project's training code.

📄 License

This project is licensed under the CC BY-NC-ND 4.0 license.
