Vi-SparkTTS-0.5B Open-Source Text-to-Speech System - High-Precision, Natural and Fluent Speech Synthesis

Vi SparkTTS 0.5B

Developed by DragonLineageAI

Spark-TTS is an advanced text-to-speech system that leverages the powerful capabilities of large language models (LLMs) to achieve high-precision and natural-sounding speech synthesis.

Speech Synthesis

Safetensors

#Vietnamese speech synthesis #LLM-powered #Zero-shot cloning

Downloads 3,804

Release Time: 3/31/2025

Model Overview

A high-quality text-to-speech system trained on the viVoice Vietnamese dataset, designed for both research and production environments with efficiency, flexibility, and robust functionality.

Model Features

High-quality speech synthesis

Utilizes large language models to achieve high-precision and natural-sounding speech synthesis

Professional dataset training

Trained on the viVoice Vietnamese professional dataset

Dual-purpose for research and production

Designed for both research and production environments, combining efficiency and flexibility

Model Capabilities

Vietnamese text-to-speech

Voice cloning

Speech synthesis

Use Cases

Speech synthesis applications

Voice assistants

Provides natural voice output for Vietnamese voice assistants. Effect: highly natural voice output.

Audiobooks

Converts Vietnamese text into audiobooks. Effect: smooth, natural reading.

🚀 Spark TTS Vietnamese

Spark-TTS is an advanced text-to-speech system that leverages the capabilities of large language models (LLMs) to deliver highly accurate and natural-sounding voice synthesis. It is engineered to be efficient, flexible, and powerful, suitable for both research and production purposes. This model is trained on the viVoice Vietnamese dataset.

🚀 Quick Start

📦 Installation

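First, install the required packages. The usage example below also imports torch and soundfile, so install those as well if they are not already in your environment: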

pip install --upgrade transformers accelerate
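Before loading the model, it can help to confirm that the modules the usage example imports are actually present. The following is a minimal sketch using only the standard library; the `REQUIRED_MODULES` list and the `missing_modules` helper are illustrative names that do not come from the model card.

```python
import importlib.util

# Third-party modules the usage example relies on (assumed list, adjust as needed).
REQUIRED_MODULES = ["transformers", "accelerate", "torch", "soundfile"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that each module can be found; it does not verify versions or GPU support.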

💻 Usage Examples

🔍 Basic Usage

We have customized the code so that you can run inference directly with the Hugging Face Transformers library, with no additional installation steps.

from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch

# Use the GPU when available; otherwise fall back to the CPU (slower).
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "DragonLineageAI/Vi-SparkTTS-0.5B"

# trust_remote_code=True lets Transformers load the custom code shipped in the model repository.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Move the model to the same device as the inputs to avoid a device mismatch.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
processor.model = model

prompt_audio_path = "path_to_audio_path"  # CHANGE TO YOUR ACTUAL PATH
prompt_transcript = "text corresponding to prompt audio"  # Optional
text_input = "xin chào mọi người chúng tôi là Nguyễn Công Tú Anh và Chu Văn An đến từ dragonlineageai"

# Build the model inputs from the target text and the reference (prompt) audio.
inputs = processor(
    text=text_input.lower(),
    prompt_speech_path=prompt_audio_path,
    prompt_text=prompt_transcript,
    return_tensors="pt"
).to(device)
# Set the prompt's global tokens aside; they are needed again when decoding.
global_tokens_prompt = inputs.pop("global_token_ids_prompt", None)

# Sample speech tokens from the language model.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=3000,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=processor.tokenizer.pad_token_id
    )

# Convert the generated tokens back into a waveform.
output_clone = processor.decode(
    generated_ids=output_ids,
    global_token_ids_prompt=global_tokens_prompt,
    input_ids_len=inputs["input_ids"].shape[-1]
)

# Save the synthesized audio as a WAV file.
sf.write("output_cloned.wav", output_clone["audio"], output_clone["sampling_rate"])
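The example above lowercases the input text before passing it to the processor. For longer passages, it may also be useful to normalize the text and split it into smaller pieces before synthesis. The sketch below shows one possible way to do that with the standard library. The `normalize_text` and `split_sentences` helpers, the NFC normalization step, and the `max_chars` limit are assumptions made for illustration; the model card does not say whether the processor needs or benefits from them.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFC normalization, lowercase, and collapse runs of whitespace."""
    # NFC composes base letters and combining marks (an assumed preprocessing step).
    text = unicodedata.normalize("NFC", text)
    # Mirrors the text_input.lower() call in the usage example.
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Split after '.', '!' or '?', then pack sentences into chunks.

    Each chunk holds as many whole sentences as fit within max_chars.
    A single sentence longer than max_chars is kept whole, not cut.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk could then be passed as the `text` argument in the usage example and the waveforms joined afterwards. Whether chunked synthesis keeps prosody consistent across chunks has not been verified here.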

🔍 Advanced Usage

You can fine-tune this model with any dataset to enhance its quality or train it on a new language. Check out the training code.

📄 License

This project is licensed under the CC BY-NC-ND 4.0 license.
