🚀 EraX-Smile-UnixSex-F5: Giving F5-TTS a Unisex Vietnamese Twist (with Online Zero-Shot Voice Cloning!) ✨
This model is based on the excellent F5-TTS architecture (arXiv:2410.06885). To capture the nuances of Vietnamese, we fine-tuned it on a large dataset of over 2,700,000 Vietnamese-only samples, combining public data with a 1,000-hour private dataset (we're very grateful for the usage rights! 🙏).
The code is fully open-source at https://github.com/EraX-AI/viF5TTS/tree/main/src
🚀 Quick Start
Installation
```bash
# Ubuntu: sudo apt install ffmpeg
# Windows: see https://www.geeksforgeeks.org/how-to-install-ffmpeg-on-windows/
# Then clone our GitHub code (link above)
pip install numpy==1.26
pip install matplotlib
pip install vinorm
pip install f5-tts
pip install librosa
```
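If you prefer fetching the checkpoints programmatically, the `huggingface_hub` library (`pip install huggingface_hub` if needed) can download individual files. A minimal sketch, assuming the repo id `erax-ai/EraX-Smile-UnixSex-F5` and the `models/` file layout shown later in this card; adjust both if the actual repo differs:

```python
# Sketch: download a checkpoint and the vocab file from the Hugging Face Hub.
# The repo id and file paths are assumptions -- adjust them to the actual repo layout.
from huggingface_hub import hf_hub_download

eraX_ckpt_path = hf_hub_download(
    repo_id="erax-ai/EraX-Smile-UnixSex-F5",    # assumed repo id
    filename="models/model_42000.safetensors",  # one of the 4 released checkpoints
)
vocab_file = hf_hub_download(
    repo_id="erax-ai/EraX-Smile-UnixSex-F5",
    filename="models/vocab.txt",
)
print(eraX_ckpt_path, vocab_file)
```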
Usage Examples
Basic Usage
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Tell it which GPU to use (or ignore if you're CPU-bound and patient!)

from vinorm import TTSnorm  # Gotta normalize that Vietnamese text first
from f5tts_wrapper import F5TTSWrapper  # Our handy wrapper class

# --- Config ---
# Path to the model checkpoint you downloaded from *this* repo
# MAKE SURE this path points to the actual .pth, .ckpt, or .safetensors file!
eraX_ckpt_path = "path/to/your/downloaded/EraX-Smile-UnixSex-F5/models/model_42000.safetensors"  # <-- CHANGE THIS!

# Path to the voice you want to clone
ref_audio_path = "path/to/your/reference_voice.wav"  # <-- CHANGE THIS!

# Path to the vocab file from this repo
vocab_file = "path/to/your/downloaded/EraX-Smile-UnixSex-F5/models/vocab.txt"  # <-- CHANGE THIS!

# Where to save the generated sound
output_dir = "output_audio"

# --- Texts ---
# Text matching the reference audio (helps the model learn the voice).
# Please make sure it matches the reference audio!
ref_text = "Thậm chí không ăn thì cũng có cảm giác rất là cứng bụng, chủ yếu là cái phần rốn...trở lên. Em có cảm giác khó thở, và ngủ cũng không ngon, thường bị ợ hơi rất là nhiều"

# The text you want the cloned voice to speak
text_to_generate = "Trong khi đó, tại một chung cư trên địa bàn P.Vĩnh Tuy (Q.Hoàng Mai), nhiều người sống trên tầng cao giật mình khi thấy rung lắc mạnh nên đã chạy xuống sảnh tầng 1. Cư dân tại đây cho biết, họ chưa bao giờ cảm thấy ảnh hưởng của động đất mạnh như hôm nay."

# --- Let's Go! ---
print("Initializing the TTS engine... (Might take a sec)")
tts = F5TTSWrapper(
    model_name="F5TTS_v1_Base",
    vocoder_name="vocos",
    ckpt_path=eraX_ckpt_path,
    vocab_file=vocab_file,
    use_ema=True,
    target_sample_rate=24000,
    n_mel_channels=100,
    hop_length=256,
    win_length=1024,
    n_fft=1024,
    ode_method="euler",
)

# Normalize the reference text (makes it easier for the model)
ref_text_norm = TTSnorm(ref_text)

# Prepare the output folder
os.makedirs(output_dir, exist_ok=True)

print("Processing the reference voice...")
# Feed the model the reference voice ONCE.
# Provide ref_text for better quality, or set ref_text="" to use Whisper for auto-transcription (if installed)
tts.preprocess_reference(
    ref_audio_path=ref_audio_path,
    ref_text=ref_text_norm,
    clip_short=True  # Keeps reference audio to a manageable length (~12s)
)
print(f"Reference audio duration used: {tts.get_current_audio_length():.2f} seconds")

# --- Generate New Speech ---
print("Generating new speech with the cloned voice...")

# Normalize the text we want to speak
text_norm = TTSnorm(text_to_generate)

# You can generate multiple sentences easily:
# just add more normalized strings to this list.
sentences = [text_norm]

for i, sentence in enumerate(sentences):
    output_path = os.path.join(output_dir, f"generated_speech_{i+1}.wav")

    # THE ACTUAL GENERATION HAPPENS HERE!
    tts.generate(
        text=sentence,
        output_path=output_path,
        nfe_step=32,               # Denoising steps. More = slower but potentially better (default: 32)
        cfg_strength=3.0,          # How strongly to stick to the reference voice style (default: 2.0)
        speed=1.0,                 # Make it talk faster or slower (default: 1.0)
        cross_fade_duration=0.12,  # Smooths transitions if text is split into chunks (default: 0.15)
        sway_sampling_coef=-1,     # Sway sampling coefficient for the ODE solver
    )
    print(f"Boom! Audio saved to: {output_path}")

print("\nAll done! Check your output folder.")
```
✨ Features
- Based on F5-TTS: Built upon the F5-TTS architecture (arXiv:2410.06885), leveraging its zero-shot voice cloning capabilities.
- Vietnamese Fine-Tuning: Fine-tuned on over 2,700,000 Vietnamese-only samples, including public and private data.
- Voice Cloning: Supports zero-shot voice cloning for both male and female voices.
- Multiple Model Checkpoints: The repo contains 4 checkpoints (model_42000.safetensors, model_45000.safetensors, model_48000.safetensors, overfit.safetensors) for you to try; see the comparison sketch below.
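To pick a favorite, a quick approach is to synthesize the same sentence with each checkpoint and compare by ear. A minimal sketch, reusing `F5TTSWrapper` and the variables (`vocab_file`, `ref_audio_path`, `ref_text_norm`, `text_norm`, `output_dir`) from the Quick Start; pass the same mel/vocoder settings as in that example if your wrapper version requires them:

```python
import os
from f5tts_wrapper import F5TTSWrapper

# Sketch: render the same sentence with each released checkpoint for an A/B listen.
checkpoints = [
    "models/model_42000.safetensors",
    "models/model_45000.safetensors",
    "models/model_48000.safetensors",
    "models/overfit.safetensors",
]

for ckpt in checkpoints:
    tts = F5TTSWrapper(
        model_name="F5TTS_v1_Base",
        vocoder_name="vocos",
        ckpt_path=ckpt,
        vocab_file=vocab_file,
        use_ema=True,
    )
    tts.preprocess_reference(ref_audio_path=ref_audio_path, ref_text=ref_text_norm, clip_short=True)

    name = os.path.splitext(os.path.basename(ckpt))[0]
    tts.generate(text=text_norm, output_path=os.path.join(output_dir, f"compare_{name}.wav"))
```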
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Model Type | EraX-Smile-UnixSex-F5, a text-to-speech model based on F5-TTS |
| Training Data | Over 2,700,000 Vietnamese-only samples, including public data and a 1,000-hour private dataset |
Usage Notes
- For the full web interface and Gradio controls, please clone and use the original [F5-TTS GitHub repository](https://github.com/SWivid/F5-TTS).
- We use the library from the [Vinorm team](https://github.com/v-nhandt21/Vinorm) for Vietnamese text normalization; see the short example below.
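As a quick illustration of what normalization does: numbers, dates, and abbreviations get expanded into spoken-form Vietnamese words before synthesis. A tiny example with a made-up input sentence (the exact output string depends on your vinorm version):

```python
from vinorm import TTSnorm

# Expands digits, dates, and abbreviations into spoken-form Vietnamese;
# the exact output depends on the installed vinorm version.
print(TTSnorm("Cuộc họp bắt đầu lúc 9h30 ngày 15/3/2025."))
```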
Future Plans
- [X] ⭐ Release checkpoints for a Vietnamese male voice
- [ ] 📝 Code for real-time TTS streaming
- [ ] 🔥 Release a Piper-based model that can run on ...iPhone, Android, Raspberry Pi 4, or in the browser 🔥
⚠️ Important Note on Responsible Use
Voice cloning technology is powerful and comes with significant ethical responsibilities. This model is intended for creative purposes, accessibility tools, personal projects, and applications where consent is explicit and ethical considerations are prioritized. We strongly condemn and strictly prohibit the use of this model for any malicious or unethical purposes, including but not limited to creating non-consensual deepfakes, generating misinformation, harassment, or any form of criminal activity. By using this model, you agree to use it responsibly and ethically, and you are solely responsible for the content you generate.
License
We're using the MIT License for our code, following in the footsteps of giants like Whisper. However, the base F5-TTS model was pretrained on the Emilia dataset, which is released under the CC BY-NC 4.0 license (non-commercial).
Citation
If you find this model helpful, a star ⭐ on our GitHub repo would be appreciated. And if you're writing a research paper, you can use the following BibTeX snippet:
```bibtex
@misc{EraXSmileF5_2025,
  author       = {Nguyễn Anh Nguyên (nguyen@erax.ai) and The EraX Team},
  title        = {EraX-Smile-UnixSex-F5: Người Việt sành tiếng Việt.},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://github.com/EraX-AI/viF5TTS}}
}
```


