🚀 EraX-Smile-UnixSex-F5: Giving F5-TTS a Unisex Vietnamese Twist (with Online Zero-Shot Voice Cloning!) ✨
This model is based on the excellent F5-TTS architecture (arXiv:2410.06885). To capture the nuances of Vietnamese, we fine-tuned it on a large dataset of over 2,700,000 Vietnamese-only samples, combining public data with a 1,000-hour private dataset (we're very grateful for the usage rights! 🙏).
The code is fully open-source at https://github.com/EraX-AI/viF5TTS/tree/main/src
🚀 Quick Start
Installation
```bash
# Ubuntu: sudo apt install ffmpeg
# Windows: see https://www.geeksforgeeks.org/how-to-install-ffmpeg-on-windows/
# Then clone our GitHub code (link above)
pip install numpy==1.26
pip install matplotlib
pip install vinorm
pip install f5-tts
pip install librosa
```
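If you prefer fetching the checkpoints programmatically, the `huggingface_hub` library (`pip install huggingface_hub` if needed) can download individual files. A minimal sketch, assuming the repo id `erax-ai/EraX-Smile-UnixSex-F5` and the `models/` file layout shown later in this card; adjust both if the actual repo differs:

```python
# Sketch: download a checkpoint and the vocab file from the Hugging Face Hub.
# The repo id and file paths are assumptions -- adjust them to the actual repo layout.
from huggingface_hub import hf_hub_download

eraX_ckpt_path = hf_hub_download(
    repo_id="erax-ai/EraX-Smile-UnixSex-F5",    # assumed repo id
    filename="models/model_42000.safetensors",  # one of the 4 released checkpoints
)
vocab_file = hf_hub_download(
    repo_id="erax-ai/EraX-Smile-UnixSex-F5",
    filename="models/vocab.txt",
)
print(eraX_ckpt_path, vocab_file)
```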
Usage Examples
Basic Usage
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Tell it which GPU to use (or ignore if you're CPU-bound and patient!)

from vinorm import TTSnorm  # Gotta normalize that Vietnamese text first
from f5tts_wrapper import F5TTSWrapper  # Our handy wrapper class

# --- Config ---
# Path to the model checkpoint you downloaded from *this* repo
# MAKE SURE this path points to the actual .pth, .ckpt, or .safetensors file!
eraX_ckpt_path = "path/to/your/downloaded/EraX-Smile-UnixSex-F5/models/model_42000.safetensors"  # <-- CHANGE THIS!

# Path to the voice you want to clone
ref_audio_path = "path/to/your/reference_voice.wav"  # <-- CHANGE THIS!

# Path to the vocab file from this repo
vocab_file = "path/to/your/downloaded/EraX-Smile-UnixSex-F5/models/vocab.txt"  # <-- CHANGE THIS!

# Where to save the generated sound
output_dir = "output_audio"

# --- Texts ---
# Text matching the reference audio (helps the model learn the voice).
# Please make sure it matches the reference audio!
ref_text = "Thậm chí không ăn thì cũng có cảm giác rất là cứng bụng, chủ yếu là cái phần rốn...trở lên. Em có cảm giác khó thở, và ngủ cũng không ngon, thường bị ợ hơi rất là nhiều"

# The text you want the cloned voice to speak
text_to_generate = "Trong khi đó, tại một chung cư trên địa bàn P.Vĩnh Tuy (Q.Hoàng Mai), nhiều người sống trên tầng cao giật mình khi thấy rung lắc mạnh nên đã chạy xuống sảnh tầng 1. Cư dân tại đây cho biết, họ chưa bao giờ cảm thấy ảnh hưởng của động đất mạnh như hôm nay."

# --- Let's Go! ---
print("Initializing the TTS engine... (Might take a sec)")
tts = F5TTSWrapper(
    model_name="F5TTS_v1_Base",
    vocoder_name="vocos",
    ckpt_path=eraX_ckpt_path,
    vocab_file=vocab_file,
    use_ema=True,
    target_sample_rate=24000,
    n_mel_channels=100,
    hop_length=256,
    win_length=1024,
    n_fft=1024,
    ode_method="euler",
)

# Normalize the reference text (makes it easier for the model)
ref_text_norm = TTSnorm(ref_text)

# Prepare the output folder
os.makedirs(output_dir, exist_ok=True)

print("Processing the reference voice...")
# Feed the model the reference voice ONCE.
# Provide ref_text for better quality, or set ref_text="" to use Whisper for auto-transcription (if installed)
tts.preprocess_reference(
    ref_audio_path=ref_audio_path,
    ref_text=ref_text_norm,
    clip_short=True  # Keeps reference audio to a manageable length (~12s)
)
print(f"Reference audio duration used: {tts.get_current_audio_length():.2f} seconds")

# --- Generate New Speech ---
print("Generating new speech with the cloned voice...")

# Normalize the text we want to speak
text_norm = TTSnorm(text_to_generate)

# You can generate multiple sentences easily:
# just add more normalized strings to this list.
sentences = [text_norm]

for i, sentence in enumerate(sentences):
    output_path = os.path.join(output_dir, f"generated_speech_{i+1}.wav")

    # THE ACTUAL GENERATION HAPPENS HERE!
    tts.generate(
        text=sentence,
        output_path=output_path,
        nfe_step=32,               # Denoising steps. More = slower but potentially better (default: 32)
        cfg_strength=3.0,          # How strongly to stick to the reference voice style (default: 2.0)
        speed=1.0,                 # Make it talk faster or slower (default: 1.0)
        cross_fade_duration=0.12,  # Smooths transitions if text is split into chunks (default: 0.15)
        sway_sampling_coef=-1,     # Sway sampling coefficient for the ODE solver
    )
    print(f"Boom! Audio saved to: {output_path}")

print("\nAll done! Check your output folder.")
```
✨ Features
- Based on F5-TTS: Built upon the F5-TTS architecture (arXiv:2410.06885), leveraging its zero-shot voice cloning capabilities.
- Vietnamese Fine-Tuning: Fine-tuned on over 2,700,000 Vietnamese-only samples, including public and private data.
- Voice Cloning: Supports zero-shot voice cloning for both male and female voices.
- Multiple Model Checkpoints: The repo contains 4 checkpoints (model_42000.safetensors, model_45000.safetensors, model_48000.safetensors, overfit.safetensors) for you to try; see the comparison sketch below.
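To pick a favorite, a quick approach is to synthesize the same sentence with each checkpoint and compare by ear. A minimal sketch, reusing `F5TTSWrapper` and the variables (`vocab_file`, `ref_audio_path`, `ref_text_norm`, `text_norm`, `output_dir`) from the Quick Start; pass the same mel/vocoder settings as in that example if your wrapper version requires them:

```python
import os
from f5tts_wrapper import F5TTSWrapper

# Sketch: render the same sentence with each released checkpoint for an A/B listen.
checkpoints = [
    "models/model_42000.safetensors",
    "models/model_45000.safetensors",
    "models/model_48000.safetensors",
    "models/overfit.safetensors",
]

for ckpt in checkpoints:
    tts = F5TTSWrapper(
        model_name="F5TTS_v1_Base",
        vocoder_name="vocos",
        ckpt_path=ckpt,
        vocab_file=vocab_file,
        use_ema=True,
    )
    tts.preprocess_reference(ref_audio_path=ref_audio_path, ref_text=ref_text_norm, clip_short=True)

    name = os.path.splitext(os.path.basename(ckpt))[0]
    tts.generate(text=text_norm, output_path=os.path.join(output_dir, f"compare_{name}.wav"))
```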
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Model Type | EraX-Smile-UnixSex-F5, a text-to-speech model based on F5-TTS |
| Training Data | Over 2,700,000 Vietnamese-only samples, including public data and a 1,000-hour private dataset |
Usage Notes
- For the full web interface and Gradio controls, please clone and use the original [F5-TTS GitHub repository](https://github.com/SWivid/F5-TTS).
- We use the library from the [Vinorm team](https://github.com/v-nhandt21/Vinorm) for Vietnamese text normalization; see the short example below.
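As a quick illustration of what normalization does: numbers, dates, and abbreviations get expanded into spoken-form Vietnamese words before synthesis. A tiny example with a made-up input sentence (the exact output string depends on your vinorm version):

```python
from vinorm import TTSnorm

# Expands digits, dates, and abbreviations into spoken-form Vietnamese;
# the exact output depends on the installed vinorm version.
print(TTSnorm("Cuộc họp bắt đầu lúc 9h30 ngày 15/3/2025."))
```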
Future Plans
- [X] ⭐ Release checkpoints for a Vietnamese male voice
- [ ] 📝 Code for real-time TTS streaming
- [ ] 🔥 Release a Piper-based model that can run on ...iPhone, Android, Raspberry Pi 4, or in the browser 🔥
⚠️ Important Note on Responsible Use
Voice cloning technology is powerful and comes with significant ethical responsibilities. This model is intended for creative purposes, accessibility tools, personal projects, and applications where consent is explicit and ethical considerations are prioritized. We strongly condemn and strictly prohibit the use of this model for any malicious or unethical purposes, including but not limited to creating non-consensual deepfakes, generating misinformation, harassment, or any form of criminal activity. By using this model, you agree to use it responsibly and ethically, and you are solely responsible for the content you generate.
License
We're using the MIT License for our code, following in the footsteps of giants like Whisper. However, the base F5-TTS model was pretrained on the Emilia dataset, which is released under the CC BY-NC 4.0 license (non-commercial).
Citation
If you find this model helpful, a star ⭐ on our GitHub repo would be appreciated. And if you're writing a research paper, you can use the following BibTeX snippet:
```bibtex
@misc{EraXSmileF5_2025,
  author       = {Nguyễn Anh Nguyên (nguyen@erax.ai) and The EraX Team},
  title        = {EraX-Smile-UnixSex-F5: Người Việt sành tiếng Việt.},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://github.com/EraX-AI/viF5TTS}}
}
```


