F5-TTS-Vietnamese-100h Open-source Model - Supports Vietnamese Speech Synthesis, for Research Only!

F5 TTS Vietnamese 100h

Developed by hynt

A compact model fine-tuned from F5-TTS on 150 hours of Vietnamese speech data, for research purposes only.

Speech Synthesis

PyTorch

Other

#Vietnamese speech synthesis #150-hour fine-tuning #For academic research only

Downloads 123

Release Time : 3/23/2025

Model Overview

This is a text-to-speech (TTS) model for Vietnamese, fine-tuned from the F5-TTS base model and suited to Vietnamese speech synthesis tasks.

Model Features

High-quality Vietnamese speech synthesis

Trained with 150 hours of carefully selected Vietnamese speech data, providing high-quality speech synthesis results.

Strict data processing

Background music was removed with demucs, and audio clips shorter than 1 second or longer than 30 seconds were excluded to ensure data quality.

Academic collaboration datasets

Includes the VLSP 2021, 2022, and 2023 datasets and 50 hours of high-quality labeled data contributed by Teacher Định of the University of Economics Ho Chi Minh City (UEH).

Model Capabilities

Vietnamese text-to-speech

Speech synthesis

Voice cloning (via reference audio)

Use Cases

Academic research

Vietnamese speech synthesis research

Used for research and experiments in speech synthesis technology.

Educational applications

Vietnamese learning assistance

Provides pronunciation references for Vietnamese learners.

🚀 F5-TTS-Vietnamese-150h

A compact fine-tuned version of F5-TTS trained on 150 hours of Vietnamese speech, intended for research purposes only.

🚀 Quick Start

🛑 Important Note ⚠️

This model is only intended for research purposes. Access requests must be made using an institutional, academic, or corporate email. Requests from public email providers will be denied. We appreciate your understanding.

📚 Documentation

📌 Model Details

Property	Details
Model Type	A fine-tuned version of F5-TTS
Training Data	VLSP 2021, VLSP 2022, VLSP 2023, VietTTS, TeacherDinh-UEH, and some speech sources from YouTube channels
Total Dataset Duration	150 hours
Data Processing Technique	1. Remove all background music from the audio using the Facebook demucs model: https://github.com/facebookresearch/demucs 2. Exclude audio files shorter than 1 second or longer than 30 seconds. 3. Keep the default punctuation marks unchanged. 4. Normalize text to lowercase.
Training Configuration	Base Model: F5-TTS_Base; GPU: RTX 3090; Batch Size: 3200 frames
Training Progress	Stopped at 500,000 steps
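
The duration filter and text normalization listed under the data processing steps can be sketched in Python as below. This is only an illustration under stated assumptions, not the author's actual preprocessing script: the helper names (`wav_duration_seconds`, `keep_clip`, `normalize_transcript`) are hypothetical, the standard-library `wave` module only reads uncompressed PCM WAV files, and the thresholds are treated as inclusive. Background-music removal with demucs is a separate step and is not shown.

```python
import wave

# Thresholds taken from the data processing description above.
MIN_SECONDS = 1.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def keep_clip(duration_seconds):
    """Keep clips that are neither shorter than 1 s nor longer than 30 s."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS


def normalize_transcript(text):
    """Lowercase the transcript and leave punctuation marks untouched.

    Stripping surrounding whitespace is an extra assumption, not a step
    stated in the model card.
    """
    return text.strip().lower()


# Example: Vietnamese diacritics survive lowercasing, punctuation is kept.
example_text = normalize_transcript("Xin Chào, Việt Nam!")
```

Python's `str.lower()` is Unicode-aware, so capital letters with Vietnamese diacritics (for example `Đ`) are lowercased correctly without any extra mapping table.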

🛑 Update Note

Thank you to Teacher Định from the University of Economics Ho Chi Minh City (UEH) for providing me with an additional 50 hours of high-quality labeled data. His contact: https://www.facebook.com/luudinhit93

💻 Usage Examples

Basic Usage

To load and use the model, follow the example below:

git clone https://github.com/nguyenthienhy/F5-TTS-Vietnamese
cd F5-TTS-Vietnamese
python -m pip install -e .
f5-tts_infer-cli \
--model "F5TTS_Base" \
--ref_audio ref.wav \
--ref_text "cả hai bên hãy cố gắng hiểu cho nhau" \
--gen_text "mình muốn ra nước ngoài để tiếp xúc nhiều công ty lớn, sau đó mang những gì học được về việt nam giúp xây dựng các công trình tốt hơn" \
--speed 1.0 \
--vocoder_name vocos \
--vocab_file data/your_training_dataset/vocab.txt \
--ckpt_file ckpts/your_training_dataset/model_500000.pt
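
When the same CLI call needs to be issued from a script, for example to synthesize several sentences, one option is to assemble the argument list in Python and hand it to `subprocess`. The sketch below assumes the `f5-tts_infer-cli` command from the steps above is installed and on `PATH`; the helper names `build_infer_command` and `synthesize` are hypothetical, and only the flags shown in the command above are used. The vocab and checkpoint paths are the same placeholders as in the original example.

```python
import shlex
import subprocess


def build_infer_command(ref_audio, ref_text, gen_text, vocab_file, ckpt_file,
                        model="F5TTS_Base", speed=1.0, vocoder_name="vocos"):
    """Build the argument list for f5-tts_infer-cli (flags as shown above)."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", str(ref_audio),
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--speed", str(speed),
        "--vocoder_name", vocoder_name,
        "--vocab_file", str(vocab_file),
        "--ckpt_file", str(ckpt_file),
    ]


def synthesize(**kwargs):
    """Run the CLI; raises CalledProcessError if the command fails."""
    subprocess.run(build_infer_command(**kwargs), check=True)


# Build (but do not run) the command, then print it for inspection.
command = build_infer_command(
    ref_audio="ref.wav",
    ref_text="cả hai bên hãy cố gắng hiểu cho nhau",
    gen_text="mình muốn ra nước ngoài để tiếp xúc nhiều công ty lớn",
    vocab_file="data/your_training_dataset/vocab.txt",
    ckpt_file="ckpts/your_training_dataset/model_500000.pt",
)
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string means the Vietnamese reference and generation texts need no manual quoting or escaping.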

📄 License

This model is released under the [CC-BY-NC-SA-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0) license, which allows non-commercial research use only.

🔗 For more fine-tuning experiments, visit: https://github.com/nguyenthienhy/F5-TTS-Vietnamese.
