# 🚀 F5-TTS Italian Finetune
This is an Italian finetune for F5-TTS, focusing solely on the Italian language.
| Property | Details |
|---|---|
| Datasets | ylacombe/cml-tts |
| Language | it |
| Base Model | SWivid/F5-TTS |
| Pipeline Tag | text-to-speech |
| License | cc-by-4.0 |
| Library Name | f5-tts |
## 🚀 Quick Start
This project is an Italian-only finetune of F5-TTS: it supports Italian exclusively and can no longer speak English properly.
## ✨ Features
- Trained on 73+ hours of the "train" split of the ylacombe/cml-tts dataset using 8× RTX 4090 GPUs. Training is still in progress.
- Finetuned with the Gradio finetuning app using the following settings:

  ```
  exp_name="F5TTS_Base"
  learning_rate=0.00001
  batch_size_per_gpu=10000
  batch_size_type="frame"
  max_samples=64
  grad_accumulation_steps=1
  max_grad_norm=1
  epochs=300
  num_warmup_updates=2000
  save_per_updates=600
  last_per_steps=300
  finetune=true
  file_checkpoint_train=""
  tokenizer_type="char"
  tokenizer_file=""
  mixed_precision="fp16"
  logger="wandb"
  bnb_optimizer=false
  ```
## 📚 Documentation
### Preprocessing
The transcriptions extracted from the data source were preprocessed. Punctuation was preserved, as it is important for teaching pauses and proper intonation. The original Italian "text" field contained dialogue dashes (long hyphens marking direct speech), which were also preserved. However, hyphens used to split a word across a line break were removed so that the two parts of the word were merged back together, as training on such artifacts did not affect the resulting speech. This applies only to the Italian data in the cml-tts dataset; whether other languages are affected is unknown.
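The hyphen-merge step described above can be sketched as follows. This is a minimal illustration only: the actual preprocessing script is not published, and the exact regex is an assumption.

```python
import re

def merge_linebreak_hyphens(text: str) -> str:
    """Rejoin words split across a line break by a hyphen,
    e.g. 'paro-\\nla' -> 'parola' (illustrative sketch, not the
    project's actual preprocessing code).

    Dialogue dashes (em dashes, or hyphens not glued between two
    word characters) are left untouched.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
```

Only a hyphen directly between two word characters across a newline is merged, so direct-speech dashes in the transcripts survive unchanged.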
### Current most-trained model
The most trained checkpoint is model_159600.safetensors (~290 epochs).
### Known problems
- Catastrophic forgetting: since the model was finetuned on Italian only, it has lost its English capabilities. A proper multilanguage dataset should be used instead of a single-language one.
- Imperfect pronunciation: the pronunciation is not yet perfect.
- Number conversion: numbers must be spelled out as words to be pronounced in Italian.
- Dataset improvement: a better dataset with more diverse voices would improve zero-shot cloning.
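As a workaround for the number-conversion issue, digits can be spelled out before synthesis. Below is a minimal hand-rolled sketch for 0–99; it is illustrative only and not part of this project (a library such as num2words covers the general case).

```python
import re

# Italian number words, hypothetical helper (not the project's code)
UNITS = ["zero", "uno", "due", "tre", "quattro", "cinque", "sei", "sette",
         "otto", "nove", "dieci", "undici", "dodici", "tredici",
         "quattordici", "quindici", "sedici", "diciassette", "diciotto",
         "diciannove"]
TENS = {2: "venti", 3: "trenta", 4: "quaranta", 5: "cinquanta",
        6: "sessanta", 7: "settanta", 8: "ottanta", 9: "novanta"}

def number_to_italian(n: int) -> str:
    """Spell out an integer in 0..99 as an Italian word."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    word = TENS[tens]
    if unit == 0:
        return word
    if unit in (1, 8):       # drop the final vowel before uno/otto
        word = word[:-1]
    suffix = UNITS[unit]
    if unit == 3:            # "tre" gains an accent as a suffix
        suffix = "tré"
    return word + suffix

def spell_numbers(text: str) -> str:
    """Replace 1-2 digit runs in a transcript with Italian words."""
    return re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_italian(int(m.group())), text)
```

For example, `spell_numbers("Ho 42 anni")` yields "Ho quarantadue anni", which the model can then pronounce.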
### Checkpoints folder
The checkpoints folder contains the model weights saved at specific training steps. The higher the step number, the further the model has progressed in training. These weights can also be used as a starting point to continue training. If you manage to finetune the model to better results, please let me know.
## 📄 License
This project is licensed under the cc-by-4.0 license.