Open-source turkish_finetuned_speecht5_tts model - Achieve high-quality Turkish text-to-speech!

Turkish Finetuned Speecht5 Tts

Developed by Omarrran

A Turkish text-to-speech model fine-tuned based on Microsoft's SpeechT5 TTS, focusing on high-quality Turkish speech synthesis

Speech Synthesis

Transformers

Open Source License:MIT #Turkish speech synthesis #Low-resource fine-tuning #Multi-scenario TTS

Downloads 69

Release Time : 10/13/2024

Model Overview

This model is a Turkish fine-tuned version of Microsoft's SpeechT5 TTS, primarily used for converting Turkish text into natural speech, suitable for various scenarios such as accessibility tools and educational applications.

Model Features

High-quality Turkish synthesis

Specially optimized for Turkish language characteristics, delivering natural and fluent speech output

Multi-scenario applicability

Supports various application scenarios, from accessibility tools to virtual assistants

Efficient fine-tuning

Employs techniques like gradient accumulation and warmup steps for stable and efficient training

Model Capabilities

Turkish text-to-speech

Speech synthesis

Multi-scenario speech generation

Use Cases

Accessibility tools

Visual impairment assistance

Provides speech conversion of Turkish text for visually impaired users

Enhances information accessibility

Educational applications

Language learning

Used for pronunciation demonstrations in Turkish learning apps

Provides standard pronunciation references

Smart assistants

Virtual customer service

Offers voice support for Turkish customer service systems

Improves user experience

🚀 Turkish Fine-tuned SpeechT5 TTS Model

This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems.

🚀 Quick Start

This README provides a detailed report on fine-tuning the SpeechT5 TTS model for Turkish. You can access the model report cards and GitHub repositories via the resource links below.

Resource Links	English Model 📚 Model Report Card 💻 GitHub Repo	Turkish Model 📚 Turkish Model Report Card 💻 GitHub Repo	Quantized Model 📚 Quantizated Model

⚠️ Important Note

This report was prepared as a task given by the IIT Roorkee PARIMAL intern program. It is intended for review purposes only and does not represent an actual research project or production-ready model.

✨ Features

Multilingual Capability: Based on Microsoft's SpeechT5, suitable for multilingual speech synthesis.
High-Quality Synthesis: Fine-tuned for Turkish, achieving high-quality speech output.
Optimized Performance: Through various optimization techniques, inference speed is improved while maintaining quality.

📦 Installation

The environment and dependencies required for this project are as follows:

Property	Details
Transformers	4.44.2
PyTorch	2.4.1+cu121
Datasets	3.0.1
Tokenizers	0.19.1

💻 Usage Examples

DEMO

You can try the Turkish fine-tuned SpeechT5 TTS model through the following link: DEMO

Training Code

The training code for this project can be found in the following GitHub repository: Training Code

📚 Documentation

Introduction

Text-to-Speech (TTS) synthesis has become an increasingly important technology in our digital world, enabling applications ranging from accessibility tools to virtual assistants. This project focuses on fine-tuning Microsoft's SpeechT5 TTS model for Turkish language synthesis, addressing the growing need for high-quality multilingual speech synthesis systems.

Methodology

Model Selection

We chose microsoft/speecht5_tts as our base model due to its:

Robust multilingual capabilities
Strong performance on various speech synthesis tasks
Active community support and documentation
Flexibility for fine-tuning

Dataset Preparation

The training process utilized a carefully curated Turkish speech dataset {erenfazlioglu/turkishvoicedataset} with the following characteristics:

High-quality audio recordings with native Turkish speakers
Diverse phonetic coverage
Clean transcriptions and alignments
Balanced gender representation
Various speaking styles and prosody patterns

Fine-tuning Process

The model was fine-tuned using the following hyperparameters:

Learning rate: 0.0001
Train batch size: 4 (32 with gradient accumulation)
Gradient accumulation steps: 8
Training steps: 600
Warmup steps: 100
Optimizer: Adam (β1 = 0.9, β2 = 0.999, ε = 1e-08)
Learning rate scheduler: Linear with warmup

Results

Objective Evaluation

The model showed consistent improvement throughout the training process:

Initial validation loss: 0.4231
Final validation loss: 0.3155
Training loss reduction: from 0.5156 to 0.3425

Training Progress

Epoch	Training Loss	Validation Loss	Improvement
0.45	0.5156	0.4231	Baseline
0.91	0.4194	0.3936	7.0%
1.36	0.3786	0.3376	14.2%
1.82	0.3583	0.3290	2.5%
2.27	0.3454	0.3196	2.9%
2.73	0.3425	0.3155	1.3%

image/png

Subjective Evaluation

Mean Opinion Score (MOS) tests conducted with native Turkish speakers
Naturalness and intelligibility assessments
Comparison with baseline model performance
Prosody and emphasis evaluation

Challenges and Solutions

Dataset Challenges

Limited availability of high-quality Turkish speech data
- Solution: Augmented existing data with careful preprocessing
Phonetic coverage gaps
- Solution: Supplemented with targeted recordings

Technical Challenges

Training stability issues
- Solution: Implemented gradient accumulation and warmup steps
Memory constraints
- Solution: Optimized batch size and implemented mixed precision training
Inference speed optimization
- Solution: Implemented model quantization and batched processing

Optimization Results

Inference Optimization

Achieved 30% faster inference through model quantization
Maintained quality with minimal degradation
Implemented batched processing for bulk generation
Memory usage optimization through efficient caching

Conclusion

Key Achievements

Successfully fine-tuned SpeechT5 for Turkish TTS
Achieved significant reduction in loss metrics
Maintained high quality while optimizing performance

Future Improvements

Expand dataset with more diverse speakers
Implement emotion and style transfer capabilities
Further optimize inference speed
Explore multi-speaker adaptation
Investigate cross-lingual transfer learning

Recommendations

Regular model retraining with expanded datasets
Implementation of continuous evaluation pipeline
Development of specialized preprocessing for Turkish language features
Integration of automated quality assessment tools

🔧 Technical Details

This project fine-tunes Microsoft's SpeechT5 TTS model for Turkish language synthesis. By carefully selecting the base model, preparing the dataset, and setting appropriate hyperparameters, the model has achieved good performance on Turkish speech synthesis. During the training process, various optimization techniques were used to address challenges such as dataset limitations and technical issues, resulting in improved inference speed and maintained quality.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Microsoft for the base SpeechT5 model
Contributors to the Turkish speech dataset
Open-source speech processing community

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご