🚀 speecht5_tts-finetuned-nst-da
This model is a fine-tuned version of microsoft/speecht5_tts on the NST Danish ASR Database dataset. It provides Danish text-to-speech synthesis with reasonable output quality and inference time.
🚀 Quick Start
This model is designed for Danish text-to-speech synthesis. An example script showing how to use the model for inference can be found here.
✨ Features
- Alternative for Danish TTS: Given that Danish is a low-resource language, there are few open-source Danish text-to-speech synthesizers. This model provides a simpler alternative that performs reasonably well in both output quality and inference time.
- Easy-to-use Interface: It has an associated Space on 🤗 at JackismyShephard/danish-speech-synthesis for easy Danish text-to-speech synthesis and optional speech enhancement.
📦 Installation
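The original model card lists no installation steps. The examples below assume a standard Python environment with the 🤗 Transformers stack and a few audio helpers installed, e.g. via `pip install transformers torch soundfile speechbrain` (pinned versions are listed under Framework Versions below).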
💻 Usage Examples
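A minimal inference sketch, assuming the model ID JackismyShephard/speecht5_tts-finetuned-nst-da (inferred from the model name and the author's Space) and a random placeholder speaker embedding; see the Documentation section below for how to generate a proper Danish x-vector:

```python
import torch
import soundfile as sf
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Assumed model ID, inferred from the model name and the author's Space.
checkpoint = "JackismyShephard/speecht5_tts-finetuned-nst-da"

processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# The tokenizer has no "æ", "ø" or "å", so transliterate input text first
# (see Limitations below).
text = "Det er dejligt vejr i dag."
inputs = processor(text=text, return_tensors="pt")

# Placeholder 512-dim x-vector; for sensible output, use a real Danish
# speaker embedding generated with speechbrain/spkrec-xvect-voxceleb.
speaker_embeddings = torch.randn(1, 512)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("output.wav", speech.numpy(), samplerate=16000)  # SpeechT5 outputs 16 kHz audio
```

For higher-quality output, the resulting waveform can additionally be enhanced with ResembleAI/resemble-enhance, as noted under Limitations.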
📚 Documentation
Model Description
Given that Danish is a low-resource language, not many open-source implementations of a Danish text-to-speech synthesizer are available online. As of writing, the only other existing implementations available on 🤗 are facebook/seamless-streaming and audo/seamless-m4t-v2-large. This model was developed to provide a simpler alternative that still performs reasonably well, both in terms of output quality and inference time. Additionally, unlike the aforementioned models, this model has an associated Space on 🤗 at JackismyShephard/danish-speech-synthesis, which provides an easy interface for Danish text-to-speech synthesis as well as optional speech enhancement.
Intended Uses & Limitations
- Intended Use: The model is intended for Danish text-to-speech synthesis.
- Limitations:
- The model does not recognize special characters such as "æ", "ø" and "å", as it uses the default tokenizer of microsoft/speecht5_tts; such characters should be transliterated before synthesis (see the sketch under Training and Evaluation Data below).
- The model performs best for short-to-medium-length input text and expects input text to contain no more than 600 vocabulary tokens.
- For best performance, the model should be given a Danish speaker embedding, ideally generated from an audio clip from the training split of alexandrainst/nst-da using speechbrain/spkrec-xvect-voxceleb (see the sketch after this list).
- The output of the model is a log-mel spectrogram, which should be converted to a waveform using microsoft/speecht5_hifigan. For better quality output, the resulting waveform can be enhanced using ResembleAI/resemble-enhance.
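A minimal sketch of generating such a speaker embedding with SpeechBrain, following the recipe from the 🤗 SpeechT5 fine-tuning examples (the clip path is a placeholder):

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the x-vector speaker encoder named above.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

# Placeholder path; ideally a 16 kHz clip from the training split of
# alexandrainst/nst-da.
signal, sample_rate = torchaudio.load("danish_clip.wav")

with torch.no_grad():
    embeddings = classifier.encode_batch(signal)
    # L2-normalize and reshape to (1, 512), as expected by generate_speech.
    speaker_embeddings = torch.nn.functional.normalize(embeddings, dim=2).squeeze(1)
```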
Training and Evaluation Data
The model was trained and evaluated on alexandrainst/nst-da, using mean squared error (MSE) as both the loss and the evaluation metric. The dataset was pre-processed as follows:
- Special characters such as "æ", "ø" and "å" were transliterated to their Latin equivalents, and examples whose text contained digits were removed, as neither is in the vocabulary of the tokenizer of microsoft/speecht5_tts (see the sketch after this list).
- The training split was balanced by excluding speakers with fewer than 280 or more than 327 examples.
- Audio was enhanced using speechbrain/metricgan-plus-voicebank in an attempt to remove unwanted noise.
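A minimal sketch of the transliteration step, assuming the conventional Danish mappings; the exact mapping used by the training script may differ:

```python
# Conventional Danish transliterations; an assumption, since the exact
# mapping used during pre-processing is not spelled out here.
REPLACEMENTS = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}

def transliterate(text: str) -> str:
    """Replace Danish special characters with their Latin equivalents."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text

print(transliterate("Åh, hvilken skøn dag"))  # -> "Aah, hvilken skoen dag"
```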
Training Procedure
The script used for training the model (and pre-processing its data) can be found here.
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 20
- mixed_precision_training: Native AMP
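For reference, a minimal sketch of how these values map onto 🤗 Seq2SeqTrainingArguments; the output_dir is an assumption, and the Adam betas and epsilon above are the library defaults:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-finetuned-nst-da",  # assumed
    learning_rate=1e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=20,
    fp16=True,  # "Native AMP" mixed-precision training
)
```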
Training Results
| Training Loss | Epoch | Step   | Validation Loss |
|:-------------:|:-----:|:------:|:---------------:|
| 0.4445        | 1.0   | 9429   | 0.4100          |
| 0.4169        | 2.0   | 18858  | 0.3955          |
| 0.412         | 3.0   | 28287  | 0.3882          |
| 0.3982        | 4.0   | 37716  | 0.3826          |
| 0.4032        | 5.0   | 47145  | 0.3817          |
| 0.3951        | 6.0   | 56574  | 0.3782          |
| 0.3971        | 7.0   | 66003  | 0.3782          |
| 0.395         | 8.0   | 75432  | 0.3757          |
| 0.3952        | 9.0   | 84861  | 0.3749          |
| 0.3835        | 10.0  | 94290  | 0.3740          |
| 0.3863        | 11.0  | 103719 | 0.3754          |
| 0.3845        | 12.0  | 113148 | 0.3732          |
| 0.3788        | 13.0  | 122577 | 0.3715          |
| 0.3834        | 14.0  | 132006 | 0.3717          |
| 0.3894        | 15.0  | 141435 | 0.3718          |
| 0.3845        | 16.0  | 150864 | 0.3714          |
| 0.3823        | 17.0  | 160293 | 0.3692          |
| 0.3858        | 18.0  | 169722 | 0.3703          |
| 0.3919        | 19.0  | 179151 | 0.3716          |
| 0.3906        | 20.0  | 188580 | 0.3709          |
Framework Versions
- Transformers 4.37.2
- Pytorch 2.1.1+cu121
- Datasets 2.17.0
- Tokenizers 0.15.2
📄 License
The model is released under the MIT license.