Open-source Swedish text-to-speech model speecht5_tts_common_voice_5_sv


Developed by GreenCounsel

A Swedish text-to-speech model, fine-tuned from Microsoft's SpeechT5 on the Common Voice dataset.

Speech Synthesis

Transformers

License: MIT (open source)
Tags: #Swedish TTS, #Multi-speaker support, #Speech synthesis

Downloads 27

Release date: 6/23/2023

Model Overview

This model converts Swedish text into speech and is intended for speech synthesis applications.

Model Features

High-quality speech synthesis

Combines the SpeechT5 architecture with a HiFi-GAN vocoder to generate Swedish speech.

Multi-speaker support

Synthesis is conditioned on an x-vector speaker embedding, so supplying a different embedding changes the voice.

Special character handling

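Swedish special characters (Ä, Å, Ö and their lowercase forms) are transliterated to ASCII before tokenization. The usage example performs this step in the calling code with a replacement table.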

Model Capabilities

Swedish text-to-speech

Multi-speaker speech synthesis

Automatic special character processing

Use Cases

Assistive technology

Screen reader

Provides speech output of Swedish content for visually impaired users

Content creation

Audio content generation

Automatically converts Swedish text into speech for podcasts or video dubbing

🚀 SpeechT5 TTS Swedish

This model is a fine-tuned version of microsoft/speecht5_tts on the Common Voice dataset, used for Swedish text-to-speech conversion.

🚀 Quick Start

This SpeechT5 model is trained on the Swedish portion of the Common Voice dataset. You can test it yourself at https://huggingface.co/spaces/GreenCounsel/SpeechT5-sv (hosted pipeline inference is not available for this model on Hugging Face).

💻 Usage Examples

Basic Usage

# Install the dependencies first:
#   pip install transformers datasets soundfile sentencepiece

import soundfile as sf
import torch
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor, set_seed

# The processor and vocoder come from the base Microsoft checkpoints;
# only the text-to-speech model is the Swedish fine-tune.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("GreenCounsel/speecht5_tts_common_voice_5_sv")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Transliterate Swedish characters to ASCII and strip hyphens and curly quotes
# before tokenization.
repl = [
    ('Ä', 'ae'),
    ('Å', 'o'),
    ('Ö', 'oe'),
    ('ä', 'ae'),
    ('å', 'o'),
    ('ö', 'oe'),
    ('ô', 'oe'),
    ('-', ''),
    ('‘', ''),
    ('’', ''),
    ('“', ''),
    ('”', ''),
]

# Load x-vector speaker embeddings; the chosen entry determines the voice.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7000]["xvector"]).unsqueeze(0)

# Fix the random seed so repeated runs produce the same audio.
set_seed(555)

text = "Förstår du vad han menar?"
for src, dst in repl:
    text = text.replace(src, dst)
inputs = processor(text=text, return_tensors="pt")

# Generate the waveform; the vocoder converts the predicted spectrogram to audio.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# SpeechT5 produces 16 kHz audio.
sf.write("output.wav", speech.numpy(), samplerate=16000)
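
The character replacement step in the example can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the model or of Transformers: the names `REPLACEMENTS` and `normalize_swedish` are hypothetical, and the table is copied from the example. It uses `str.translate` to apply all substitutions in one pass, which gives the same result as the sequential `replace` loop here because no replacement string contains a character that is itself replaced.

```python
# Replacement table copied from the usage example (hypothetical constant name).
REPLACEMENTS = {
    "Ä": "ae", "Å": "o", "Ö": "oe",
    "ä": "ae", "å": "o", "ö": "oe", "ô": "oe",
    "-": "", "‘": "", "’": "", "“": "", "”": "",
}

# str.maketrans accepts a dict that maps single characters to strings.
_TABLE = str.maketrans(REPLACEMENTS)


def normalize_swedish(text: str) -> str:
    """Transliterate Swedish characters and strip hyphens and curly quotes."""
    return text.translate(_TABLE)


print(normalize_swedish("Förstår du vad han menar?"))
# prints: Foerstor du vad han menar?
```

The normalized string can then be passed to `processor(text=..., return_tensors="pt")` exactly as in the example.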

📚 Documentation

This model achieves the following results on the evaluation set:

Loss: 0.4621

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 4000
mixed_precision_training: Native AMP
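
To make the relationship between these values concrete, the sketch below reproduces two pieces of arithmetic: the total batch size and the learning rate at a given step. It assumes a single training device (the card does not state the device count) and models the "linear" scheduler as a linear warmup from zero followed by a linear decay to zero at the final step, which is the usual meaning of that setting. The function names are illustrative only.

```python
BASE_LR = 1e-5          # learning_rate
WARMUP_STEPS = 500      # lr_scheduler_warmup_steps
TOTAL_STEPS = 4000      # training_steps
PER_DEVICE_BATCH = 16   # train_batch_size
GRAD_ACCUM = 2          # gradient_accumulation_steps
NUM_DEVICES = 1         # assumption: single device, not stated in the card


def effective_batch_size() -> int:
    """Examples consumed per optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES


def linear_lr(step: int) -> float:
    """Learning rate at an optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch_size())  # matches total_train_batch_size
for step in (0, 250, 500, 2250, 4000):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at step 500 and falls back to half its peak at step 2250.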

Training results

Training Loss	Epoch	Step	Validation Loss
0.5349	4.8	1000	0.4953
0.5053	9.59	2000	0.4714
0.5032	14.39	3000	0.4646
0.4958	19.18	4000	0.4621
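
The Epoch and Step columns also allow a rough estimate of the training set size. The sketch below is back-of-the-envelope arithmetic only: it assumes every optimizer step consumes the full total batch of 32 examples, and the epoch values in the table are rounded, so the result is approximate and is not a figure reported by the author. The names `ROWS`, `steps_per_epoch`, and `approx_training_examples` are illustrative.

```python
# (step, epoch) pairs copied from the training results table.
ROWS = [(1000, 4.8), (2000, 9.59), (3000, 14.39), (4000, 19.18)]
TOTAL_BATCH = 32  # total_train_batch_size from the hyperparameters


def steps_per_epoch(rows):
    """Average number of optimizer steps per epoch across the logged rows."""
    return sum(step / epoch for step, epoch in rows) / len(rows)


def approx_training_examples(rows, batch_size):
    """Rough training set size implied by steps per epoch and batch size."""
    return round(steps_per_epoch(rows) * batch_size)


print(steps_per_epoch(ROWS))
print(approx_training_examples(ROWS, TOTAL_BATCH))
```

The estimate comes out at roughly 208 steps per epoch, which implies a training set of a little under 7,000 examples.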

Framework versions

Transformers 4.30.0.dev0
Pytorch 2.0.1+cu118
Datasets 2.13.1
Tokenizers 0.13.3

📄 License

This project is under the MIT license.

📦 Information

Property	Details
Model Type	Fine-tuned SpeechT5 model for Swedish text-to-speech
Training Data	mozilla-foundation/common_voice_13_0
Pipeline Tag	text-to-speech
Inference	false
Tags	common_voice, generated_from_trainer
Language	sv
