speecht5_tts_hr Open-source Text-to-Speech Model - Free Deployment, Focused on Croatian Speech Conversion

Speecht5 Tts Hr

Developed by nikolab

A SpeechT5 text-to-speech fine-tuned model optimized for Croatian, trained on Microsoft's SpeechT5 architecture and the VoxPopuli dataset

Speech Synthesis

Transformers

OtherOpen Source License:MIT #Croatian TTS #Multimodal Speech Synthesis #Parliament Speech Optimization

Downloads 124

Release Time : 9/30/2024

Model Overview

A Transformer-based speech synthesis model supporting Croatian text-to-speech conversion, applicable in scenarios like voice assistants and audiobooks

Model Features

Multimodal Unified Architecture

Single architecture integrating text and speech processing capabilities, supporting cross-modal conversion

Croatian Optimization

Fine-tuned with 43 hours of Croatian VoxPopuli dataset, containing data from 83 speakers

Cross-language Extensibility

Architecture designed to support expansion to former Yugoslav languages (Montenegrin/Serbian/Bosnian)

Speaker Balance

Training data analyzed for speaker statistics to ensure balanced male-to-female speaker ratio

Model Capabilities

Croatian Text-to-Speech

Speech Synthesis

Cross-modal Representation Learning

Use Cases

Voice Assistants

Croatian Voice Interaction

Providing natural voice output for smart devices

Supports fluent speech generation for up to 20 words

Accessibility Technology

Text-to-Speech Service

Converting text content to speech for visually impaired users

🚀 SpeechT5 TTS for Croatian

This is a fine - tuned version of the SpeechT5 text - to - speech model customized for the Croatian language. It offers an effective solution for converting text into natural - sounding Croatian speech, leveraging the power of the pre - trained SpeechT5 architecture.

✨ Features

Multifunctional Model: SpeechT5 can handle speech - to - text, text - to - speech, and speech - to - speech conversions.
Wide Applicability: Suitable for various speech processing tasks such as automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
Fine - tuned for Croatian: Specifically adapted to the Croatian language, improving the quality of speech synthesis in this language.

📦 Installation

No installation steps were provided in the original document, so this section is skipped.

💻 Usage Examples

No code examples were provided in the original document, so this section is skipped.

📚 Documentation

Model

SpeechT5 is a fine - tuned model for speech synthesis (text - to - speech) on the LibriTTS dataset. It was created as an upgraded version of the successful T5 model (Text - To - Text Transfer Transformer), which was trained only for natural language processing. The model was originally presented by the Microsoft research group, in the scientific paper "SpeechT5: Unified - Modal Encoder - Decoder Pre - Training for Spoken Language Processing" (https://arxiv.org/abs/2110.07205).

The SpeechT5 model was also chosen due to the extensive evaluation, carried out in the mentioned scientific work, which showed very good results in a wide range of speech processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement and speaker identification.

SpeechT5 contains three types of speech models in one architecture. The model can be used to convert:

speech to text - for automatic speech recognition or speaker identification,
text to speech - for sound synthesis,
speech to speech - to convert between different voices or improve speech.

The SpeechT5 model consists of a common network of encoders and decoders, with an additional six neural networks that are specific to the particular modality of the data being processed (speech/text). The unique thing about the SpeechT5 model is that the model is first pre - trained on different speech - to - text and text - to - speech data modalities, so that it learns in the unified representation space of both text and speech. In this way, the model learns from text and speech at the same time. This allows us to fine - tune the pre - trained model for different tasks, such as text - to - speech, in ex - yu languages (Montenegrin, Serbian, Bosnian, Croatian).

Dataset

LibriTTS: LibriTTS (https://www.openslr.org/60/) is a multi - speaker English corpus of approximately 585 hours of English speech, prepared by Heiga Zen with the help of members of the Google Speech and Google Brain teams. This corpus is designed for TTS (text - to - speech) research. It is derived from the original LibriSpeech corpus (https://www.openslr.org/12/) - mp3 audio files from LibriVox and text files from Project Gutenberg.
VoxPopuli: The VoxPopuli dataset, published in the scientific paper link, contains:
- 400 thousand hours of untagged voice data for 23 languages.
- 1.8 thousand hours of transcribed speech data for 16 languages.
- 17.3 thousand hours of "speech - to - speech" data.
- 29 hours of transcribed speech data of non - native English speakers, intended for research into accented speech.

Technical Details

Experimental trainings of the SpeechT5 model were carried out with the aim of adapting the basic model for the use of text - to - speech conversion.

As the original SpeechT5 model was trained on tasks exclusively in English (LibriTTS dataset), it was necessary to implement the training of the new model, on the available data in Croatian language. One of the popular open datasets for this use is the VoxPopuli set, which contains sound recordings of the European Parliament from 2009 to 2020. Given that data in all regional languages is not available to the required extent, data in the Croatian language, which is the most represented, was taken from the VoxPopuli dataset. In the next stages of the project, data will be collected in Montenegrin, Serbian and Bosnian languages, in order to improve the quality of training and the accuracy of the model.

Thus, the final dataset consists of 43 transcribed hours of speech, 83 different speakers and 337 thousand transcribed tokens (1 token = 3/4 words).

In the first phase of technical implementation, the dataset went through several stages of processing in order to adapt and standardize it for training the SpeechT5 model. Data processing methods belong to the standard methods of linguistic data manipulation in the field of natural language processing (vocabulary formation, tokenization, removal or conversion of unsupported characters/letters, text/speech cleaning, text normalization).

In the next phase, the statistics of speakers in the VoxPopuli dataset were analyzed, based on which speakers with satisfactory text/speech quality and a sufficient number of samples for model training were selected. In this phase, the balancing of the dataset was carried out so that both male and female speakers, with high - quality text/speech samples, were equally represented in the training.

After the preparation of the data, the adjustment and optimization of the hyperparameters of the SpeechT5 model, which are necessary so that the training of the model can be performed quickly and efficiently, with satisfactory accuracy, was started. Several experimental training sessions were performed to obtain the optimal hyperparameters, which were then used in the evaluation phase of the model.

The evaluation of the model, on the dataset intended for testing, showed promising results. The model started to learn on the prepared dataset, but it also showed certain limitations. The main limitation is related to the length of the input text sequence. The model showed the inability to generate speech for long sequences of input text (over 20 words). The limitation was overcome by dividing the input sequence into smaller units and in that form passed to the model for processing. The main reason for the appearance of this limitation lies primarily in the lack of a large amount of data on which it is necessary to fine - tune the model in order to obtain the best possible results.

📄 License

The model is released under the MIT license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご