ADIA_TTS is an open-source Wolof text-to-speech (TTS) model developed by CONCREE. Built on the parler-tts-mini-multilingual-v1.1 model, it generates natural, fluent speech whose voice characteristics can be controlled through short text descriptions, marking significant progress in Wolof speech synthesis.
Model Features
Multilingual Support
Built on the multilingual parler-tts-mini-multilingual-v1.1 base model; this release is fine-tuned specifically for Wolof.
High-Quality Speech Synthesis
Generates natural and fluent speech suitable for various application scenarios.
Voice Style Control
Controls voice characteristics through descriptions, such as clear, professional, or educational tones.
Efficient Training
Trained on 40 hours of Wolof speech data and fine-tuned for 100 epochs (about 168 hours of training).
Model Capabilities
Wolof text-to-speech
Voice style control
High-quality speech generation
Use Cases
Education
Language Learning
Synthesizes speech for Wolof learning materials, producing clear, pedagogical audio that helps learners improve listening comprehension.
Professional Applications
Formal Speeches
Generates professional, clear, and composed speech suitable for formal occasions.
Daily Applications
Natural Conversations
Generates warm, natural speech with a fluid, conversational delivery, well suited to everyday conversations and interactions.
🚀 Adia_TTS Wolof
Adia_TTS is an open-source Wolof text-to-speech model developed by CONCREE. Based on the parler-tts-mini-multilingual-v1.1 model, it represents a significant advancement in Wolof speech synthesis.
✨ Features
Trained on 40 hours of Wolof speech data.
Fine-tuned for 100 epochs (~168 hours of training).
Natural and fluent voice quality.
Single voice, with voice characteristics controllable via a text description.
🚀 Quick Start
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
device = "cuda:0"if torch.cuda.is_available() else"cpu"# Loading the model
model = ParlerTTSForConditionalGeneration.from_pretrained("CONCREE/Adia_TTS").to(device)
tokenizer = AutoTokenizer.from_pretrained("CONCREE/Adia_TTS")
# Wolof text to synthesize
text = "Entreprenariat ci Senegal dafa am solo lool ci yokkuteg koom-koom, di gëna yokk liggéey ak indi gis-gis yu bees ci dëkk bi."# Vocal style description
description = "A clear and educational voice, with a flow adapted to learning"# Generation
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
audio = model.generate(
    input_ids=input_ids,
    prompt_input_ids=prompt_ids,
)
# Saving
sf.write("output.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
Advanced Usage
generation_config = {
    "temperature": 0.8,         # Controls the variability of the output
    "max_new_tokens": 1000,     # Maximum length of the generated sequence
    "do_sample": True,          # Enables random sampling
    "top_k": 50,                # Limits the number of considered tokens
    "repetition_penalty": 1.2,  # Penalizes token repetition
}

audio = model.generate(
    input_ids=input_ids,
    prompt_input_ids=prompt_ids,
    **generation_config,
)
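As a rule of thumb, lowering temperature makes delivery more consistent across runs, while raising repetition_penalty discourages looping artifacts on longer inputs; the values above are starting points to tune by ear.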
Different Vocal Styles
Natural Voice
description = "A warm and natural voice, with a conversational flow"
Professional Voice
description = "A professional, clear and composed voice, perfect for formal presentations"
Educational Voice
description = "A clear and educational voice, with a flow adapted to learning"
📚 Documentation
Technical Specifications
Model Type: parler-tts-mini-multilingual-v1.1
Model Size: 1.88 GB
Model Format: PyTorch
Sampling Frequency: 24 kHz
Audio Encoding: 16-bit PCM
Performance
Average Inference Time: seconds/sentence (CPU), 20 seconds/sentence (GPU)
Memory Consumption: 3.9 GB RAM (recommended minimum)
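A saved file can be checked against these figures with soundfile; this is a small illustrative sketch, not part of the model's API:

import soundfile as sf

# Inspect a file produced by the quick start example
info = sf.info("output.wav")
print(info.samplerate)  # expected: 24000 (24 kHz)
print(info.subtype)     # expected: PCM_16 (16-bit PCM, soundfile's WAV default)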
Limitations
Reduced performance on very long sentences.
Limited handling of numbers and dates.
Relatively long initial model loading time.
The model is limited to a maximum of 200 characters per inference; longer texts require manual segmentation.
The quality of transitions between segments may vary depending on the chosen segmentation method.
Segmenting text at natural boundaries (sentences, paragraphs) gives the best results; a sketch of one such approach follows this list.
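A minimal segmentation sketch, assuming the model, tokenizer, device, description, and sf from the quick start are in scope and that long_text holds the text to synthesize (split_text is a hypothetical helper, not part of the model's API):

import re
import numpy as np

def split_text(text, max_chars=200):
    # Hypothetical helper: split on sentence-ending punctuation, then pack
    # sentences into chunks of at most max_chars characters. A single
    # sentence longer than max_chars still becomes its own (oversized) chunk.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Synthesize each chunk separately, then concatenate the audio segments.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
segments = []
for chunk in split_text(long_text):
    prompt_ids = tokenizer(chunk, return_tensors="pt").input_ids.to(device)
    audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
    segments.append(audio.cpu().numpy().squeeze())

sf.write("long_output.wav", np.concatenate(segments), model.config.sampling_rate)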
References
@misc{CONCREE-2024-Adia_TTS,
author = {CONCREE},
title = {Adia_TTS},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/CONCREE/Adia_TTS}}
}
@misc{lyth2024natural,
title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
author={Dan Lyth and Simon King},
year={2024},
eprint={2402.01912},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
📄 License
This project is licensed under the Apache 2.0 license. See the LICENSE file for more details.
Usage Conditions
Users commit to using the model in a way that respects the Wolof language and Senegalese culture.
We encourage the use of this model to develop solutions that improve digital accessibility for Wolof speakers and contribute to reducing the digital divide. Projects aiming at digital inclusion are particularly welcome.
Any use of the model must mention CONCREE as the original creator. Users are strongly encouraged to share their improvements with the community.
Commercial use is permitted under the terms of the Apache 2.0 license.