🚀 Malayalam Text-to-Speech
This repository provides the Swaram (mal) text-to-speech (TTS) model checkpoint, which converts Malayalam text into natural-sounding speech.
🚀 Quick Start
First, install the necessary libraries (SciPy is also needed to save the generated audio):
pip install --upgrade transformers accelerate scipy
Then, run inference with the following code snippet:
from transformers import VitsModel, AutoTokenizer
import torch
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")
text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).waveform  # shape: (batch_size, num_samples)
The resulting waveform can be saved as a .wav file:
import scipy.io.wavfile
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
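The checkpoint runs through the VITS implementation in transformers, so generation is stochastic by default. As a hedged sketch (the speaking_rate and noise_scale attributes come from the transformers VITS/MMS-TTS API and are assumed to carry over to this checkpoint), you can fix a seed and adjust prosody before generating:
from transformers import set_seed
set_seed(555)              # fixes the random state so repeated runs match
model.speaking_rate = 1.2  # values > 1.0 speak faster, < 1.0 slower
model.noise_scale = 0.8    # lower values yield flatter, more stable prosody
with torch.no_grad():
    output = model(**inputs).waveform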
✨ Features
- Advanced Speech Synthesis: Swaram (Stochastic Waveform Adaptive Recurrent Autoencoder for Malayalam) generates speech waveforms directly from input text sequences.
- Conditional VAE Architecture: The model is built on a conditional variational autoencoder (VAE), improving the quality and flexibility of speech generation.
- Stochastic Duration Predictor: To capture the one-to-many nature of TTS, the model includes a stochastic duration predictor, allowing varied speech rhythms from the same text input (see the sketch after this list).
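Because the duration predictor samples durations rather than predicting a single fixed value, re-running synthesis on the same text generally yields clips of different lengths. A minimal sketch, assuming the model, tokenizer, and inputs from the Quick Start are already loaded:
from transformers import set_seed
# Two seeds, same text: the stochastic duration predictor samples a
# different alignment each time, so clip lengths typically differ.
for seed in (0, 42):
    set_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform
    print(f"seed={seed}: {waveform.shape[-1]} samples")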
📚 Documentation
Model Details
Swaram (Stochastic Waveform Adaptive Recurrent Autoencoder for Malayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.
Swaram's text encoder is built on top of the Wav2Vec2 decoder, and a VAE serves as the decoder. Spectrogram-based acoustic features are predicted by a flow-based module composed of a Transformer-based Contextualizer and cascaded dense layers. The spectrogram is then transformed into a speech waveform by a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in many ways, the model also includes a stochastic duration predictor, allowing varied speech rhythms from the same text input.
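Because the transposed-convolution stack upsamples acoustic features directly into audio samples, the duration of a generated clip follows from the waveform length and the configured sampling rate. A quick sanity check, reusing output from the Quick Start:
# Duration in seconds = number of audio samples / sampling rate.
num_samples = output.shape[-1]
duration_s = num_samples / model.config.sampling_rate
print(f"{num_samples} samples at {model.config.sampling_rate} Hz ~ {duration_s:.2f} s")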
Architecture

📦 Installation
pip install --upgrade transformers accelerate scipy
💻 Usage Examples
Basic Usage
from transformers import VitsModel, AutoTokenizer
import torch
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")
text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).waveform  # shape: (batch_size, num_samples)
Saving the Output
import scipy.io.wavfile
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
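scipy.io.wavfile.write stores the float32 tensor as a 32-bit float WAV. If a player expects standard 16-bit PCM, a common variant (a sketch, not part of the original example) is to rescale and cast before writing:
import numpy as np
import scipy.io.wavfile
# Clip the float waveform (roughly in [-1, 1]) and convert to 16-bit PCM.
pcm = np.clip(output.squeeze().numpy(), -1.0, 1.0)
pcm = (pcm * 32767).astype(np.int16)
scipy.io.wavfile.write("kadayadi_mone_int16.wav", rate=model.config.sampling_rate, data=pcm)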
Displaying in Notebook
from IPython.display import Audio
Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
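For longer scripts it can be convenient to synthesize several sentences at once. The sketch below assumes the tokenizer supports padded batches (standard transformers behavior, but untested for this checkpoint); shorter items in the batch are padded, so their waveforms may end in silence:
# Batch several inputs (the same sentence twice here, purely for illustration).
texts = [text, text]
batch = tokenizer(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    waveforms = model(**batch).waveform  # (batch_size, max_num_samples)
for i, wav in enumerate(waveforms):
    scipy.io.wavfile.write(f"clip_{i}.wav", rate=model.config.sampling_rate, data=wav.numpy())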
📄 License
The model is licensed under CC BY-NC 4.0.