🚀 Malayalam Text-to-Speech
This repository provides the Swaram (mal) text-to-speech (TTS) model checkpoint, which converts Malayalam text into natural-sounding speech.
🚀 Quick Start
First, install the necessary libraries (SciPy is also needed to save the generated audio):
pip install --upgrade transformers accelerate scipy
Then, run inference with the following code snippet:
from transformers import VitsModel, AutoTokenizer
import torch
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")
text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).waveform  # shape: (batch_size, num_samples)
The resulting waveform can be saved as a .wav file:
import scipy.io.wavfile
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
Or displayed in a Jupyter Notebook / Google Colab:
from IPython.display import Audio
Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
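The checkpoint runs through the VITS implementation in transformers, so generation is stochastic by default. As a hedged sketch (the speaking_rate and noise_scale attributes come from the transformers VITS/MMS-TTS API and are assumed to carry over to this checkpoint), you can fix a seed and adjust prosody before generating:
from transformers import set_seed
set_seed(555)              # fixes the random state so repeated runs match
model.speaking_rate = 1.2  # values > 1.0 speak faster, < 1.0 slower
model.noise_scale = 0.8    # lower values yield flatter, more stable prosody
with torch.no_grad():
    output = model(**inputs).waveform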
✨ Features
- Advanced Speech Synthesis: Swaram (Stochastic Waveform Adaptive Recurrent Autoencoder for Malayalam) generates speech waveforms directly from input text sequences.
- Conditional VAE Architecture: The model is built on a conditional variational autoencoder (VAE), improving the quality and flexibility of speech generation.
- Stochastic Duration Predictor: To capture the one-to-many nature of TTS, the model includes a stochastic duration predictor, allowing varied speech rhythms from the same text input (see the sketch after this list).
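Because the duration predictor samples durations rather than predicting a single fixed value, re-running synthesis on the same text generally yields clips of different lengths. A minimal sketch, assuming the model, tokenizer, and inputs from the Quick Start are already loaded:
from transformers import set_seed
# Two seeds, same text: the stochastic duration predictor samples a
# different alignment each time, so clip lengths typically differ.
for seed in (0, 42):
    set_seed(seed)
    with torch.no_grad():
        waveform = model(**inputs).waveform
    print(f"seed={seed}: {waveform.shape[-1]} samples")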
📚 Documentation
Model Details
Swaram (Stochastic Waveform Adaptive Recurrent Autoencoder for Malayalam) is an advanced speech synthesis model that generates speech waveforms conditioned on input text sequences. It is based on a conditional variational autoencoder (VAE) architecture.
Swaram's text encoder is built on top of the Wav2Vec2 decoder, and a VAE serves as the decoder. Spectrogram-based acoustic features are predicted by a flow-based module composed of a Transformer-based Contextualizer and cascaded dense layers. The spectrogram is then transformed into a speech waveform by a stack of transposed convolutional layers. To capture the one-to-many nature of TTS, where the same text can be spoken in many ways, the model also includes a stochastic duration predictor, allowing varied speech rhythms from the same text input.
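Because the transposed-convolution stack upsamples acoustic features directly into audio samples, the duration of a generated clip follows from the waveform length and the configured sampling rate. A quick sanity check, reusing output from the Quick Start:
# Duration in seconds = number of audio samples / sampling rate.
num_samples = output.shape[-1]
duration_s = num_samples / model.config.sampling_rate
print(f"{num_samples} samples at {model.config.sampling_rate} Hz ~ {duration_s:.2f} s")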
Architecture

📦 Installation
pip install --upgrade transformers accelerate scipy
💻 Usage Examples
Basic Usage
from transformers import VitsModel, AutoTokenizer
import torch
model = VitsModel.from_pretrained("aoxo/swaram")
tokenizer = AutoTokenizer.from_pretrained("aoxo/swaram")
text = "കള്ളാ കടയാടി മോനെ"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs).waveform  # shape: (batch_size, num_samples)
Saving the Output
import scipy.io.wavfile
scipy.io.wavfile.write("kadayadi_mone.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
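scipy.io.wavfile.write stores the float32 tensor as a 32-bit float WAV. If a player expects standard 16-bit PCM, a common variant (a sketch, not part of the original example) is to rescale and cast before writing:
import numpy as np
import scipy.io.wavfile
# Clip the float waveform (roughly in [-1, 1]) and convert to 16-bit PCM.
pcm = np.clip(output.squeeze().numpy(), -1.0, 1.0)
pcm = (pcm * 32767).astype(np.int16)
scipy.io.wavfile.write("kadayadi_mone_int16.wav", rate=model.config.sampling_rate, data=pcm)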
Displaying in Notebook
from IPython.display import Audio
Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
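For longer scripts it can be convenient to synthesize several sentences at once. The sketch below assumes the tokenizer supports padded batches (standard transformers behavior, but untested for this checkpoint); shorter items in the batch are padded, so their waveforms may end in silence:
# Batch several inputs (the same sentence twice here, purely for illustration).
texts = [text, text]
batch = tokenizer(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    waveforms = model(**batch).waveform  # (batch_size, max_num_samples)
for i, wav in enumerate(waveforms):
    scipy.io.wavfile.write(f"clip_{i}.wav", rate=model.config.sampling_rate, data=wav.numpy())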
📄 License
The model is licensed under CC BY-NC 4.0.