
VITS VCTK

Developed by kakao-enterprise
VITS is an end-to-end speech synthesis model capable of predicting corresponding speech waveforms from input text sequences. The model employs a conditional variational autoencoder (VAE) architecture, including a posterior encoder, decoder, and conditional prior module.
Downloads: 3,601
Release Date: August 31, 2023

Model Overview

VITS is an end-to-end speech synthesis model based on adversarial learning, capable of predicting corresponding speech waveforms from input text sequences. The model uses a conditional variational autoencoder (VAE) architecture, supporting the generation of speech with varying rhythms from the same text.
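As a concrete illustration of the text-to-waveform pipeline described above, here is a minimal inference sketch. It assumes the Hugging Face Transformers port of VITS (VitsModel) and the checkpoint id "kakao-enterprise/vits-vctk"; the input sentence is only a placeholder, and output varies between runs because sampling is stochastic.

```python
import torch
from transformers import VitsModel, AutoTokenizer

# Load the multi-speaker VCTK checkpoint and its tokenizer.
model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")

text = "Hello, this is an end-to-end speech synthesis example."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # The model maps token ids directly to an audio waveform;
    # there is no separate spectrogram-to-audio vocoder step.
    output = model(**inputs)

waveform = output.waveform[0]               # 1-D tensor of audio samples
sampling_rate = model.config.sampling_rate  # sampling rate stored in the checkpoint config
```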

Model Features

End-to-end speech synthesis
Predicts speech waveforms directly from input text sequences, without a separate acoustic model and vocoder pipeline.
Conditional variational autoencoder architecture
Employs a conditional variational autoencoder (VAE) architecture comprising a posterior encoder, a decoder, and a conditional prior module.
Stochastic duration predictor
Introduces a stochastic duration predictor, enabling the same text to be synthesized with varying rhythms.
Multi-speaker support
Offers single-speaker and multi-speaker versions; the multi-speaker VCTK checkpoint covers 109 speakers with a range of English accents (see the sketch after this list).
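The sketch below illustrates the multi-speaker and rhythm controls listed above, again assuming the Transformers port of this checkpoint; the speaker index and the knob values are illustrative, not recommended settings.

```python
import torch
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")

inputs = tokenizer(
    "The same sentence, spoken with a different voice and rhythm.",
    return_tensors="pt",
)

# Attributes exposed by the Transformers VITS implementation; they feed the
# stochastic duration predictor and the prior, which is what produces
# different rhythms from the same text.
model.speaking_rate = 0.9          # <1.0 slows the speech down, >1.0 speeds it up
model.noise_scale = 0.667          # variability of the prior latent
model.noise_scale_duration = 0.8   # variability of the predicted durations

with torch.no_grad():
    # speaker_id selects one of the 109 VCTK speakers (0-108); 10 is arbitrary here.
    waveform = model(**inputs, speaker_id=10).waveform[0]
```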

Model Capabilities

Text-to-speech
Multi-speaker speech synthesis
Variable rhythm speech generation

Use Cases

Speech synthesis
Voice assistants
Provides natural speech synthesis capabilities for voice assistants.
Generates natural and fluent speech output.
Audiobooks
Converts text content into speech for audiobook production.
Supports speech generation with different rhythms and accents, as shown in the sketch below.
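For the audiobook use case, a simple batch loop such as the following hypothetical sketch can turn a list of sentences into WAV files; it assumes scipy is installed and reuses the same checkpoint as above.

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("kakao-enterprise/vits-vctk")
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-vctk")

sentences = [
    "Chapter one.",
    "It was a quiet morning when the story began.",
]

for i, sentence in enumerate(sentences):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform[0]
    # Write each synthesized sentence to its own WAV file at the model's sampling rate.
    scipy.io.wavfile.write(
        f"audiobook_{i:03d}.wav",
        rate=model.config.sampling_rate,
        data=waveform.numpy(),
    )
```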