# 🚀 ⓍTTS

ⓍTTS is a voice generation model that enables voice cloning across different languages using just a 6-second audio clip, eliminating the need for extensive training data spanning countless hours. It powers Coqui Studio and Coqui API.
## 🚀 Quick Start

The codebase supports inference and fine-tuning. You can also try the model through the available demo spaces.
## ✨ Features

- Supports 16 languages.
- Enables voice cloning with just a 6-second audio clip.
- Allows emotion and style transfer through cloning.
- Supports cross-language voice cloning.
- Facilitates multilingual speech generation.
- Operates at a 24 kHz sampling rate.
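The clip-length and sampling-rate figures above pin down exactly how much audio the model needs. A small illustrative helper (not part of 🐸TTS) that converts a clip duration to a sample count at the model's rate:

```python
def clip_samples(seconds: float, sample_rate: int = 24_000) -> int:
    """Number of audio samples in a clip of the given duration."""
    return int(seconds * sample_rate)

# A 6-second reference clip at the model's 24 kHz rate:
print(clip_samples(6))  # 144000
```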
## 🆕 Updates over XTTS-v1

- Added support for 2 new languages: Hungarian and Korean.
- Improved the architecture for speaker conditioning.
- Allows the use of multiple speaker references and interpolation between speakers.
- Enhanced stability.
- Improved prosody and audio quality across the board.
## 🌐 Languages

XTTS-v2 supports 16 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).

Stay tuned as we continue to add support for more languages. If you have any language requests, feel free to reach out!
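When calling the API programmatically, passing an unsupported `language` code is an easy mistake. A minimal sketch of a guard — the code list is transcribed from this section; the helper itself is illustrative and not part of 🐸TTS:

```python
# ISO codes for the 16 languages listed above (note zh-cn for Chinese).
XTTS_V2_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko",
}

def check_language(code: str) -> str:
    """Return `code` if XTTS-v2 supports it, otherwise raise ValueError."""
    if code not in XTTS_V2_LANGUAGES:
        raise ValueError(f"XTTS-v2 does not support language {code!r}")
    return code
```

Validating up front gives a clear error before any model loading or synthesis work is done.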
## 💻 Usage Examples

### Basic Usage

#### Using 🐸TTS API
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Clone the voice in `speaker_wav` and write English speech to `output.wav`.
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav="/path/to/target/speaker.wav",
    language="en",
)

# The same call with an explicit number of decoder iterations.
tts.tts_to_file(
    text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
    file_path="output.wav",
    speaker_wav="/path/to/target/speaker.wav",
    language="en",
    decoder_iterations=30,
)
```
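When synthesizing several sentences with calls like the ones above, each `tts_to_file` call needs its own `file_path`. A small illustrative helper (not part of 🐸TTS) for generating numbered output paths:

```python
from pathlib import Path

def numbered_outputs(out_dir: str, count: int, stem: str = "output") -> list:
    """Build `count` numbered .wav paths like out_dir/output_000.wav."""
    return [str(Path(out_dir) / f"{stem}_{i:03d}.wav") for i in range(count)]

# Example: pair each sentence with its own output file in a loop.
# for text, path in zip(sentences, numbered_outputs("clips", len(sentences))):
#     tts.tts_to_file(text=text, file_path=path,
#                     speaker_wav="/path/to/target/speaker.wav", language="en")
```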
#### Using 🐸TTS Command line

```bash
# Synthesize Turkish speech with a cloned voice.
# The sample text means "I don't want to go to school today."
tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "Bugün okula gitmek istemiyorum." \
    --speaker_wav /path/to/target/speaker.wav \
    --language_idx tr \
    --use_cuda true
```
### Advanced Usage

#### Using the model directly
```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model configuration and checkpoint.
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
model.cuda()

# Synthesize speech conditioned on a reference clip.
outputs = model.synthesize(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    config,
    speaker_wav="/data/TTS-public/_refclips/3.wav",
    gpt_cond_len=3,
    language="en",
)
```
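Unlike `tts_to_file`, the direct `synthesize` call returns the raw waveform instead of writing a file. Assuming the returned dictionary exposes the audio as float samples under a `"wav"` key (check your installed version), you can write it as mono 16-bit PCM at the model's 24 kHz rate. This sketch uses a dummy sine wave in place of real model output so it runs standalone:

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # XTTS output rate

def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as a mono 16-bit PCM WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 16-bit
        f.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(frames)

# Stand-in for `outputs["wav"]`: one second of a 440 Hz tone.
dummy = [
    0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    for t in range(SAMPLE_RATE)
]
save_wav(dummy, "demo.wav")
```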
## 📄 License

This model is licensed under the Coqui Public Model License (CPML). There's a lot that goes into a license for generative models, and you can read more about the origin story of the CPML here.
## 📞 Contact

Come join our 🐸Community. We're active on Discord and Twitter. You can also email us at info@coqui.ai.