XTTS-v2 Open-source Voice Generation Model - Supports 17 Languages, Cross-lingual Synthesis with 6-second Voice Cloning

XTTS V2

Developed by shadialhakimi

ⓍTTS-v2 is an advanced voice generation model that supports 17 languages. It can clone voices and achieve cross-lingual voice synthesis with just a 6-second audio clip.

Speech Synthesis Open Source License: Other #6-second voice cloning #Multilingual voice generation #Emotional style transfer

Downloads 6

Release Time : 10/24/2024

Model Overview

XTTS-v2 is a text-to-speech model developed by Coqui AI. It offers high-quality voice synthesis, voice cloning, and cross-lingual synthesis, carries over the emotion and speaking style of the reference voice, and outputs audio at a 24 kHz sampling rate.

Model Features

Multilingual support

Supports voice synthesis and voice cloning in 17 languages

Fast voice cloning

Can clone the target voice with just a 6-second audio clip

Cross-lingual conversion

Can use the cloned voice for voice synthesis in different languages

Emotional style transfer

Carries the emotion and speaking style of the reference voice over to the synthesized speech

High-quality output

Generates audio at a 24 kHz sampling rate

Model Capabilities

Text-to-speech

Voice cloning

Cross-lingual voice synthesis

Emotional style conversion

Multi-speaker interpolation

Use Cases

Content creation

Audiobook production

Use the cloned voice to dub audiobooks in different languages

Maintain a consistent narrative voice while supporting multilingual versions

Video dubbing

Generate multilingual dubbing for video content

Quickly create localized content

Assistive technology

Voice assistive devices

Provide personalized voice options for voice assistive devices

Enhance user experience and accessibility

Education

Language learning

Generate pronunciation examples in the target language

Help learners master correct pronunciation

🚀 ⓍTTS

ⓍTTS is a voice generation model that lets you clone voices into different languages using just a quick 6-second audio clip. There is no need for hours of training data. This is the same model as, or a similar one to, the model that powers Coqui Studio and Coqui API.

🚀 Quick Start

The codebase supports inference and fine-tuning. You can also try the model in the following demo spaces:

XTTS Space: You can see how the model performs on supported languages and try it with your own reference or microphone input.
XTTS Voice Chat with Mistral or Zephyr: You can experience streaming voice chat with Mistral 7B Instruct or Zephyr 7B Beta.

✨ Features

Supports 17 languages.
Enables voice cloning with just a 6-second audio clip.
Allows emotion and style transfer by cloning.
Supports cross-language voice cloning.
Facilitates multilingual speech generation.
Has a 24 kHz sampling rate.
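Since cloning relies on a short reference clip, it can help to check the clip's length before passing it to the model. The following is a minimal sketch using only the Python standard library; the helper names and the 6-second threshold constant are illustrative assumptions based on the guidance above, not part of the Coqui API, and the check only handles uncompressed PCM WAV files.

```python
import wave

# Assumption: threshold taken from the "6-second audio clip" guidance above.
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Return True if the clip lasts at least `minimum` seconds."""
    return wav_duration_seconds(path) >= minimum
```

A caller might run `is_long_enough("/path/to/target/speaker.wav")` and ask for a longer recording when it returns False.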

🆕 Updates over XTTS-v1

Added support for 2 new languages: Hungarian and Korean.
Made architectural improvements for speaker conditioning.
Enabled the use of multiple speaker references and interpolation between speakers.
Improved stability.
Enhanced prosody and audio quality across the board.

🌐 Languages

XTTS-v2 supports 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi). Stay tuned as we continue to add support for more languages. If you have any language requests, feel free to reach out!
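The language codes listed above are the values passed as `language` in the usage examples. As a small sketch, the list can be kept in a lookup table and checked before synthesis; the `SUPPORTED_LANGUAGES` table and `normalize_language` helper are hypothetical names for illustration and are not provided by the library.

```python
# Language codes copied from the list above (17 entries).
SUPPORTED_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian",
    "ko": "Korean", "hi": "Hindi",
}


def normalize_language(code: str) -> str:
    """Normalize a language code and raise ValueError if it is not listed."""
    normalized = code.strip().lower().replace("_", "-")
    if normalized not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return normalized
```

Validating early gives a readable error before the model is loaded, which matters because loading the checkpoint is the slow step.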

💻 Usage Examples

Basic Usage

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
                file_path="output.wav",
                speaker_wav="/path/to/target/speaker.wav",
                language="en")

Advanced Usage

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model configuration and checkpoint from a local XTTS directory
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
model.cuda()  # move the model to the GPU

# Synthesize speech in the voice of the reference clip
outputs = model.synthesize(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    config,
    speaker_wav="/data/TTS-public/_refclips/3.wav",  # reference clip of the target speaker
    gpt_cond_len=3,
    language="en",
)
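The advanced example stops at `outputs` without saving any audio. The sketch below shows one way to write a waveform to a 24 kHz WAV file with the standard library. It assumes that `outputs["wav"]` holds mono floating-point samples in the range [-1.0, 1.0]; that layout is an assumption to verify against your installed version. The `write_wav` helper is a hypothetical name, not a library function.

```python
import array
import sys
import wave

SAMPLE_RATE = 24_000  # XTTS-v2 output sampling rate stated in this document


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV."""
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Assumed usage with the result of model.synthesize(...):
# write_wav("xtts_output.wav", outputs["wav"])
```

Clipping before scaling guards against occasional samples slightly outside the nominal range, which would otherwise overflow the 16-bit conversion.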

Command-line Usage

 tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
     --text "Bugün okula gitmek istemiyorum." \
     --speaker_wav /path/to/target/speaker.wav \
     --language_idx tr \
     --use_cuda true
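When the command-line tool is driven from a script, building the argument list in Python avoids shell quoting problems with non-ASCII text. The sketch below only assembles the command using the flags shown above; `build_tts_command` is a hypothetical helper, and the `--use_cuda` flag is simply omitted for CPU runs rather than assuming how the CLI parses a false value.

```python
import shlex

MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def build_tts_command(text, speaker_wav, language, use_cuda=True):
    """Return the `tts` CLI invocation as an argument list."""
    command = [
        "tts",
        "--model_name", MODEL_NAME,
        "--text", text,
        "--speaker_wav", speaker_wav,
        "--language_idx", language,
    ]
    if use_cuda:
        command += ["--use_cuda", "true"]
    return command


command = build_tts_command(
    "Bugün okula gitmek istemiyorum.", "/path/to/target/speaker.wav", "tr"
)
print(shlex.join(command))  # shell-safe form, useful for logging
# To execute it once the `tts` CLI is installed:
# import subprocess; subprocess.run(command, check=True)
```

Passing the list directly to `subprocess.run` bypasses the shell, so the Turkish sentence needs no extra escaping.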

📄 License

This model is licensed under the Coqui Public Model License (CPML). There's a lot that goes into a license for generative models, and Coqui has written about the origin story of CPML.

📞 Contact

Come and join our 🐸Community. We're active on Discord and Twitter. You can also email us at info@coqui.ai.

Property	Details
Library Name	coqui
Pipeline Tag	text-to-speech
License Name	coqui-public-model-license
License Link	https://coqui.ai/cpml
