🚀 MegaTTS 3
MegaTTS 3 is a text-to-speech model built on a sparse alignment enhanced latent diffusion Transformer, enabling zero-shot speech synthesis. Given input text and a speech prompt, it generates high-quality speech and is broadly applicable across many scenarios.
🚀 Quick Start
This document describes how to use the MegaTTS 3 model, covering installation and inference. You can run speech synthesis from the command line or through the Web UI.
✨ Key Features
- Zero-shot speech synthesis: built on a sparse alignment enhanced latent diffusion Transformer.
- Multi-platform support: runs on Linux, on Windows, and in Docker.
- Multiple inference modes: both a command-line interface and a Web UI are provided.
- Accent control: parameters let you control the accent of the generated speech (see the accented TTS examples below).
📦 Installation
Clone the repository
# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
Download the model
huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
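If you prefer downloading from Python rather than the CLI, the equivalent call through huggingface_hub (which the CLI wraps) is a minimal sketch like this:

# Download the checkpoints via huggingface_hub instead of the CLI above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/MegaTTS3", local_dir="./checkpoints")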
Install dependencies (Linux)
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
# Set the root directory
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
# If you hit pydantic errors during inference, check that your pydantic and gradio versions are compatible.
# [Note] If you encounter bugs related to httpx, check whether your environment variable "no_proxy" contains patterns like "::"
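After setup, a quick sanity check (an illustrative sketch, not part of the repo; it assumes torch is installed via requirements.txt) confirms the repository is on the import path and whether a GPU is visible:

# Verify PYTHONPATH and CUDA visibility before running inference.
import os
import torch  # assumed to be pulled in by requirements.txt

print("PYTHONPATH:", os.environ.get("PYTHONPATH", "<unset>"))
print("CUDA available:", torch.cuda.is_available())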
Install dependencies (Windows)
# [The Windows version is currently under testing]
# Comment out the dependency below in requirements.txt:
# # WeTextProcessing==1.0.4.1
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3
# [Optional] For GPU inference, you may need to install a specific PyTorch build for your GPU from https://pytorch.org/.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# [Note] If you encounter errors related to `ffprobe` or `ffmpeg`, install them via `conda install -c conda-forge ffmpeg`
# Set environment variable for root directory
set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # Windows
$env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH" # PowerShell on Windows
conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For conda users
# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0 # Windows
$env:CUDA_VISIBLE_DEVICES=0 # PowerShell on Windows
Install dependencies (Docker)
# [The Docker version is currently under testing]
# ! You should download the pretrained checkpoint before running the following command
docker build . -t megatts3:latest
# For GPU inference
docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
# For CPU inference
docker run -it -p 7929:7929 megatts3:latest
# Visit http://0.0.0.0:7929/ for gradio.
⚠️ Important Notes
For security reasons, we have not uploaded the WaveVAE encoder parameters to the link above. You can therefore only run inference with latents pre-extracted from link 1. To synthesize speech for speaker A, place "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions about our model, please contact us by email.
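As a minimal illustration of this file layout (a sketch, assuming your prompts live under assets/; the script below is hypothetical and not part of the repo), you can verify that every "X.wav" prompt has its pre-extracted "X.npy" latent beside it:

# Check that each speaker prompt "X.wav" has its latent "X.npy" in the same directory.
from pathlib import Path

prompt_dir = Path("assets")  # adjust to wherever your prompts live (assumption)
for wav in sorted(prompt_dir.glob("*.wav")):
    npy = wav.with_suffix(".npy")
    print(f"{wav.name}: {'ok' if npy.exists() else 'missing ' + npy.name}")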
This project is intended primarily for academic purposes. For academic datasets that need evaluation, you can upload them to the voice request queue at link 2 (each clip no longer than 24 seconds). After confirming that your uploaded audio is free of safety issues, we will upload its latent files to link 1 as soon as possible.
In the coming days, we will also prepare and release latent representations for some common TTS benchmarks.
💻 Usage Examples
Basic usage
Standard CLI usage
# p_w (intelligibility weight), t_w (similarity weight). Typically, a noisier prompt requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen
# As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
# will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
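To synthesize many utterances in one go, a simple batch driver can shell out to tts/infer_cli.py using only the flags shown above (a sketch; the prompt paths, texts, and weights are placeholders):

# Hypothetical batch driver around tts/infer_cli.py.
import subprocess

# (prompt wav, target text, p_w, t_w) - values here are illustrative.
jobs = [
    ("assets/Chinese_prompt.wav", "你好,世界。", 1.6, 2.5),
    ("assets/English_prompt.wav", "Hello, world.", 2.0, 3.0),
]
for wav, text, p_w, t_w in jobs:
    subprocess.run([
        "python", "tts/infer_cli.py",
        "--input_wav", wav,
        "--input_text", text,
        "--output_dir", "./gen",
        "--p_w", str(p_w),
        "--t_w", str(t_w),
    ], check=True)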
Accented TTS via CLI
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker’s original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or solving the accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5
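To find the right trade-off between retaining the accent and standardizing pronunciation, it can help to sweep p_w while keeping t_w a couple of points higher, per the guidance above (a sketch using only the flags shown; output paths are placeholders):

# Sweep the intelligibility weight; higher p_w pushes toward standard pronunciation.
import subprocess
from pathlib import Path

for p_w in (1.0, 1.5, 2.0, 2.5):
    out_dir = f"./gen/p_w_{p_w}"
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "python", "tts/infer_cli.py",
        "--input_wav", "assets/English_prompt.wav",
        "--input_text", "这是一条有口音的音频。",
        "--output_dir", out_dir,
        "--p_w", str(p_w),
        "--t_w", str(p_w + 2.0),  # keep t_w 0-3 above p_w, per the note above
    ], check=True)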
Web UI usage
# We also support CPU inference, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
📚 Documentation
- Paper: MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
- Project page (audio samples): https://sditdemo.github.io/sditdemo/
- GitHub repository: https://github.com/bytedance/MegaTTS3
- Demo video
- Hugging Face Space: https://huggingface.co/spaces/ByteDance/MegaTTS3
🔧 Technical Details
This repository contains the forced-alignment version of Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis; the WaveVAE is largely based on WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. Compared with the model described in the paper, this repository includes additional models that not only improve the stability and voice-cloning ability of the algorithm, but can also be used independently in a broader range of scenarios.
BibTeX citation
@article{jiang2025sparse,
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
journal={arXiv preprint arXiv:2502.18924},
year={2025}
}
@article{ji2024wavtokenizer,
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
journal={arXiv preprint arXiv:2408.16532},
year={2024}
}
📄 License
This project is licensed under the Apache-2.0 License.
🔒 Security
If you discover a potential security issue in this project, or believe you may have found one, please notify the ByteDance security team via our security center or sec@bytedance.com.
Please do not create public issues.




