🚀 MegaTTS 3
MegaTTS 3 is a text-to-speech model built on a sparse alignment enhanced latent diffusion Transformer, enabling zero-shot speech synthesis. Given input text and a speech prompt, it generates high-quality speech and is broadly applicable across many scenarios.
🚀 Quick Start
This document describes how to use the MegaTTS 3 model, covering installation and inference. You can run speech synthesis from the command line or through the Web UI.
✨ Key Features
- Zero-shot speech synthesis: built on a sparse alignment enhanced latent diffusion Transformer.
- Multi-platform support: runs on Linux, on Windows, and in Docker.
- Multiple inference modes: both a command-line interface and a Web UI are provided.
- Accent control: parameters let you control the accent of the generated speech (see the accented TTS examples below).
📦 Installation
Clone the repository
# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
Download the model
huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
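If you prefer downloading from Python rather than the CLI, the equivalent call through huggingface_hub (which the CLI wraps) is a minimal sketch like this:

# Download the checkpoints via huggingface_hub instead of the CLI above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/MegaTTS3", local_dir="./checkpoints")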
Install dependencies (Linux)
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
# Set the root directory
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
# If you hit pydantic errors during inference, check that your pydantic and gradio versions are compatible.
# [Note] If you encounter bugs related to httpx, check whether your environment variable "no_proxy" contains patterns like "::"
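After setup, a quick sanity check (an illustrative sketch, not part of the repo; it assumes torch is installed via requirements.txt) confirms the repository is on the import path and whether a GPU is visible:

# Verify PYTHONPATH and CUDA visibility before running inference.
import os
import torch  # assumed to be pulled in by requirements.txt

print("PYTHONPATH:", os.environ.get("PYTHONPATH", "<unset>"))
print("CUDA available:", torch.cuda.is_available())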
Install dependencies (Windows)
# [The Windows version is currently under testing]
# Comment out the dependency below in requirements.txt:
# # WeTextProcessing==1.0.4.1
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3
# [Optional] For GPU inference, you may need to install a specific PyTorch build for your GPU from https://pytorch.org/.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# [Note] If you encounter errors related to `ffprobe` or `ffmpeg`, install them via `conda install -c conda-forge ffmpeg`
# Set environment variable for root directory
set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # Windows
$env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH" # PowerShell on Windows
conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For conda users
# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0 # Windows
$env:CUDA_VISIBLE_DEVICES=0 # PowerShell on Windows
Install dependencies (Docker)
# [The Docker version is currently under testing]
# ! You should download the pretrained checkpoint before running the following command
docker build . -t megatts3:latest
# For GPU inference
docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
# For CPU inference
docker run -it -p 7929:7929 megatts3:latest
# Visit http://0.0.0.0:7929/ for gradio.
⚠️ Important Notes
For security reasons, we have not uploaded the WaveVAE encoder parameters to the link above. You can therefore only run inference with latents pre-extracted from link 1. To synthesize speech for speaker A, place "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions about our model, please contact us by email.
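As a minimal illustration of this file layout (a sketch, assuming your prompts live under assets/; the script below is hypothetical and not part of the repo), you can verify that every "X.wav" prompt has its pre-extracted "X.npy" latent beside it:

# Check that each speaker prompt "X.wav" has its latent "X.npy" in the same directory.
from pathlib import Path

prompt_dir = Path("assets")  # adjust to wherever your prompts live (assumption)
for wav in sorted(prompt_dir.glob("*.wav")):
    npy = wav.with_suffix(".npy")
    print(f"{wav.name}: {'ok' if npy.exists() else 'missing ' + npy.name}")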
This project is intended primarily for academic purposes. For academic datasets that need evaluation, you can upload them to the voice request queue at link 2 (each clip no longer than 24 seconds). After confirming that your uploaded audio is free of safety issues, we will upload its latent files to link 1 as soon as possible.
In the coming days, we will also prepare and release latent representations for some common TTS benchmarks.
💻 Usage Examples
Basic usage
Standard CLI usage
# p_w (intelligibility weight), t_w (similarity weight). Typically, a noisier prompt requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen
# As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
# will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
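To synthesize many utterances in one go, a simple batch driver can shell out to tts/infer_cli.py using only the flags shown above (a sketch; the prompt paths, texts, and weights are placeholders):

# Hypothetical batch driver around tts/infer_cli.py.
import subprocess

# (prompt wav, target text, p_w, t_w) - values here are illustrative.
jobs = [
    ("assets/Chinese_prompt.wav", "你好,世界。", 1.6, 2.5),
    ("assets/English_prompt.wav", "Hello, world.", 2.0, 3.0),
]
for wav, text, p_w, t_w in jobs:
    subprocess.run([
        "python", "tts/infer_cli.py",
        "--input_wav", wav,
        "--input_text", text,
        "--output_dir", "./gen",
        "--p_w", str(p_w),
        "--t_w", str(t_w),
    ], check=True)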
Accented TTS via CLI
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker’s original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or solving the accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5
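To find the right trade-off between retaining the accent and standardizing pronunciation, it can help to sweep p_w while keeping t_w a couple of points higher, per the guidance above (a sketch using only the flags shown; output paths are placeholders):

# Sweep the intelligibility weight; higher p_w pushes toward standard pronunciation.
import subprocess
from pathlib import Path

for p_w in (1.0, 1.5, 2.0, 2.5):
    out_dir = f"./gen/p_w_{p_w}"
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "python", "tts/infer_cli.py",
        "--input_wav", "assets/English_prompt.wav",
        "--input_text", "这是一条有口音的音频。",
        "--output_dir", out_dir,
        "--p_w", str(p_w),
        "--t_w", str(p_w + 2.0),  # keep t_w 0-3 above p_w, per the note above
    ], check=True)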
Web UI usage
# We also support CPU inference, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
📚 Documentation
- Paper: MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
- Project page (audio samples): https://sditdemo.github.io/sditdemo/
- GitHub repository: https://github.com/bytedance/MegaTTS3
- Demo video
- Hugging Face Space: https://huggingface.co/spaces/ByteDance/MegaTTS3
🔧 Technical Details
This repository contains the forced-alignment version of Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis; the WaveVAE is largely based on WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. Compared with the model described in the paper, this repository includes additional models that not only improve the stability and voice-cloning ability of the algorithm, but can also be used independently in a broader range of scenarios.
BibTeX citation
@article{jiang2025sparse,
title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
journal={arXiv preprint arXiv:2502.18924},
year={2025}
}
@article{ji2024wavtokenizer,
title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
journal={arXiv preprint arXiv:2408.16532},
year={2024}
}
📄 License
This project is licensed under the Apache-2.0 License.
🔒 Security
If you discover a potential security issue in this project, or believe you may have found one, please notify the ByteDance security team via our security center or sec@bytedance.com.
Please do not create public issues.




