🚀 MegaTTS 3
MegaTTS 3 is a text-to-speech model built on a sparse alignment enhanced latent diffusion Transformer that enables zero-shot speech synthesis. Given input text and a speech prompt, it generates high-quality speech and is applicable to a wide range of scenarios.
🚀 Quick Start
This document describes how to use the MegaTTS 3 model, covering installation and inference. You can run speech synthesis from the command line or through the Web UI.
✨ Key Features
- Zero-shot speech synthesis: built on a sparse alignment enhanced latent diffusion Transformer architecture.
- Multi-platform support: runs on Linux, Windows, and Docker.
- Multiple inference modes: both a command-line interface and a Web UI.
- Accent control: the accent of generated speech can be adjusted via inference parameters (see the accented TTS examples below).
📦 Installation
Clone the repository
# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
Model download
huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
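If you prefer to script the download, the huggingface_hub Python API offers an equivalent (a minimal sketch; the repo id and target directory are taken from the command above):
# Programmatic equivalent of the CLI download above (pip install huggingface_hub)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/MegaTTS3", local_dir="./checkpoints")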
Install dependencies (Linux)
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
# Set the root directory
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
# If you encounter pydantic errors during inference, check that your pydantic and gradio versions are compatible.
# [Note] If you encounter bugs related to httpx, check whether your "no_proxy" environment variable contains patterns like "::"
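As an optional sanity check after installation, the snippet below (plain PyTorch, nothing project-specific) verifies that the environment sees the GPU selected via CUDA_VISIBLE_DEVICES:
# Optional sanity check: confirm PyTorch is installed and CUDA is visible
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))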
Install dependencies (Windows)
# [The Windows version is currently under testing]
# Comment out the dependency below in requirements.txt:
# # WeTextProcessing==1.0.4.1
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3
# [Optional] If you want GPU inference, you may need to install specific version of PyTorch for your GPU from https://pytorch.org/.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# [Note] If you encounter bugs related to `ffprobe` or `ffmpeg`, you can install them via `conda install -c conda-forge ffmpeg`
# Set environment variable for root directory
set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # Windows
$env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH" # PowerShell on Windows
conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For conda users
# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0 # Windows
$env:CUDA_VISIBLE_DEVICES=0 # Powershell on Windows
Install dependencies (Docker)
# [The Docker version is currently under testing]
# ! You should download the pretrained checkpoint before running the following command
docker build . -t megatts3:latest
# For GPU inference
docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
# For CPU inference
docker run -it -p 7929:7929 megatts3:latest
# Visit http://0.0.0.0:7929/ for the Gradio UI.
⚠️ Important Notes
For security reasons, we have not uploaded the parameters of the WaveVAE encoder to the link above. You can therefore only run inference with latents pre-extracted from link 1. If you want to synthesize speech for speaker A, place "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions regarding our model, please email us.
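To make the pairing convention concrete, the sketch below (the helper is illustrative, not part of this repo) checks that a prompt's .wav and .npy files sit side by side before inference:
# Hypothetical helper (not part of MegaTTS3): verify the pre-extracted latent
# file accompanies the prompt audio, per the pairing convention above.
from pathlib import Path

def check_prompt_pair(wav_path: str) -> None:
    wav = Path(wav_path)
    npy = wav.with_suffix(".npy")
    for f in (wav, npy):
        if not f.exists():
            raise FileNotFoundError(f"Missing prompt file: {f}")

check_prompt_pair("assets/Chinese_prompt.wav")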
This project is primarily intended for academic purposes. For academic datasets requiring evaluation, you can upload them to the voice request queue at link 2 (each clip no longer than 24 seconds). After confirming that the audio you uploaded is free of safety issues, we will upload its latent files to link 1 as soon as possible.
In the coming days, we will also prepare and release latent representations for some common TTS benchmarks.
💻 Usage Examples
Basic usage
Standard command-line usage
# p_w (intelligibility weight), t_w (similarity weight). A noisier prompt typically requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一邊的桌上,一位讀書人嗤之以鼻道,'佛子三藏,神子燕小魚是什麼樣的人物,李家的那個李子夜如何與他們相提並論?'" --output_dir ./gen
# As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
# will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
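To experiment with these weights in batch, a small wrapper can sweep --t_w over the suggested 2.0~5.0 range (a sketch only; the flags mirror the example commands above and the input text is a placeholder):
# Hypothetical batch wrapper around infer_cli.py; flags are taken from the
# example commands above. Sweeps the similarity weight t_w over 2.0~5.0.
import subprocess

for t_w in (2.0, 3.0, 4.0, 5.0):
    subprocess.run(
        ["python", "tts/infer_cli.py",
         "--input_wav", "assets/English_prompt.wav",
         "--input_text", "A short test sentence.",
         "--output_dir", f"./gen/t_w_{t_w}",
         "--p_w", "2.0", "--t_w", str(t_w)],
        check=True,
    )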
Command-line usage for accented TTS
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker’s original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or solving the accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '這是一條有口音的音頻。' --output_dir ./gen --p_w 1.0 --t_w 3.0
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '這條音頻的發音標準一些了嗎?' --output_dir ./gen --p_w 2.5 --t_w 2.5
Web UI usage
# We also support CPU inference, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
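Once the server is running (port 7929, per the Docker section above), the generic gradio_client package can drive it programmatically; the endpoint's input signature is app-specific, so inspect it first rather than assuming parameter names:
# Sketch using the generic gradio_client package (pip install gradio_client).
# The endpoint's inputs are app-specific, so list them before calling predict().
from gradio_client import Client

client = Client("http://127.0.0.1:7929/")
client.view_api()  # prints the available endpoints and their parameters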
📚 Documentation
- Paper: MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
- Project page (audio samples): https://sditdemo.github.io/sditdemo/
- GitHub repository: https://github.com/bytedance/MegaTTS3
- Demo video: Demo Video
- Hugging Face Space: https://huggingface.co/spaces/ByteDance/MegaTTS3
🔧 Technical Details
This repository contains the forced-alignment version of Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis, and its WavVAE is largely based on Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. Compared with the model described in the paper, this repository contains additional models. They not only enhance the stability and voice-cloning capability of the algorithm, but can also be used independently in a broader range of scenarios.
BibTeX citation
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Bai, Jionghao and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}
@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
📄 License
This project is licensed under the Apache-2.0 License.
🔒 Security
If you discover a potential security issue in this project, or believe you may have found one, please notify the ByteDance security team via our security center or sec@bytedance.com.
Please do not create public issues.