🚀 MegaTTS 3
MegaTTS 3 is a text-to-speech model built on a sparse alignment enhanced latent diffusion Transformer that enables zero-shot speech synthesis. Given input text and a speech prompt, it generates high-quality speech and is applicable to a wide range of scenarios.
🚀 Quick Start
This document describes how to use the MegaTTS 3 model, covering installation and inference. You can run speech synthesis from the command line or through the Web UI.
✨ Key Features
- Zero-shot speech synthesis: built on a sparse alignment enhanced latent diffusion Transformer architecture.
- Multi-platform support: runs on Linux, Windows, and Docker.
- Multiple inference modes: both a command-line interface and a Web UI.
- Accent control: the accent of generated speech can be adjusted via inference parameters (see the accented TTS examples below).
📦 Installation
Clone the repository
# Clone the repository
git clone https://github.com/bytedance/MegaTTS3
cd MegaTTS3
Model download
huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
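If you prefer to script the download, the huggingface_hub Python API offers an equivalent (a minimal sketch; the repo id and target directory are taken from the command above):
# Programmatic equivalent of the CLI download above (pip install huggingface_hub)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/MegaTTS3", local_dir="./checkpoints")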
Install dependencies (Linux)
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
# Set the root directory
export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
# If you encounter pydantic errors during inference, check that your pydantic and gradio versions are compatible.
# [Note] If you encounter bugs related to httpx, check whether your "no_proxy" environment variable contains patterns like "::"
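As an optional sanity check after installation, the snippet below (plain PyTorch, nothing project-specific) verifies that the environment sees the GPU selected via CUDA_VISIBLE_DEVICES:
# Optional sanity check: confirm PyTorch is installed and CUDA is visible
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))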
Install dependencies (Windows)
# [The Windows version is currently under testing]
# Comment out the dependency below in requirements.txt:
# # WeTextProcessing==1.0.4.1
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n megatts3-env python=3.10
conda activate megatts3-env
pip install -r requirements.txt
conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3
# [Optional] If you want GPU inference, you may need to install specific version of PyTorch for your GPU from https://pytorch.org/.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# [Note] If you encounter bugs related to `ffprobe` or `ffmpeg`, you can install them via `conda install -c conda-forge ffmpeg`
# Set environment variable for root directory
set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # Windows
$env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH" # PowerShell on Windows
conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For conda users
# [Optional] Set GPU
set CUDA_VISIBLE_DEVICES=0 # Windows
$env:CUDA_VISIBLE_DEVICES=0 # Powershell on Windows
Install dependencies (Docker)
# [The Docker version is currently under testing]
# ! You should download the pretrained checkpoint before running the following command
docker build . -t megatts3:latest
# For GPU inference
docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
# For CPU inference
docker run -it -p 7929:7929 megatts3:latest
# Visit http://0.0.0.0:7929/ for the Gradio UI.
⚠️ Important Notes
For security reasons, we have not uploaded the parameters of the WaveVAE encoder to the link above. You can therefore only run inference with latents pre-extracted from link 1. If you want to synthesize speech for speaker A, place "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions regarding our model, please email us.
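To make the pairing convention concrete, the sketch below (the helper is illustrative, not part of this repo) checks that a prompt's .wav and .npy files sit side by side before inference:
# Hypothetical helper (not part of MegaTTS3): verify the pre-extracted latent
# file accompanies the prompt audio, per the pairing convention above.
from pathlib import Path

def check_prompt_pair(wav_path: str) -> None:
    wav = Path(wav_path)
    npy = wav.with_suffix(".npy")
    for f in (wav, npy):
        if not f.exists():
            raise FileNotFoundError(f"Missing prompt file: {f}")

check_prompt_pair("assets/Chinese_prompt.wav")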
This project is primarily intended for academic purposes. For academic datasets requiring evaluation, you can upload them to the voice request queue at link 2 (each clip no longer than 24 seconds). After confirming that the audio you uploaded is free of safety issues, we will upload its latent files to link 1 as soon as possible.
In the coming days, we will also prepare and release latent representations for some common TTS benchmarks.
💻 Usage Examples
Basic usage
Standard command-line usage
# p_w (intelligibility weight), t_w (similarity weight). A noisier prompt typically requires higher p_w and t_w.
python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一邊的桌上,一位讀書人嗤之以鼻道,'佛子三藏,神子燕小魚是什麼樣的人物,李家的那個李子夜如何與他們相提並論?'" --output_dir ./gen
# As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
# will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
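To experiment with these weights in batch, a small wrapper can sweep --t_w over the suggested 2.0~5.0 range (a sketch only; the flags mirror the example commands above and the input text is a placeholder):
# Hypothetical batch wrapper around infer_cli.py; flags are taken from the
# example commands above. Sweeps the similarity weight t_w over 2.0~5.0.
import subprocess

for t_w in (2.0, 3.0, 4.0, 5.0):
    subprocess.run(
        ["python", "tts/infer_cli.py",
         "--input_wav", "assets/English_prompt.wav",
         "--input_text", "A short test sentence.",
         "--output_dir", f"./gen/t_w_{t_w}",
         "--p_w", "2.0", "--t_w", str(t_w)],
        check=True,
    )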
Command-line usage for accented TTS
# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker’s original accent. As p_w increases, it shifts toward standard pronunciation.
# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
# Useful for accented TTS or solving the accent problems in cross-lingual TTS.
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '這是一條有口音的音頻。' --output_dir ./gen --p_w 1.0 --t_w 3.0
python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '這條音頻的發音標準一些了嗎?' --output_dir ./gen --p_w 2.5 --t_w 2.5
Web UI usage
# We also support CPU inference, but it may take about 30 seconds (for 10 inference steps).
python tts/gradio_api.py
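Once the server is running (port 7929, per the Docker section above), the generic gradio_client package can drive it programmatically; the endpoint's input signature is app-specific, so inspect it first rather than assuming parameter names:
# Sketch using the generic gradio_client package (pip install gradio_client).
# The endpoint's inputs are app-specific, so list them before calling predict().
from gradio_client import Client

client = Client("http://127.0.0.1:7929/")
client.view_api()  # prints the available endpoints and their parameters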
📚 Documentation
- Paper: MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
- Project page (audio samples): https://sditdemo.github.io/sditdemo/
- GitHub repository: https://github.com/bytedance/MegaTTS3
- Demo video: Demo Video
- Hugging Face Space: https://huggingface.co/spaces/ByteDance/MegaTTS3
🔧 Technical Details
This repository contains the forced-alignment version of Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis, and its WavVAE is largely based on Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. Compared with the model described in the paper, this repository contains additional models. They not only enhance the stability and voice-cloning capability of the algorithm, but can also be used independently in a broader range of scenarios.
BibTeX citation
@article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Bai, Jionghao and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2502.18924},
  year={2025}
}
@article{ji2024wavtokenizer,
  title={Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling},
  author={Ji, Shengpeng and Jiang, Ziyue and Wang, Wen and Chen, Yifu and Fang, Minghui and Zuo, Jialong and Yang, Qian and Cheng, Xize and Wang, Zehan and Li, Ruiqi and others},
  journal={arXiv preprint arXiv:2408.16532},
  year={2024}
}
📄 License
This project is licensed under the Apache-2.0 License.
🔒 Security
If you discover a potential security issue in this project, or believe you may have found one, please notify the ByteDance security team via our security center or sec@bytedance.com.
Please do not create public issues.