dc-ae-f32c32-sana-1.1: an open-source autoencoder that accelerates high-resolution diffusion models while preserving reconstruction accuracy

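dc-ae-f32c32-sana-1.1

Developed by mit-han-lab

DC-AE is a new autoencoder architecture for accelerating high-resolution diffusion models. It addresses the loss of reconstruction accuracy that autoencoders suffer at high compression ratios.

Image generation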

Safetensors

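#high-compression autoencoding #residual learning architecture #decoupled training strategy

Downloads: 18.17k

Released: 1/24/2025

Model overview

Using residual autoencoding and decoupled high-resolution adaptation, the model substantially raises the autoencoder's spatial compression ratio while maintaining reconstruction quality, which in turn speeds up both training and inference of diffusion models.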

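Model features

High compression ratio

Supports spatial compression ratios of up to 128x, compared with the 8x typical of conventional autoencoders.

Residual autoencoding

Learns residuals on top of space-to-channel transformed features, which eases the optimization difficulty at high compression ratios.

Decoupled training strategy

Uses a decoupled three-phase training strategy to reduce the generalization penalty of high-compression autoencoders.

Efficient acceleration

On ImageNet 512x512, delivers a 19.1x inference speedup and a 17.9x training speedup for UViT-H on an H100 GPU, compared with the SD-VAE-f8 autoencoder.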
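The compression figures above translate directly into latent tensor sizes. The sketch below works through that arithmetic. It assumes that in a tag such as `f32c32` the `f` value is the per-side spatial downsampling factor and the `c` value is the number of latent channels; that reading of the naming scheme is not stated on this page. The helper names `latent_shape` and `token_reduction` are hypothetical.

```python
import re


def latent_shape(model_tag, height, width):
    """Return (channels, latent_height, latent_width) for a tag like 'f32c32'.

    Assumption: 'f<N>' is the per-side spatial downsampling factor and
    'c<M>' is the number of latent channels.
    """
    match = re.search(r"f(\d+)c(\d+)", model_tag)
    if match is None:
        raise ValueError(f"no f<N>c<M> pattern in {model_tag!r}")
    factor, channels = int(match.group(1)), int(match.group(2))
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the spatial factor")
    return channels, height // factor, width // factor


def token_reduction(factor_a, factor_b):
    """How many times fewer latent positions factor_b yields than factor_a."""
    return (factor_b // factor_a) ** 2


print(latent_shape("dc-ae-f32c32-sana-1.1", 1024, 1024))  # (32, 32, 32)
print(latent_shape("dc-ae-f64c128-in-1.0", 512, 512))     # (128, 8, 8)
print(token_reduction(8, 64))                             # 64
```

Under this assumption, moving from an 8x to a 64x autoencoder leaves the diffusion model with 64 times fewer latent positions to process, which is where most of the speedup would come from.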

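Model capabilities

High-resolution image compression

Latent-space feature extraction

Image reconstruction

Accelerating diffusion model training

Accelerating diffusion model inference

Use cases

Computer vision

High-resolution image generation

Speeds up training and inference of high-resolution diffusion models

Substantially faster while maintaining generation quality

Image compression and reconstruction

High-quality image reconstruction at high compression ratios

Reconstruction quality remains good even at a 128x compression ratio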

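🚀 Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

This project introduces a new family of autoencoder models for accelerating high-resolution diffusion models. It addresses the insufficient reconstruction accuracy of existing autoencoders at high spatial compression ratios and significantly speeds up training and inference.

[Paper] [GitHub]

This repository contains the models from the paper SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer.

Project page: https://hanlab.mit.edu/projects/sana/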

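Figure 1: We address the reconstruction accuracy drop of high spatial-compression autoencoders.

Figure 2: DC-AE significantly speeds up training and inference without a performance drop.

Figure 3: DC-AE enables efficient text-to-image generation on a laptop.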

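📚 Abstract

We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy at high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without an accuracy drop. For example, on ImageNet 512x512, our DC-AE provides a 19.1x inference speedup and a 17.9x training speedup on an H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder.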
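To make "learning residuals based on the space-to-channel transformed features" more concrete, here is a minimal pure-Python sketch of one possible downsampling block. It is an illustration under stated assumptions, not the DC-AE implementation: the shortcut is taken to be a non-parametric space-to-channel rearrangement followed by channel averaging to match the output width, and `learned_branch` is a hypothetical stand-in for whatever trainable layers the real model uses. Tensors are nested lists indexed as `[channel][row][column]`.

```python
def space_to_channel(x, r):
    """Rearrange [C][H][W] into [C*r*r][H//r][W//r] without losing values."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for ch in range(c):
        for dy in range(r):
            for dx in range(r):
                out.append([[x[ch][i * r + dy][j * r + dx]
                             for j in range(w // r)]
                            for i in range(h // r)])
    return out


def average_channel_groups(x, out_channels):
    """Average consecutive channel groups so the result has out_channels."""
    group = len(x) // out_channels
    if group * out_channels != len(x):
        raise ValueError("channel count must be a multiple of out_channels")
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(x[o * group + g][i][j] for g in range(group)) / group
              for j in range(w)]
             for i in range(h)]
            for o in range(out_channels)]


def residual_downsample(x, learned_branch, r, out_channels):
    """Output = learned branch + non-parametric shortcut (assumed structure)."""
    shortcut = average_channel_groups(space_to_channel(x, r), out_channels)
    learned = learned_branch(x)
    return [[[learned[c][i][j] + shortcut[c][i][j]
              for j in range(len(shortcut[0][0]))]
             for i in range(len(shortcut[0]))]
            for c in range(out_channels)]


def zero_branch(_x):
    """Hypothetical untrained branch that contributes nothing yet."""
    return [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]


x = [[[float(4 * i + j) for j in range(4)] for i in range(4)]]  # 1 x 4 x 4
out = residual_downsample(x, zero_branch, r=2, out_channels=2)
print(out[0])  # [[0.5, 2.5], [8.5, 10.5]]
print(out[1])  # [[4.5, 6.5], [12.5, 14.5]]
```

Even with a branch that outputs zeros, the block already passes a downsampled summary of the input through the shortcut, so the trainable layers only need to learn a correction on top of it. That is one plausible reading of why the residual design eases optimization at high compression ratios.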

💻 Usage examples

Basic usage

    # build DC-AE models
    # full DC-AE model list: https://huggingface.co/collections/mit-han-lab/dc-ae-670085b9400ad7197bb1009b
    from efficientvit.ae_model_zoo import DCAE_HF

    dc_ae = DCAE_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0")

    # encode
    from PIL import Image
    import torch
    import torchvision.transforms as transforms
    from torchvision.utils import save_image
    from efficientvit.apps.utils.image import DMCrop

    device = torch.device("cuda")
    dc_ae = dc_ae.to(device).eval()

    transform = transforms.Compose([
        DMCrop(512),  # resolution
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    image = Image.open("assets/fig/girl.png")
    x = transform(image)[None].to(device)
    latent = dc_ae.encode(x)
    print(latent.shape)

    # decode
    y = dc_ae.decode(latent)
    save_image(y * 0.5 + 0.5, "demo_dc_ae.png")
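The round trip above normalizes pixels to [-1, 1] with mean 0.5 and standard deviation 0.5, then undoes it with `y * 0.5 + 0.5` before saving. One simple way to put a number on reconstruction quality after that step is PSNR. The sketch below is a pure-Python illustration on flat lists of pixel values; the helpers `denormalize` and `psnr` are hypothetical and not part of `efficientvit`, and PSNR is used here only as an example metric, not as the metric behind the results quoted on this page.

```python
import math


def denormalize(values):
    """Undo Normalize(mean=0.5, std=0.5): map [-1, 1] to [0, 1], clamped."""
    return [min(1.0, max(0.0, v * 0.5 + 0.5)) for v in values]


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2
              for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy values standing in for normalized input and reconstruction pixels.
x = [-1.0, 0.0, 0.5, 1.0]
y = [-0.9, 0.1, 0.4, 0.9]
print(round(psnr(denormalize(x), denormalize(y)), 2))  # about 26.02 dB
```

With real tensors, the inputs could be obtained with something like `x.flatten().tolist()` and `y.flatten().tolist()`, although for full-size images a tensor-based implementation would be far faster.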

Advanced usage

    # build DC-AE-Diffusion models
    # full DC-AE-Diffusion model list: https://huggingface.co/collections/mit-han-lab/dc-ae-diffusion-670dbb8d6b6914cf24c1a49d
    from efficientvit.diffusion_model_zoo import DCAE_Diffusion_HF

    dc_ae_diffusion = DCAE_Diffusion_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0-uvit-h-in-512px-train2000k")

    # denoising on the latent space
    import torch
    import numpy as np
    from torchvision.utils import save_image

    torch.set_grad_enabled(False)
    device = torch.device("cuda")
    dc_ae_diffusion = dc_ae_diffusion.to(device).eval()

    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    eval_generator = torch.Generator(device=device)
    eval_generator.manual_seed(seed)

    prompts = torch.tensor(
        [279, 333, 979, 936, 933, 145, 497, 1, 248, 360, 793, 12, 387, 437, 938, 978], dtype=torch.int, device=device
    )
    num_samples = prompts.shape[0]
    prompts_null = 1000 * torch.ones((num_samples,), dtype=torch.int, device=device)
    latent_samples = dc_ae_diffusion.diffusion_model.generate(prompts, prompts_null, 6.0, eval_generator)
    latent_samples = latent_samples / dc_ae_diffusion.scaling_factor

    # decode
    image_samples = dc_ae_diffusion.autoencoder.decode(latent_samples)
    save_image(image_samples * 0.5 + 0.5, "demo_dc_ae_diffusion.png", nrow=int(np.sqrt(num_samples)))
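The `prompts_null` tensor (filled with 1000) and the `6.0` argument passed to `generate` look like the inputs to classifier-free guidance: a "null" label one past the 1000 ImageNet classes, and a guidance scale. That is an assumption about what `generate` does internally, not something confirmed by this page. Under that assumption, the standard guidance rule can be sketched as follows; `classifier_free_guidance` and `null_labels` are hypothetical helper names.

```python
def classifier_free_guidance(cond, uncond, scale):
    """Blend conditional and unconditional predictions element-wise.

    scale = 1.0 returns the conditional prediction unchanged; larger
    values push the result further away from the unconditional one.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def null_labels(num_samples, num_classes=1000):
    """Unconditional labels: one index past the last real class (assumed)."""
    return [num_classes] * num_samples


# Toy per-element predictions standing in for the model's outputs.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
print(classifier_free_guidance(cond, uncond, 6.0))  # [6.0, 7.0]
print(null_labels(3))                               # [1000, 1000, 1000]
```

In this reading, a higher scale trades sample diversity for closer adherence to the requested class labels.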

📖 Citation

If DC-AE is useful or relevant to your research, please recognize our contributions by citing our paper:

    @article{chen2024deep,
      title={Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models},
      author={Chen, Junyu and Cai, Han and Chen, Junsong and Xie, Enze and Yang, Shang and Tang, Haotian and Li, Muyang and Lu, Yao and Han, Song},
      journal={arXiv preprint arXiv:2410.10733},
      year={2024}
    }

The SANA 1.5 work can be cited as follows:

    @misc{xie2025sana,
      title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
      author={Enze Xie and Junyu Chen and Han Cai and Junsong Chen and Haotian Tang and Yao Lu and Song Han},
      year={2025},
      eprint={2501.18427},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.18427},
    }
