dc-ae-f32c32-sana-1.1: an open-source autoencoder that accelerates high-resolution diffusion models while preserving reconstruction accuracy

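dc-ae-f32c32-sana-1.1

Developed by mit-han-lab

DC-AE is a new autoencoder architecture for accelerating high-resolution diffusion models. It addresses the loss of reconstruction accuracy at high compression ratios.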

Image generation

Safetensors

#high-compression-autoencoding #residual-learning-architecture #decoupled-training-strategy

Downloads: 18.17k

Released: 1/24/2025

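Model overview

Using residual autoencoding and decoupled high-resolution adaptation, the model substantially raises the autoencoder's spatial compression ratio while maintaining reconstruction quality, which can greatly speed up diffusion model training and inference.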

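Model features

High compression ratio

Supports spatial compression ratios of up to 128x, compared with the 8x typical of conventional autoencoders.

Residual autoencoding

Learns residuals on top of space-to-channel transformed features, which eases the optimization difficulty at high compression ratios.

Decoupled training strategy

Uses a decoupled three-phase training strategy to reduce the generalization penalty of high-compression autoencoders.

Efficient acceleration

On ImageNet 512x512, delivers a 19.1x inference speedup and a 17.9x training speedup for UViT-H on an H100 GPU, relative to the SD-VAE-f8 autoencoder.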
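
To make the compression ratios above concrete, the sketch below computes the latent shape an autoencoder would produce for a square image, and how many spatial positions a diffusion model would then have to process. It assumes that a name of the form `f<F>c<C>` denotes a per-side downsampling factor F and C latent channels, and that an f8 baseline uses 4 latent channels. `latent_shape` and `spatial_positions` are hypothetical helpers, not part of the DC-AE code base.

```python
def latent_shape(resolution: int, f: int, c: int) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent for a square image.

    Assumes `f` is the per-side spatial downsampling factor and `c` the
    number of latent channels (an assumption about the f<F>c<C> naming).
    """
    if resolution % f != 0:
        raise ValueError(f"resolution {resolution} is not divisible by f={f}")
    side = resolution // f
    return (c, side, side)


def spatial_positions(resolution: int, f: int) -> int:
    """Number of spatial latent positions for a square image."""
    side = resolution // f
    return side * side


if __name__ == "__main__":
    # (name, assumed downsampling factor, assumed latent channels)
    configs = [("f8c4", 8, 4), ("f32c32", 32, 32), ("f64c128", 64, 128)]
    for name, f, c in configs:
        print(name, latent_shape(512, f, c), spatial_positions(512, f))
```

Under these assumptions, a 512x512 image yields 4096 spatial positions at f8 but only 256 at f32 and 64 at f64, which is the intuition behind the diffusion-side savings.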

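Model capabilities

High-resolution image compression

Latent-space feature extraction

Image reconstruction

Accelerating diffusion model training

Accelerating diffusion model inference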

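Use cases

Computer vision

High-resolution image generation

Accelerates training and inference of high-resolution diffusion models

Substantially faster while maintaining generation quality

Image compression and reconstruction

High-quality image reconstruction at high compression ratios

Reconstruction quality remains good even at a 128x compression ratio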

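🚀 Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

This project introduces a new family of autoencoder models for accelerating high-resolution diffusion models. It addresses the insufficient reconstruction accuracy of existing autoencoders at high spatial compression ratios and substantially speeds up training and inference.

[Paper] [GitHub]

This repository contains the model from the paper SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer.

Project page: https://hanlab.mit.edu/projects/sana/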

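Figure 1: We address the drop in reconstruction accuracy of high spatial-compression autoencoders.

Figure 2: DC-AE significantly speeds up training and inference without degrading performance.

Figure 3: DC-AE enables efficient text-to-image generation on a laptop.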

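📚 Abstract

We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy at high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features, to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides a 19.1x inference speedup and a 17.9x training speedup on an H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder.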
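
The residual autoencoding idea in the abstract can be illustrated with a small NumPy sketch. It shows a non-parametric space-to-channel shortcut for a 2x downsampling block: each 2x2 spatial patch is folded into channels, groups of channels are then averaged to reach the target channel count, and a learned branch would only need to model the residual on top of that shortcut. The function names, the channel-averaging step, and the zero-valued `learned_branch` placeholder are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def space_to_channel(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold each r x r spatial patch into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # reorder to (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)


def channel_average(x: np.ndarray, out_channels: int) -> np.ndarray:
    """Average consecutive channel groups so the result has out_channels channels."""
    c, h, w = x.shape
    assert c % out_channels == 0, "channel count must be divisible by out_channels"
    return x.reshape(out_channels, c // out_channels, h, w).mean(axis=1)


def learned_branch(x: np.ndarray, out_channels: int, r: int = 2) -> np.ndarray:
    """Hypothetical stand-in for the trainable layers; returns zeros here."""
    _, h, w = x.shape
    return np.zeros((out_channels, h // r, w // r), dtype=x.dtype)


def residual_downsample(x: np.ndarray, out_channels: int, r: int = 2) -> np.ndarray:
    """Output = learned branch + non-parametric space-to-channel shortcut."""
    shortcut = channel_average(space_to_channel(x, r), out_channels)
    return learned_branch(x, out_channels, r) + shortcut


if __name__ == "__main__":
    x = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
    y = residual_downsample(x, out_channels=8)
    print(y.shape)  # (8, 4, 4)
```

Because the shortcut already carries all input information into the lower-resolution tensor, the trainable layers in this sketch start from a reasonable output rather than from scratch, which is one way to read the "alleviate the optimization difficulty" claim.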

💻 Usage examples

Basic usage

```python
# build DC-AE models
# full DC-AE model list: https://huggingface.co/collections/mit-han-lab/dc-ae-670085b9400ad7197bb1009b
from efficientvit.ae_model_zoo import DCAE_HF

dc_ae = DCAE_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0")

# encode
from PIL import Image
import torch
import torchvision.transforms as transforms
from torchvision.utils import save_image
from efficientvit.apps.utils.image import DMCrop

device = torch.device("cuda")
dc_ae = dc_ae.to(device).eval()

transform = transforms.Compose([
    DMCrop(512),  # resolution
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # map [0, 1] to [-1, 1]
])
image = Image.open("assets/fig/girl.png")
x = transform(image)[None].to(device)  # add a batch dimension
latent = dc_ae.encode(x)
print(latent.shape)

# decode
y = dc_ae.decode(latent)
save_image(y * 0.5 + 0.5, "demo_dc_ae.png")  # undo the normalization before saving
```
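
The preprocessing in the snippet above normalizes pixel values with mean 0.5 and standard deviation 0.5, and the final `y * 0.5 + 0.5` undoes that before saving. A minimal pure-Python sketch of the round trip follows; `normalize` and `denormalize` are hypothetical helper names used only for illustration.

```python
def normalize(p: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Map a pixel value in [0, 1] to [-1, 1], as a per-channel Normalize(0.5, 0.5) does."""
    return (p - mean) / std


def denormalize(v: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Inverse mapping, matching `y * 0.5 + 0.5` in the decode step."""
    return v * std + mean


if __name__ == "__main__":
    for p in (0.0, 0.25, 1.0):
        v = normalize(p)
        print(p, "->", v, "->", denormalize(v))
```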

Advanced usage

```python
# build DC-AE-Diffusion models
# full DC-AE-Diffusion model list: https://huggingface.co/collections/mit-han-lab/dc-ae-diffusion-670dbb8d6b6914cf24c1a49d
from efficientvit.diffusion_model_zoo import DCAE_Diffusion_HF

dc_ae_diffusion = DCAE_Diffusion_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0-uvit-h-in-512px-train2000k")

# denoising on the latent space
import torch
import numpy as np
from torchvision.utils import save_image

torch.set_grad_enabled(False)
device = torch.device("cuda")
dc_ae_diffusion = dc_ae_diffusion.to(device).eval()

# fix all random seeds for reproducible sampling
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
eval_generator = torch.Generator(device=device)
eval_generator.manual_seed(seed)

prompts = torch.tensor(
    [279, 333, 979, 936, 933, 145, 497, 1, 248, 360, 793, 12, 387, 437, 938, 978], dtype=torch.int, device=device
)
num_samples = prompts.shape[0]
prompts_null = 1000 * torch.ones((num_samples,), dtype=torch.int, device=device)
latent_samples = dc_ae_diffusion.diffusion_model.generate(prompts, prompts_null, 6.0, eval_generator)
latent_samples = latent_samples / dc_ae_diffusion.scaling_factor

# decode
image_samples = dc_ae_diffusion.autoencoder.decode(latent_samples)
save_image(image_samples * 0.5 + 0.5, "demo_dc_ae_diffusion.png", nrow=int(np.sqrt(num_samples)))
```

📖 Citation

If DC-AE is useful or relevant to your research, please recognize our contributions by citing our paper:

```bibtex
@article{chen2024deep,
  title={Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models},
  author={Chen, Junyu and Cai, Han and Chen, Junsong and Xie, Enze and Yang, Shang and Tang, Haotian and Li, Muyang and Lu, Yao and Han, Song},
  journal={arXiv preprint arXiv:2410.10733},
  year={2024}
}
```

The SANA 1.5 work can be cited as follows:

```bibtex
@misc{xie2025sana,
  title={SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer},
  author={Enze Xie and Junyu Chen and Han Cai and Junsong Chen and Haotian Tang and Yao Lu and Song Han},
  year={2025},
  eprint={2501.18427},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2501.18427},
}
```
