dc-ae-f32c32-sana-1.1-diffusers open-source model: accelerating high-resolution diffusion while improving reconstruction quality

Dc Ae F32c32 Sana 1.1 Diffusers

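Developed by mit-han-lab

DC-AE is a new autoencoder architecture for accelerating high-resolution diffusion models. It uses residual autoencoding and decoupled high-resolution adaptation to preserve reconstruction quality at high spatial compression ratios.

Image generation. License: MIT. Tags: #high-compression-autoencoding #residual-learning-optimization #efficient-diffusion-models

Downloads: 1,127

Released: 1/24/2025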

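Model Overview

DC-AE addresses the drop in reconstruction accuracy that autoencoders suffer at high spatial compression ratios. It significantly speeds up diffusion model training and inference while maintaining image generation quality.

Model Features

High-compression reconstruction

Supports spatial compression ratios of up to 128x while maintaining high-quality image reconstruction.

Residual autoencoding

Learns residuals on top of space-to-channel transformed features, which eases the optimization difficulty of high-compression autoencoders.

Decoupled high-resolution adaptation

Uses a decoupled three-phase training strategy to reduce the generalization penalty of high-compression autoencoders.

Efficient inference

Delivers a 19.1x inference speedup for the UViT-H model compared with the SD-VAE-f8 autoencoder.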

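Model Capabilities

High-resolution image generation

Image compression and reconstruction

Efficient diffusion model acceleration

Use Cases

Creative content generation

Art creation

Quickly generate high-quality art images

512x512-resolution image generation

Industrial design

Product prototyping

Generate product design concept images from text descriptions

High-fidelity image output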

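🚀 Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

This project introduces the Deep Compression Autoencoder (DC-AE), a new autoencoder model for accelerating high-resolution diffusion models. Existing autoencoders perform well at moderate spatial compression ratios but struggle to maintain satisfactory reconstruction accuracy at high ratios. DC-AE addresses this challenge with a set of key techniques that raise the spatial compression ratio while preserving reconstruction quality. Applied to latent diffusion models, it delivers significant speedups without an accuracy drop.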

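🚀 Quick Start

Project Links

Demo

Figure 1: We address the reconstruction accuracy drop of high spatial-compression autoencoders.

Figure 2: DC-AE significantly speeds up training and inference without a performance drop.

Figure 3: DC-AE enables efficient text-to-image generation on a laptop.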

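✨ Key Features

We present the Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models achieve impressive results at moderate spatial compression ratios (e.g., 8x) but fail to maintain satisfactory reconstruction accuracy at high spatial compression ratios (e.g., 64x). We address this challenge with two key techniques:

Residual Autoencoding: the models learn residuals on top of space-to-channel transformed features, which eases the optimization difficulty of high spatial-compression autoencoders.

Decoupled High-Resolution Adaptation: an efficient decoupled three-phase training strategy that mitigates the generalization penalty of high spatial-compression autoencoders.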
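As a rough illustration of residual autoencoding, the sketch below builds a fixed space-to-channel shortcut and adds it to the output of a learned downsampling branch. It is a minimal pure-Python sketch, not the authors' implementation: `space_to_channel`, `channel_average`, `residual_downsample`, and `zero_branch` are hypothetical names, the channel-averaging step is an assumption about how the shortcut's channel count is matched to the block's output, and `zero_branch` merely stands in for a trained convolutional block.

```python
def space_to_channel(x, p):
    """Rearrange a [C][H][W] nested list into [C*p*p][H//p][W//p].

    Every p x p spatial patch is moved into the channel dimension
    (the same layout as a pixel-unshuffle operation).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    out = []
    for ch in range(c):
        for dy in range(p):
            for dx in range(p):
                out.append([[x[ch][i * p + dy][j * p + dx]
                             for j in range(w // p)]
                            for i in range(h // p)])
    return out


def channel_average(x, out_channels):
    """Average consecutive groups of channels down to out_channels."""
    assert len(x) % out_channels == 0
    g = len(x) // out_channels
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(x[k * g + m][i][j] for m in range(g)) / g
              for j in range(w)] for i in range(h)]
            for k in range(out_channels)]


def residual_downsample(x, learned_branch, p, out_channels):
    """Output = learned_branch(x) + fixed space-to-channel shortcut(x)."""
    shortcut = channel_average(space_to_channel(x, p), out_channels)
    residual = learned_branch(x)  # must match the shortcut's shape
    return [[[s + r for s, r in zip(s_row, r_row)]
             for s_row, r_row in zip(s_ch, r_ch)]
            for s_ch, r_ch in zip(shortcut, residual)]


def zero_branch(x, p=2, out_channels=2):
    """Hypothetical stand-in for a trained block: predicts a zero residual."""
    h, w = len(x[0]) // p, len(x[0][0]) // p
    return [[[0.0] * w for _ in range(h)] for _ in range(out_channels)]


# One 4x4 channel holding the values 0..15, downsampled 2x to 2 channels.
image = [[[float(4 * i + j) for j in range(4)] for i in range(4)]]
out = residual_downsample(image, zero_branch, p=2, out_channels=2)
print(out)
```

With a zero residual the block reduces to the shortcut alone, so the learned branch only needs to model what the fixed rearrangement misses.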

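With these designs, we raise the autoencoder's spatial compression ratio to 128 while maintaining reconstruction quality. Applying DC-AE to latent diffusion models yields significant speedups without an accuracy drop. For example, on ImageNet 512x512, DC-AE provides a 19.1x inference speedup and a 17.9x training speedup on an H100 GPU for UViT-H compared with the widely used SD-VAE-f8 autoencoder, while also achieving a better FID.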
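To make the compression ratios concrete, the sketch below computes the latent tensor shape for a 512x512 input under a few configurations. It assumes that model names of the form `f{F}c{C}` encode the spatial compression factor `F` and the latent channel count `C`, and that SD-VAE-f8 uses 4 latent channels; `latent_shape` is a hypothetical helper, not part of any library.

```python
def latent_shape(height, width, f, c):
    """Return (channels, latent_h, latent_w) for spatial compression factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the compression factor")
    return (c, height // f, width // f)


# (spatial compression factor, latent channels); naming convention is assumed.
configs = {
    "SD-VAE-f8": (8, 4),
    "dc-ae-f32c32": (32, 32),
    "dc-ae-f64c128": (64, 128),
}

for name, (f, c) in configs.items():
    ch, h, w = latent_shape(512, 512, f, c)
    print(f"{name}: latent {ch}x{h}x{w}, {h * w} spatial positions")
```

Fewer spatial positions mean fewer tokens for the diffusion backbone to process, which is the intuition behind the reported speedups.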

💻 Usage Examples

Basic Usage

Deep Compression Autoencoder

    # build DC-AE models
    # full DC-AE model list: https://huggingface.co/collections/mit-han-lab/dc-ae-670085b9400ad7197bb1009b
    import torch
    import torchvision.transforms as transforms
    from PIL import Image
    from torchvision.utils import save_image

    from efficientvit.ae_model_zoo import DCAE_HF
    from efficientvit.apps.utils.image import DMCrop

    dc_ae = DCAE_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0")

    device = torch.device("cuda")
    dc_ae = dc_ae.to(device).eval()

    # preprocess: crop to the target resolution and map pixels from [0, 1] to [-1, 1]
    transform = transforms.Compose([
        DMCrop(512),  # resolution
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    image = Image.open("assets/fig/girl.png")
    x = transform(image)[None].to(device)  # add a batch dimension

    # inference only, so skip gradient tracking
    with torch.no_grad():
        # encode
        latent = dc_ae.encode(x)
        print(latent.shape)

        # decode
        y = dc_ae.decode(latent)

    # map the output from [-1, 1] back to [0, 1] before saving
    save_image(y * 0.5 + 0.5, "demo_dc_ae.png")
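The `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform and the `y * 0.5 + 0.5` expression in the example above are inverses of each other. The sketch below shows the arithmetic on plain floats; `normalize` and `denormalize` are hypothetical helper names used only for illustration.

```python
def normalize(v, mean=0.5, std=0.5):
    """Map a pixel value from [0, 1] to [-1, 1], as Normalize(0.5, 0.5) does."""
    return (v - mean) / std


def denormalize(v):
    """Map a value from [-1, 1] back to [0, 1] before saving."""
    return v * 0.5 + 0.5


for pixel in (0.0, 0.25, 0.5, 1.0):
    encoded = normalize(pixel)
    print(pixel, "->", encoded, "->", denormalize(encoded))
```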

Advanced Usage

Efficient Diffusion Models with DC-AE

    # build DC-AE-Diffusion models
    # full DC-AE-Diffusion model list: https://huggingface.co/collections/mit-han-lab/dc-ae-diffusion-670dbb8d6b6914cf24c1a49d
    import numpy as np
    import torch
    from torchvision.utils import save_image

    from efficientvit.diffusion_model_zoo import DCAE_Diffusion_HF

    dc_ae_diffusion = DCAE_Diffusion_HF.from_pretrained("mit-han-lab/dc-ae-f64c128-in-1.0-uvit-h-in-512px-train2000k")

    # denoising on the latent space
    torch.set_grad_enabled(False)
    device = torch.device("cuda")
    dc_ae_diffusion = dc_ae_diffusion.to(device).eval()

    # fix all seeds so the samples are reproducible
    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    eval_generator = torch.Generator(device=device)
    eval_generator.manual_seed(seed)

    # class labels to condition on (the checkpoint name indicates an ImageNet model)
    prompts = torch.tensor(
        [279, 333, 979, 936, 933, 145, 497, 1, 248, 360, 793, 12, 387, 437, 938, 978], dtype=torch.int, device=device
    )
    num_samples = prompts.shape[0]
    # label 1000 is used as the null (unconditional) prompt
    prompts_null = 1000 * torch.ones((num_samples,), dtype=torch.int, device=device)
    # the third argument (6.0) is presumably the guidance scale
    latent_samples = dc_ae_diffusion.diffusion_model.generate(prompts, prompts_null, 6.0, eval_generator)
    latent_samples = latent_samples / dc_ae_diffusion.scaling_factor

    # decode
    image_samples = dc_ae_diffusion.autoencoder.decode(latent_samples)
    save_image(image_samples * 0.5 + 0.5, "demo_dc_ae_diffusion.png", nrow=int(np.sqrt(num_samples)))
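The example above passes both `prompts` and `prompts_null` together with the value `6.0` to `generate`, which suggests classifier-free guidance. The source does not show how `generate` combines the two predictions internally, so the sketch below only illustrates the standard classifier-free guidance formula on plain lists; `cfg_combine` is a hypothetical name, and reading `6.0` as the guidance scale is an assumption.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy model outputs for one latent with two elements.
cond_pred = [1.0, 2.0]    # prediction given the class label
uncond_pred = [0.5, 1.0]  # prediction given the null label

print(cfg_combine(cond_pred, uncond_pred, 6.0))  # pushed further toward the conditional output
print(cfg_combine(cond_pred, uncond_pred, 1.0))  # scale 1 recovers the conditional prediction
```

A scale above 1 extrapolates past the conditional prediction, which typically trades sample diversity for closer adherence to the class label.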

📄 License

This project is released under the MIT License.

📚 Documentation

If DC-AE is useful or relevant to your research, please acknowledge our contribution by citing our paper:

@article{chen2024deep,
  title={Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models},
  author={Chen, Junyu and Cai, Han and Chen, Junsong and Xie, Enze and Yang, Shang and Tang, Haotian and Li, Muyang and Lu, Yao and Han, Song},
  journal={arXiv preprint arXiv:2410.10733},
  year={2024}
}
