CAD-I Open-Source Text-to-Image Model - Improving Image Generation Quality with Strategic Data Augmentation

CAD I

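Developed by Lucasdegeorge

A text-to-image model trained on a small, curated dataset with strategic data augmentation to improve generation quality

Text-to-Image

Safetensors

License: MIT #Small-dataset augmentation #Detail-optimized generation #Long-prompt support

Downloads: 17

Released: 3/4/2025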

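Model Overview

The model applies data augmentation to a small, curated dataset to achieve high-quality text-to-image generation, rather than relying on the massive datasets that conventional training depends on.

Model Features

Efficient training on a small dataset

A carefully selected small dataset, combined with strategic data augmentation, yields generation quality comparable to training on large datasets.

Joint augmentation training

Text augmentation and image augmentation are used together during training to improve the model's understanding of prompts.

Detail-oriented generation

Specifically optimized for generating images from very long, detailed descriptions, which suits complex scene rendering.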

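Model Capabilities

Text-to-image generation

Complex scene rendering

High-detail image generation

Use Cases

Creative design

Concept art creation

Generates high-quality concept art from detailed text descriptions

Can produce scene concept art that meets professional requirements

Educational applications

Teaching material generation

Automatically generates illustrations to accompany textbook content

Quickly produces visual material that matches the teaching content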

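🚀 How far can we go with ImageNet for Text-to-Image generation?

This project focuses on text-to-image generation. It proposes applying strategic data augmentation to a small, carefully selected dataset in order to improve model performance and the quality of generated images.

🚀 Quick Start

This repository contains the code and models for the paper "How far can we go with ImageNet for Text-to-Image generation?". Text-to-image generation models typically rely on very large datasets and prioritize quantity over quality; the usual remedy for weak results is to collect even more data. We propose a different approach: improving these models through strategic data augmentation of a small, carefully selected dataset. Our study shows that this approach improves the quality of generated images on several benchmarks.

Paper: Arxiv
GitHub repository: https://github.com/lucasdegeorge/T2I-ImageNet
Project website: https://lucasdegeorge.github.io/projects/t2i_imagenet/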

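📦 Installation

First, create a virtual environment with Python (version 3.9 or later), clone the repository, and then run the following command inside the cloned repository:

pip install -e .

More details are available here.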

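📚 Documentation

Pretrained Models

CAD-I Model

The model in this repository is trained with both text augmentation and image augmentation. For the model trained with text augmentation only, see here. To use the pretrained model: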

from pipe import T2IPipeline
pipe = T2IPipeline("Lucasdegeorge/CAD-I").to("cuda")
prompt = "An adorable otter, with its sleek, brown fur and bright, curious eyes, playfully interacts with a vibrant bunch of broccoli... "
image = pipe(prompt, cfg=15)

If you only want to download the model, without the sampling pipeline:

from pipe import CAD
model = CAD.from_pretrained("Lucasdegeorge/CAD-I")

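DiT-I Model

Coming soon...

Prompts

Our models are specifically trained to handle very long and detailed prompts. For best results, use rich, detailed prompts. Short or vague prompts may not make full use of the model's capabilities.

Example prompts: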

A majestic elephant stands tall and proud in the heart of the African savannah, its wrinkled, gray skin glistening under the intense afternoon sun. The elephant's large, flapping ears and long, sweeping trunk create a sense of grace and power as it gently sways, surveying the vast, golden grasslands stretching out before it. In the distance, a herd of zebras grazes peacefully, their stripes blending with the tall, dry grass. The scene is completed by a lone acacia tree silhouetted against the setting sun, casting long, dramatic shadows across the landscape.
A classic film camera rests on a tripod, its worn leather strap and scratched metal body telling the story of countless adventures and captured moments. The camera is positioned in a scenic landscape, with rolling hills, a winding river, and a distant mountain range bathed in the soft, golden light of sunset. In the foreground, a wildflower meadow sways gently in the breeze, while the camera's lens captures the beauty and tranquility of the scene, preserving it for eternity.
A graceful flamingo stands elegantly in the shallow waters of a tranquil lagoon, its vibrant pink feathers reflecting beautifully in the still water. The flamingo's long, slender legs and curved neck create a picture of poise and balance as it dips its beak into the water, searching for food. Behind the flamingo, a lush mangrove forest stretches out, its dense foliage providing a rich habitat for various wildlife. The scene is completed by a clear blue sky and the gentle rustling of leaves in the breeze.
A hearty, overstuffed sandwich sits on a wooden cutting board, its layers of fresh, crisp lettuce, juicy tomatoes, and thinly sliced deli meats peeking out from between two slices of golden-brown bread. The sandwich's tantalizing aroma fills the air, mingling with the scent of freshly baked bread and tangy mustard. In the background, a bustling deli comes to life, with shelves lined with jars of pickles, a gleaming meat slicer, and a chalkboard menu listing the day's specials. The scene is completed by the lively chatter of customers and the clinking of glasses.
A stunning oil painting of a majestic tiger hangs on the wall of a dimly-lit art gallery, its vibrant colors and intricate details drawing the viewer in. The tiger's powerful, muscular body is depicted in mid-stride, its stripes blending seamlessly with the lush jungle foliage surrounding it. The painting captures the tiger's intense, amber eyes and the subtle play of light and shadow on its fur, creating a sense of depth and movement. The background features a dense canopy of trees and a cascading waterfall, adding to the wild, untamed atmosphere of the scene.
A clever magpie perched on a rustic wooden fence post, its iridescent black and white feathers shimmering in the sunlight. The bird tilts its head, holding a shiny trinket in its beak, with a backdrop of a golden wheat field swaying gently in the breeze. A few more curios and found objects are scattered along the fence, hinting at the magpie's treasure trove hidden nearby. A clear blue sky with puffy white clouds completes the scenic countryside atmosphere.
A playful dolphin leaps gracefully out of the sparkling turquoise waters, its sleek, gray body arcing through the air before diving back into the waves with a splash. Nearby, a classic wooden sailboat glides smoothly across the ocean, its white sails billowing in the breeze. The dolphin swims alongside the boat, its joyful antics mirrored by the shimmering sunlight dancing on the water's surface. The scene is completed by a clear blue sky and the distant horizon, where the sea meets the sky.
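
Since short prompts tend to underuse the model, a quick length check before sampling can be handy. The helper below is only an illustrative sketch: the function names and the 40-word threshold are arbitrary assumptions, not values from the paper or the repository.

```python
def word_count(prompt: str) -> int:
    """Count whitespace-separated words in a prompt."""
    return len(prompt.split())


def looks_detailed(prompt: str, min_words: int = 40) -> bool:
    """Return True when the prompt has at least `min_words` words.

    `min_words` is an arbitrary illustrative threshold (assumption).
    """
    return word_count(prompt) >= min_words


if __name__ == "__main__":
    short_prompt = "A beautiful landscape"
    if not looks_detailed(short_prompt):
        print(f"Prompt has only {word_count(short_prompt)} words; consider adding detail.")
```

The example prompts listed above all pass this check, while a three-word prompt such as "A beautiful landscape" does not.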

Using the Pipeline

The T2IPipeline class provides a comprehensive interface for generating images from text prompts. A detailed guide follows:

💻 Basic Usage

from pipe import T2IPipeline
# Initialize the pipeline
pipe = T2IPipeline("Lucasdegeorge/CAD-I").to("cuda")
# Generate an image from a prompt
prompt = "An adorable otter, with its sleek, brown fur and bright, curious eyes, playfully interacts with a vibrant bunch of broccoli... "
image = pipe(prompt, cfg=15)

Advanced Configuration

The pipeline can be initialized with several customization options:

pipe = T2IPipeline(
    model_path="Lucasdegeorge/CAD-I",
    sampler="ddim",                    # Options: "ddim", "ddpm", "dpm", "dpm_2S", "dpm_2M"
    scheduler="sigmoid",               # Options: "sigmoid", "cosine", "linear"
    postprocessing="sd_1_5_vae",
    scheduler_start=-3,
    scheduler_end=3,
    scheduler_tau=1.1,
    device="cuda"
)
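
The scheduler_start, scheduler_end, and scheduler_tau arguments resemble the parameters of a sigmoid noise schedule. As a rough illustration only, and assuming the common formulation in which a sigmoid is evaluated between `start` and `end` with temperature `tau` and then rescaled to run from 1 down to 0, such a schedule could be sketched as follows. The function name `sigmoid_gamma` and the `clip_min` argument are hypothetical; the repository's own scheduler may differ.

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gamma(t, start=-3.0, end=3.0, tau=1.1, clip_min=1e-9):
    """Signal level at normalized time t in [0, 1].

    Returns 1.0 at t=0 (clean sample) and clip_min at t=1 (almost pure noise).
    Assumed formulation, not taken from the repository.
    """
    v_start = _sigmoid(start / tau)
    v_end = _sigmoid(end / tau)
    value = _sigmoid((t * (end - start) + start) / tau)
    gamma = (v_end - value) / (v_end - v_start)
    # Keep the result inside [clip_min, 1] for numerical safety.
    return min(max(gamma, clip_min), 1.0)


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  gamma={sigmoid_gamma(t):.4f}")
```

Under this assumed form, a larger `tau` flattens the curve toward a linear ramp, while a wider `[start, end]` range spends more steps near the two extremes.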

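Generation Parameters

The pipeline's __call__ method accepts a range of parameters that control the generation process:

image = pipe(
    cond="A beautiful landscape",         # Text prompt or list of prompts
    num_samples=4,                        # Number of images to generate
    cfg=15,                               # Classifier-free guidance scale
    guidance_type="constant",             # Guidance type: "constant", "linear"
    guidance_start_step=0,                # Step at which guidance starts
    coherence_value=1.0,                  # Coherence value used for sampling
    uncoherence_value=0.0,                # Uncoherence value used for sampling
    thresholding_type="clamp",            # Thresholding type: "clamp", "dynamic_thresholding", "per_channel_dynamic_thresholding"
    clamp_value=1.0,                      # Clamp value used for thresholding
    thresholding_percentile=0.995         # Percentile used for thresholding
)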
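
For readers unfamiliar with the `cfg` argument: classifier-free guidance usually combines a conditional and an unconditional prediction, pushing the result away from the unconditional one. The following is a generic sketch of that formula on plain Python lists. The helper `apply_cfg` is hypothetical and is not taken from this repository, and returning the unguided conditional prediction before `guidance_start_step` is an assumption made for illustration.

```python
def apply_cfg(cond_pred, uncond_pred, cfg, step=0, guidance_start_step=0):
    """Combine predictions as uncond + cfg * (cond - uncond), element-wise.

    Before `guidance_start_step`, the conditional prediction is returned
    unchanged (assumption for this sketch).
    """
    if step < guidance_start_step:
        return list(cond_pred)
    return [u + cfg * (c - u) for c, u in zip(cond_pred, uncond_pred)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 1.0]
    print(apply_cfg(cond, uncond, cfg=15))
```

With `cfg=1` the formula reduces to the conditional prediction; larger values amplify the difference between the two predictions.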

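Guidance Types

constant: applies the same guidance strength throughout sampling
linear: guidance strength increases linearly from start to end
exponential: guidance strength increases exponentially from start to end

Thresholding Types

clamp: clamps values to a fixed range using clamp_value
dynamic: adjusts the threshold dynamically based on batch statistics
percentile: uses percentile-based thresholding, with the percentile given by thresholding_percentile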
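
The three guidance types can be pictured as schedules for the guidance scale over the sampling steps. The sketch below is illustrative only: the function `guidance_scale`, the ramp from 1 (no guidance) up to `cfg`, and the exact exponential form are assumptions, not the repository's implementation.

```python
def guidance_scale(step, num_steps, cfg, guidance_type="constant"):
    """Guidance scale to use at `step` (0-based) out of `num_steps`."""
    # Fraction of sampling completed, in [0, 1].
    frac = step / max(num_steps - 1, 1)
    if guidance_type == "constant":
        return float(cfg)
    if guidance_type == "linear":
        # Straight-line ramp from 1 to cfg.
        return 1.0 + (cfg - 1.0) * frac
    if guidance_type == "exponential":
        # Geometric ramp from 1 to cfg (requires cfg > 0).
        return float(cfg) ** frac
    raise ValueError(f"unknown guidance_type: {guidance_type!r}")


if __name__ == "__main__":
    for kind in ("constant", "linear", "exponential"):
        scales = [round(guidance_scale(s, 5, 15, kind), 2) for s in range(5)]
        print(kind, scales)
```

In this sketch all three schedules end at the same scale; they differ in how early strong guidance kicks in, with the exponential ramp staying low for longest.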
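
A minimal sketch of the difference between fixed clamping and percentile-based dynamic thresholding, written on a flat list of values for readability. It assumes the commonly used form of dynamic thresholding: take a high percentile s of the absolute values, use at least 1.0, clip to [-s, s], and divide by s. The helper names and the `floor` argument are hypothetical, and the pipeline's own implementation may differ.

```python
def clamp(values, clamp_value=1.0):
    """Clip every value to [-clamp_value, clamp_value]."""
    return [min(max(v, -clamp_value), clamp_value) for v in values]


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 1]) of an ascending list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def dynamic_threshold(values, thresholding_percentile=0.995, floor=1.0):
    """Clip to [-s, s] and rescale by s, where s is a percentile of |values|."""
    s = percentile(sorted(abs(v) for v in values), thresholding_percentile)
    s = max(s, floor)  # never shrink the range below the floor
    return [min(max(v, -s), s) / s for v in values]


if __name__ == "__main__":
    x = [-2.0, 0.5, 4.0]
    print(clamp(x))
    print(dynamic_threshold(x, thresholding_percentile=1.0))
```

Fixed clamping saturates every out-of-range value, whereas the dynamic variant rescales the whole set so that relative differences among large values are partly preserved.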

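Advanced Parameters

For finer control over the generation process, you can also specify the following parameters:

x_N: initial noise tensor
latents: previous latents, used to continue a generation
num_steps: custom number of sampling steps
sampler: custom sampler function
scheduler: custom scheduler function
guidance_start_step: step at which guidance starts
generator: random number generator, for reproducibility
unconfident_prompt: custom unconfident prompt text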

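📄 License

This project is released under the MIT License.

📚 Citation

If you use this repository in your experiments, please cite the following paper: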

@article{degeorge2025farimagenettexttoimagegeneration, 
     title           ={How far can we go with ImageNet for Text-to-Image generation?}, 
     author          ={Lucas Degeorge and Arijit Ghosh and Nicolas Dufour and David Picard and Vicky Kalogeiton}, 
     year            ={2025}, 
     journal         ={arXiv},
 }