CAD-I open-source text-to-image model - improving image generation quality with strategic data augmentation


CAD I

Developed by Lucasdegeorge

A text-to-image model trained on a small, curated dataset with strategic data augmentation to improve generation quality

Text-to-Image

Safetensors

License: MIT #small-dataset-augmentation #detail-optimized-generation #long-prompt-support

Downloads: 17

Released: 3/4/2025

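Model overview

The model uses a data augmentation approach to achieve high-quality text-to-image generation from a small, curated dataset, rather than relying on the massive datasets that training usually requires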

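Model features

Efficient training on a small dataset

A carefully selected small dataset combined with strategic data augmentation yields generation quality comparable to models trained on large datasets

Joint augmentation training

Text augmentation and image augmentation are used together during training to improve the model's understanding of prompts

Optimized detail generation

Specifically tuned to generate images from very long, detailed descriptions, which suits complex scene rendering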

Model capabilities

Text-to-image generation

Complex scene rendering

High-detail image generation

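Use cases

Creative design

Concept art creation

Generates high-quality concept art from detailed text descriptions

Can produce scene concept images that meet professional requirements

Education

Teaching material generation

Automatically generates illustrations to accompany textbook content

Quickly produces visual material matched to the teaching content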

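🚀 How far can we go with ImageNet for Text-to-Image generation?

This project focuses on text-to-image generation and proposes a new approach: strategic data augmentation of a small, carefully curated dataset to improve model performance and generated image quality.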

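🚀 Quick start

This repository contains the code and models for the paper "How far can we go with ImageNet for Text-to-Image generation?". The core idea: text-to-image models usually rely on very large datasets, favoring quantity over quality, and the usual remedy is to collect even more data. We propose instead to apply strategic data augmentation to a small, carefully curated dataset. Our study shows that this approach improves generated image quality on several benchmarks.

Paper: arXiv
GitHub repository: https://github.com/lucasdegeorge/T2I-ImageNet
Project website: https://lucasdegeorge.github.io/projects/t2i_imagenet/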

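📦 Installation

First, create a virtual environment with Python (version 3.9 or later), clone the repository, and then run the following command from the repository root: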

pip install -e .

Further installation details are available in the project repository.

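📚 Documentation

Pretrained models

CAD-I model

The model in this repository is trained with both text augmentation and image augmentation. The model trained with text augmentation only is published separately. To use the pretrained model: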

from pipe import T2IPipeline
pipe = T2IPipeline("Lucasdegeorge/CAD-I").to("cuda")
prompt = "An adorable otter, with its sleek, brown fur and bright, curious eyes, playfully interacts with a vibrant bunch of broccoli... "
image = pipe(prompt, cfg=15)

If you only want to download the model, without the sampling pipeline, you can do the following:

from pipe import CAD
model = CAD.from_pretrained("Lucasdegeorge/CAD-I")

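DiT-I model

Coming soon...

Prompts

Our model was trained specifically to handle very long, detailed prompts. For the best results, use rich, detailed prompts; short or vague prompts may not make full use of the model's capabilities.

Example prompts: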

A majestic elephant stands tall and proud in the heart of the African savannah, its wrinkled, gray skin glistening under the intense afternoon sun. The elephant's large, flapping ears and long, sweeping trunk create a sense of grace and power as it gently sways, surveying the vast, golden grasslands stretching out before it. In the distance, a herd of zebras grazes peacefully, their stripes blending with the tall, dry grass. The scene is completed by a lone acacia tree silhouetted against the setting sun, casting long, dramatic shadows across the landscape.
A classic film camera rests on a tripod, its worn leather strap and scratched metal body telling the story of countless adventures and captured moments. The camera is positioned in a scenic landscape, with rolling hills, a winding river, and a distant mountain range bathed in the soft, golden light of sunset. In the foreground, a wildflower meadow sways gently in the breeze, while the camera's lens captures the beauty and tranquility of the scene, preserving it for eternity.
A graceful flamingo stands elegantly in the shallow waters of a tranquil lagoon, its vibrant pink feathers reflecting beautifully in the still water. The flamingo's long, slender legs and curved neck create a picture of poise and balance as it dips its beak into the water, searching for food. Behind the flamingo, a lush mangrove forest stretches out, its dense foliage providing a rich habitat for various wildlife. The scene is completed by a clear blue sky and the gentle rustling of leaves in the breeze
A hearty, overstuffed sandwich sits on a wooden cutting board, its layers of fresh, crisp lettuce, juicy tomatoes, and thinly sliced deli meats peeking out from between two slices of golden-brown bread. The sandwich's tantalizing aroma fills the air, mingling with the scent of freshly baked bread and tangy mustard. In the background, a bustling deli comes to life, with shelves lined with jars of pickles, a gleaming meat slicer, and a chalkboard menu listing the day's specials. The scene is completed by the lively chatter of customers and the clinking of glasses.
A stunning oil painting of a majestic tiger hangs on the wall of a dimly-lit art gallery, its vibrant colors and intricate details drawing the viewer in. The tiger's powerful, muscular body is depicted in mid-stride, its stripes blending seamlessly with the lush jungle foliage surrounding it. The painting captures the tiger's intense, amber eyes and the subtle play of light and shadow on its fur, creating a sense of depth and movement. The background features a dense canopy of trees and a cascading waterfall, adding to the wild, untamed atmosphere of the scene.
A clever magpie perched on a rustic wooden fence post, its iridescent black and white feathers shimmering in the sunlight. The bird tilts its head, holding a shiny trinket in its beak, with a backdrop of a golden wheat field swaying gently in the breeze. A few more curios and found objects are scattered along the fence, hinting at the magpie's treasure trove hidden nearby. A clear blue sky with puffy white clouds completes the scenic countryside atmosphere.
A playful dolphin leaps gracefully out of the sparkling turquoise waters, its sleek, gray body arcing through the air before diving back into the waves with a splash. Nearby, a classic wooden sailboat glides smoothly across the ocean, its white sails billowing in the breeze. The dolphin swims alongside the boat, its joyful antics mirrored by the shimmering sunlight dancing on the water's surface. The scene is completed by a clear blue sky and the distant horizon, where the sea meets the sky

Using the pipeline

The T2IPipeline class provides a comprehensive interface for generating images from text prompts. Here is a detailed guide to using it:

💻 Basic usage

from pipe import T2IPipeline
# Initialize the pipeline
pipe = T2IPipeline("Lucasdegeorge/CAD-I").to("cuda")
# Generate an image from a prompt
prompt = "An adorable otter, with its sleek, brown fur and bright, curious eyes, playfully interacts with a vibrant bunch of broccoli... "
image = pipe(prompt, cfg=15)

Advanced configuration

The pipeline can be initialized with several custom options:

pipe = T2IPipeline(
    model_path="Lucasdegeorge/CAD-I",
    sampler="ddim",                    # options: "ddim", "ddpm", "dpm", "dpm_2S", "dpm_2M"
    scheduler="sigmoid",               # options: "sigmoid", "cosine", "linear"
    postprocessing="sd_1_5_vae",
    scheduler_start=-3,
    scheduler_end=3,
    scheduler_tau=1.1,
    device="cuda"
)
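
The `scheduler_start`, `scheduler_end`, and `scheduler_tau` values above look like the parameters of a sigmoid noise schedule. As a minimal sketch, assuming they play the same role as the start, end, and temperature of the commonly used sigmoid schedule (this is an assumption, not taken from the CAD-I code, and `sigmoid_schedule` is a hypothetical helper), the signal level at a normalized time `t` in [0, 1] could be computed like this:

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_schedule(t, start=-3.0, end=3.0, tau=1.1, clip_min=1e-9):
    """Illustrative sigmoid noise schedule (hypothetical helper).

    Maps t in [0, 1] to a signal level that decreases from 1 towards 0.
    """
    v_start = sigmoid(start / tau)
    v_end = sigmoid(end / tau)
    value = sigmoid((t * (end - start) + start) / tau)
    # Rescale so the schedule equals 1 at t = 0 and 0 at t = 1
    gamma = (v_end - value) / (v_end - v_start)
    # Keep the result inside [clip_min, 1] for numerical safety
    return min(max(gamma, clip_min), 1.0)


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, sigmoid_schedule(t))
```

With these defaults the sketch starts at 1 (clean signal) at `t = 0`, passes through 0.5 at `t = 0.5`, and decays to `clip_min` at `t = 1`. A larger `tau` brings the curve closer to a straight line.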

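Generation parameters

The pipeline's __call__ method accepts various parameters that control the generation process:

image = pipe(
    cond="A beautiful landscape",         # text prompt or list of prompts
    num_samples=4,                        # number of images to generate
    cfg=15,                               # classifier-free guidance scale
    guidance_type="constant",             # guidance type: "constant", "linear"
    guidance_start_step=0,                # step at which guidance starts
    coherence_value=1.0,                  # coherence value for sampling
    uncoherence_value=0.0,                # uncoherence value for sampling
    thresholding_type="clamp",            # thresholding type: "clamp", "dynamic_thresholding", "per_channel_dynamic_thresholding"
    clamp_value=1.0,                      # clamp value for thresholding
    thresholding_percentile=0.995         # percentile for thresholding
)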
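
The `cfg` parameter is the classifier-free guidance scale. As a rough sketch of what such a scale does, assuming the common formulation that extrapolates from the unconditional prediction toward the conditional one (the exact convention used inside `T2IPipeline` is not stated here, and `apply_cfg` is a hypothetical helper that works on plain lists rather than tensors):

```python
def apply_cfg(uncond, cond, cfg):
    """Combine unconditional and conditional predictions (hypothetical helper).

    Assumed convention: result = uncond + cfg * (cond - uncond).
    """
    return [u + cfg * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # prediction without the prompt
cond = [1.0, 1.5]     # prediction with the prompt

print(apply_cfg(uncond, cond, cfg=1.0))   # same as the conditional prediction
print(apply_cfg(uncond, cond, cfg=15.0))  # pushed much further along (cond - uncond)
```

Under this assumed convention, `cfg=1` reproduces the conditional prediction and larger values move further in the direction the prompt suggests, which can push values well outside their usual range.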

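Guidance types

constant: applies the same guidance strength throughout sampling
linear: guidance strength increases linearly from start to end
exponential: guidance strength increases exponentially from start to end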
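
To make the three guidance types concrete, here is a small sketch of how a per-step guidance scale could be derived from `cfg`. The function `guidance_scale`, the `growth` constant, and the choice to ramp up from zero are all assumptions made for illustration; the pipeline's actual formulas may differ.

```python
import math


def guidance_scale(step, num_steps, cfg, guidance_type="constant", growth=5.0):
    """Return an illustrative guidance scale for one sampling step.

    `growth` (hypothetical) controls how sharply the exponential ramp rises.
    """
    # progress runs from 0 at the first step to 1 at the last step
    progress = step / max(num_steps - 1, 1)
    if guidance_type == "constant":
        return cfg
    if guidance_type == "linear":
        return cfg * progress
    if guidance_type == "exponential":
        return cfg * (math.exp(growth * progress) - 1.0) / (math.exp(growth) - 1.0)
    raise ValueError(f"unknown guidance_type: {guidance_type}")


for kind in ("constant", "linear", "exponential"):
    scales = [round(guidance_scale(s, 5, 15, kind), 3) for s in range(5)]
    print(kind, scales)
```

In this sketch all three types reach `cfg` at the final step; the linear and exponential ramps differ in how much of the guidance is deferred to the later steps.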

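Thresholding types

clamp: clamps values to a fixed range using clamp_value
dynamic: adjusts the threshold dynamically from batch statistics
percentile: uses a percentile-based threshold set by thresholding_percentile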
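
The difference between fixed clamping and percentile-based thresholding can be sketched on a flat list of values. This is an illustration under stated assumptions, not the pipeline's implementation: `clamp`, `percentile`, and `dynamic_threshold` are hypothetical helpers, and the percentile variant follows the widely used recipe of clipping to the chosen percentile of absolute values (never below 1) and then rescaling.

```python
def clamp(values, clamp_value=1.0):
    """Clip every value to the fixed range [-clamp_value, clamp_value]."""
    return [max(-clamp_value, min(clamp_value, v)) for v in values]


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, with q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def dynamic_threshold(values, thresholding_percentile=0.995):
    """Clip to a data-dependent bound s, then rescale back into [-1, 1]."""
    s = percentile(sorted(abs(v) for v in values), thresholding_percentile)
    s = max(s, 1.0)  # never shrink values that are already inside [-1, 1]
    return [max(-s, min(s, v)) / s for v in values]


sample = [0.5, -1.0, 2.0, -4.0]
print(clamp(sample, clamp_value=1.0))
print(dynamic_threshold(sample, thresholding_percentile=0.5))
```

Fixed clamping flattens every out-of-range value to the same bound, while the percentile variant keeps the relative ordering of most values by choosing the bound from the data first.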

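Advanced parameters

For finer control over the generation process, you can also specify the following parameters:

x_N: initial noise tensor
latents: previous latents from which to continue generation
num_steps: custom number of sampling steps
sampler: custom sampler function
scheduler: custom scheduler function
guidance_start_step: step at which guidance starts
generator: random number generator for reproducibility
unconfident_prompt: custom text for the unconfident prompt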
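
The `generator` and `x_N` parameters both relate to reproducibility: a seeded generator always produces the same initial noise, so the same prompt and settings should lead to the same image. The sketch below shows that idea with Python's standard library only. The pipeline itself presumably expects a framework-specific generator and tensor (an assumption), and `make_initial_noise` is a hypothetical stand-in.

```python
import random


def make_initial_noise(shape, generator):
    """Draw a nested list of standard-normal samples with the given shape."""
    if len(shape) == 1:
        return [generator.gauss(0.0, 1.0) for _ in range(shape[0])]
    return [make_initial_noise(shape[1:], generator) for _ in range(shape[0])]


# Two generators with the same seed yield identical starting noise
noise_a = make_initial_noise((2, 3), random.Random(42))
noise_b = make_initial_noise((2, 3), random.Random(42))
# A different seed yields different starting noise
noise_c = make_initial_noise((2, 3), random.Random(7))

print(noise_a == noise_b)
print(noise_a == noise_c)
```

Fixing the seed, or passing a saved noise array directly, is what makes a generation repeatable across runs.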

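📄 License

This project is released under the MIT License.

📚 Citation

If you use this repository in your experiments, please cite the following paper: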

@article{degeorge2025farimagenettexttoimagegeneration, 
     title           ={How far can we go with ImageNet for Text-to-Image generation?}, 
     author          ={Lucas Degeorge and Arijit Ghosh and Nicolas Dufour and David Picard and Vicky Kalogeiton}, 
     year            ={2025}, 
     journal         ={arXiv},
 }