Instruct-CLIP open-source model: automated data refinement for better instruction-guided image editing

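Instruct CLIP

Developed by SherryXTChen

InstructCLIP is a model that uses contrastive learning to refine data automatically, with the aim of improving instruction-guided image editing.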

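Text-to-image · English · License: Apache-2.0 · #instruction-guided image editing #contrastive learning optimization #automatic instruction generation

Downloads: 74

Released: 3/25/2025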

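Model overview

The model uses contrastive learning to refine data automatically and thereby improve instruction-guided image editing. It targets image-to-image translation tasks.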
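
To make "contrastive learning" concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss between features of an image edit and features of its instruction. This page does not state the loss the authors actually use, so the function names, the temperature value, and the toy vectors are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(edit_feats, text_feats, temperature=0.07):
    """Generic CLIP-style loss (an assumption, not the authors' exact objective)."""
    edit_feats = [l2_normalize(v) for v in edit_feats]
    text_feats = [l2_normalize(v) for v in text_feats]
    # Cosine similarity of every edit feature with every instruction feature.
    logits = [
        [sum(a * b for a, b in zip(e, t)) / temperature for t in text_feats]
        for e in edit_feats
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy data: two edits and two instructions in a 2-D feature space.
edit_feats = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]   # instruction i describes edit i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # instructions paired with the wrong edit

loss_matched = symmetric_contrastive_loss(edit_feats, matched_text)
loss_swapped = symmetric_contrastive_loss(edit_feats, swapped_text)
print(loss_matched < loss_swapped)
```

The loss is small when each instruction is most similar to its own edit and large when the pairing is wrong, which is the property a contrastive objective relies on to score how well an instruction matches an image change.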

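Model features

Automatic data refinement

Uses contrastive learning to refine data automatically, improving instruction-guided image editing.

Instruction-guided editing

Accepts natural-language instructions to guide image edits, allowing more precise image transformations.

Efficient image processing

Built on a hybrid LatentDiffusion and DINOv2 architecture (base model SherryXTChen/LatentDiffusionDINOv2).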

Model capabilities

Image editing

Instruction-guided transformation

Image-to-image translation

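Use cases

Image editing

3D sculpture conversion

Converts an ordinary image into a 3D sculpture rendering.

Produces an image in a 3D sculpture style.

Style transfer

Converts an image to a specific style according to the instruction.

Produces an image in the style the instruction describes.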

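🚀 InstructCLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning (CVPR 2025)

This project uses contrastive learning to refine training data automatically, which improves instruction-guided image editing with the stated goal of more accurate and more efficient edits.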
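
One plausible reading of "automated data refinement" is that each training sample's instruction is replaced by the instruction estimated from its input/output image pair. The sketch below illustrates only that reading. `EditSample`, `refine_instructions`, and the stand-in estimator are hypothetical names, and this page does not say whether the authors replace, filter, or otherwise rewrite samples.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EditSample:
    """Hypothetical record for one editing example."""
    input_path: str
    output_path: str
    instruction: str

def refine_instructions(samples, estimate_instruction):
    """Replace each instruction with the estimator's prediction.

    `estimate_instruction(input_path, output_path) -> str` is an assumed
    callable, e.g. a wrapper around the usage example on this page.
    Samples with an empty prediction keep their original instruction.
    """
    refined = []
    for sample in samples:
        predicted = estimate_instruction(sample.input_path, sample.output_path).strip()
        refined.append(replace(sample, instruction=predicted) if predicted else sample)
    return refined

def fake_estimator(input_path, output_path):
    """Stand-in for the real model so the sketch runs without a GPU."""
    return "as a 3 d sculpture"

samples = [EditSample("assets/1_input.jpg", "assets/1_output.jpg", "make it nicer")]
refined = refine_instructions(samples, fake_estimator)
print(refined[0].instruction)
```

Passing the estimator in as a callable keeps the refinement loop independent of how the model is loaded, so the same loop can be exercised with a stub before running it on real data.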

Model information

Property	Details
Base model	SherryXTChen/LatentDiffusionDINOv2
Training datasets	timbrooks/instructpix2pix-clip-filtered, SherryXTChen/InstructCLIP-InstructPix2Pix-Data
Model type	image-to-image
Library	diffusers
Tags	model_hub_mixin, pytorch_model_hub_mixin
License	apache-2.0

🚀 Quick start

This model was pushed to the Hub using the PyTorchModelHubMixin integration. It is based on the paper Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning.

📦 Installation

pip install -r requirements.txt
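
After installing, a quick check that the key packages are importable can save a failed run later. The sketch below is one way to do that. The module list is an assumption inferred from the imports in the usage example and the diffusers library tag; the actual contents of requirements.txt are not shown on this page.

```python
import importlib.util

# Assumed import names; the real requirements.txt may list more or different packages.
REQUIRED_MODULES = ["torch", "torchvision", "PIL", "diffusers"]

def find_missing(modules):
    """Return the top-level modules that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All assumed dependencies are importable.")
```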

💻 Usage example

Basic usage

import argparse

from PIL import Image
import torch
from torchvision import transforms

from model import InstructCLIP
from utils import get_sd_components, normalize

parser = argparse.ArgumentParser(description="Simple example of estimating edit instruction from image pair")
parser.add_argument(
    "--pretrained_instructclip_name_or_path",
    type=str,
    default="SherryXTChen/Instruct-CLIP",
    help=(
        "instructclip pretrained checkpoints"
    ),
)
parser.add_argument(
    "--pretrained_model_name_or_path",
    type=str,
    default="runwayml/stable-diffusion-v1-5",
    help=(
        "sd pretrained checkpoints"
    ),
)
parser.add_argument(
    "--input_path",
    type=str,
    default="assets/1_input.jpg",
    help=(
        "Input image path"
    )
)
parser.add_argument(
    "--output_path",
    type=str,
    default="assets/1_output.jpg",
    help=(
        "Output image path"
    )
)
args = parser.parse_args()
device = "cuda"
    
# load model for edit instruction estimation
model = InstructCLIP.from_pretrained(args.pretrained_instructclip_name_or_path)
model = model.to(device).eval()

# load model to preprocess/encode image to latent space
tokenizer, _, vae, _, _ = get_sd_components(args, device, torch.float32)

# prepare image input
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])
image_list = [args.input_path, args.output_path]
image_list = [
    transform(Image.open(f).convert("RGB").resize((512, 512))).unsqueeze(0).to(device)
    for f in image_list
]

with torch.no_grad():
    image_list = [vae.encode(x).latent_dist.sample() * vae.config.scaling_factor for x in image_list]
    
    # get image feature
    zero_timesteps = torch.zeros(1, dtype=torch.long, device=device)  # timestep 0 for both images
    img_feat = model.get_image_features(
        inp=image_list[0], out=image_list[1], inp_t=zero_timesteps, out_t=zero_timesteps)
    img_feat = normalize(img_feat)
    
    # get edit instruction
    pred_instruct_input_ids = model.text_decoder.infer(img_feat[:1])[0]
    pred_instruct = tokenizer.decode(pred_instruct_input_ids, skip_special_tokens=True)
    print(pred_instruct)  # as a 3 d sculpture
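
The preprocessing in the example above (ToTensor followed by Normalize with mean 0.5 and std 0.5) maps 8-bit pixel values to the range [-1, 1] before the image is passed to the VAE encoder. The arithmetic can be sketched in plain Python as follows; `to_model_range` and `from_model_range` are illustrative helper names, not part of the project.

```python
def to_model_range(pixel_value):
    """Map an 8-bit pixel value to the range produced by ToTensor + Normalize(0.5, 0.5)."""
    scaled = pixel_value / 255.0   # ToTensor: [0, 255] -> [0.0, 1.0]
    return (scaled - 0.5) / 0.5    # Normalize(mean=0.5, std=0.5): -> [-1.0, 1.0]

def from_model_range(value):
    """Invert the mapping and round back to the nearest 8-bit pixel value."""
    return round((value * 0.5 + 0.5) * 255.0)

print(to_model_range(0), to_model_range(255))
```

Black maps to -1.0 and white to 1.0, and the inverse mapping recovers the original 8-bit value.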

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@misc{chen2025instructclipimprovinginstructionguidedimage,
      title={Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning}, 
      author={Sherry X. Chen and Misha Sra and Pradeep Sen},
      year={2025},
      eprint={2503.18406},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18406}, 
}