Instruct-CLIP open-source model: automated data refinement that improves instruction-guided image editing

Instruct CLIP

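Developed by SherryXTChen

InstructCLIP is a model that uses contrastive learning to refine data automatically, with the goal of improving instruction-guided image editing.

Text-to-image | English | License: Apache-2.0 | #instruction-guided image editing #contrastive learning refinement #automatic instruction generation

Downloads: 74

Released: 3/25/2025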

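Model overview

The model is built on contrastive learning. It refines data automatically to improve instruction-guided image editing and is suited to image-to-image tasks.

Model features

Automated data refinement

Uses contrastive learning to refine data automatically, which improves instruction-guided image editing.

Instruction-guided editing

Supports image editing driven by natural-language instructions, giving more precise image transformations.

Efficient image processing

Built on a hybrid LatentDiffusion and DINOv2 architecture for efficient image processing.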
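
The "automated data refinement" feature above can be pictured as a loop that re-estimates the instruction for every image pair in an editing dataset. The sketch below is only an illustration under that assumption: `refine_instructions`, the sample layout, the stand-in `fake_estimator`, and the original instruction text are all hypothetical, and the source does not describe the actual refinement pipeline. In practice the estimator would wrap the instruction-estimation code shown in the usage example.

```python
from typing import Callable, Dict, List

# Hypothetical sample layout: paths to the image pair plus the original instruction.
Sample = Dict[str, str]


def refine_instructions(
    samples: List[Sample],
    estimate_instruction: Callable[[str, str], str],
) -> List[Sample]:
    """Return copies of the samples with the instruction re-estimated from each image pair."""
    refined = []
    for sample in samples:
        new_instruction = estimate_instruction(sample["input"], sample["output"])
        refined.append(
            {
                **sample,
                "original_instruction": sample["instruction"],  # keep the old text for auditing
                "instruction": new_instruction,
            }
        )
    return refined


def fake_estimator(input_path: str, output_path: str) -> str:
    """Stand-in for the real model so the sketch runs without any weights."""
    return "as a 3 d sculpture"


samples = [
    {
        "input": "assets/1_input.jpg",
        "output": "assets/1_output.jpg",
        "instruction": "make it a statue",  # made-up original instruction
    }
]
refined = refine_instructions(samples, fake_estimator)
print(refined[0]["instruction"])
```

Keeping the original instruction next to the refined one makes it easy to compare the two before committing to the new dataset.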

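Model capabilities

Image editing

Instruction-guided transformation

Image-to-image translation

Use cases

Image editing

3D sculpture conversion

Converts an ordinary image into a 3D sculpture rendering.

Result: an image in a 3D sculpture style.

Style transfer

Transforms an image into a specific style according to the instruction.

Result: an image in the style the instruction asks for.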

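🚀 Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning (CVPR 2025)

This project uses contrastive learning to refine training data automatically, which improves instruction-guided image editing and makes edits more accurate and efficient.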
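
To make the contrastive-learning idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in plain Python. It assumes that each image pair yields one "edit" feature vector, each instruction yields one text feature vector, and matching pairs share a batch index. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions; the source does not state the exact objective the model is trained with.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(edit_feats, text_feats, temperature=0.07):
    """Average of edit-to-text and text-to-edit cross-entropy over a batch."""
    edits = [l2_normalize(v) for v in edit_feats]
    texts = [l2_normalize(v) for v in text_feats]
    n = len(edits)
    # sim[i][j]: scaled cosine similarity between edit i and instruction j
    sim = [
        [sum(a * b for a, b in zip(edits[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    edit_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_edit = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (edit_to_text + text_to_edit)


# Toy batch: two edits and two instructions in a 2-D feature space.
features = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(features, features)
swapped = symmetric_contrastive_loss(features, features[::-1])
print(aligned < swapped)
```

The loss is small when every edit feature is closest to its own instruction and grows when the pairing is wrong, which is the signal a contrastive model learns from.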

Model information

Attribute	Details
Base model	SherryXTChen/LatentDiffusionDINOv2
Training datasets	timbrooks/instructpix2pix-clip-filtered, SherryXTChen/InstructCLIP-InstructPix2Pix-Data
Model type	image-to-image
Library	diffusers
Tags	model_hub_mixin, pytorch_model_hub_mixin
License	apache-2.0

🚀 Quick start

This model has been pushed to the Hub using the PytorchModelHubMixin integration. It is based on the paper Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning.

📦 Installation

pip install -r requirements.txt

💻 Usage example

Basic usage

import argparse

from PIL import Image
import torch
from torchvision import transforms

from model import InstructCLIP
from utils import get_sd_components, normalize

parser = argparse.ArgumentParser(description="Simple example of estimating edit instruction from image pair")
parser.add_argument(
    "--pretrained_instructclip_name_or_path",
    type=str,
    default="SherryXTChen/Instruct-CLIP",
    help=(
        "instructclip pretrained checkpoints"
    ),
)
parser.add_argument(
    "--pretrained_model_name_or_path",
    type=str,
    default="runwayml/stable-diffusion-v1-5",
    help=(
        "sd pretrained checkpoints"
    ),
)
parser.add_argument(
    "--input_path",
    type=str,
    default="assets/1_input.jpg",
    help=(
        "Input image path"
    )
)
parser.add_argument(
    "--output_path",
    type=str,
    default="assets/1_output.jpg",
    help=(
        "Output image path"
    )
)
args = parser.parse_args()
# fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
    
# load model for edit instruction estimation
model = InstructCLIP.from_pretrained(args.pretrained_instructclip_name_or_path)
model = model.to(device).eval()

# load model to preprocess/encode image to latent space
tokenizer, _, vae, _, _ = get_sd_components(args, device, torch.float32)

# prepare image input
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])
image_list = [args.input_path, args.output_path]
image_list = [
    # convert to RGB so grayscale or RGBA files still give 3-channel input
    transform(Image.open(f).convert("RGB").resize((512, 512))).unsqueeze(0).to(device)
    for f in image_list
]

with torch.no_grad():
    image_list = [vae.encode(x).latent_dist.sample() * vae.config.scaling_factor for x in image_list]
    
    # get image feature
    zero_timesteps = torch.zeros(1, dtype=torch.long, device=device)
    img_feat = model.get_image_features(
        inp=image_list[0], out=image_list[1], inp_t=zero_timesteps, out_t=zero_timesteps)
    img_feat = normalize(img_feat)
    
    # get edit instruction
    pred_instruct_input_ids = model.text_decoder.infer(img_feat[:1])[0]
    pred_instruct = tokenizer.decode(pred_instruct_input_ids, skip_special_tokens=True)
    print(pred_instruct)  # as a 3 d sculpture
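
The preprocessing step in the example above relies on two transforms: `ToTensor()` scales 8-bit pixel values from [0, 255] to [0, 1], and `Normalize(mean=[0.5], std=[0.5])` then maps them to [-1, 1] via `(x - mean) / std`. The short sketch below reproduces that arithmetic for a single pixel value in plain Python; the helper names `to_model_range` and `from_model_range` are illustrative and are not part of the repository.

```python
def to_model_range(pixel, mean=0.5, std=0.5):
    """Map an 8-bit pixel value to the [-1, 1] range used in the example."""
    scaled = pixel / 255.0        # scaling that ToTensor applies to uint8 images
    return (scaled - mean) / std  # per-channel arithmetic of Normalize(mean, std)


def from_model_range(value, mean=0.5, std=0.5):
    """Invert the mapping, e.g. to turn a decoded tensor value back into a pixel."""
    return round((value * std + mean) * 255.0)


for pixel in (0, 128, 255):
    value = to_model_range(pixel)
    print(pixel, value, from_model_range(value))
```

Black (0) maps to -1.0 and white (255) maps to 1.0, so any image fed to the encoder should stay inside that range after preprocessing.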

📄 License

This project is released under the apache-2.0 license.

📚 Citation

@misc{chen2025instructclipimprovinginstructionguidedimage,
      title={Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning}, 
      author={Sherry X. Chen and Misha Sra and Pradeep Sen},
      year={2025},
      eprint={2503.18406},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18406}, 
}