LongCLIP-GmP-ViT-L-14 open-source model: long-text input and improved accuracy

Longclip GmP ViT L 14

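Developed by zer0int

A CLIP model fine-tuned from BeichenZhang/LongCLIP-L. It accepts long text input (248 tokens) and uses Geometric Parametrization (GmP) to improve performance.

Text-to-image

Transformers

#long-text-CLIP #248-token-support #image-text-matching

Downloads: 4,859

Released: 6/15/2024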

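Model overview

An improved CLIP model that goes beyond the usual 77-token limit and is tuned for long-text understanding. It can serve as the text encoder for generative models such as SDXL and Stable Diffusion.

Model features

Long-text support

Accepts 248 input tokens (standard CLIP accepts only 77), which improves its handling of long text descriptions.

Geometric Parametrization (GmP)

Decomposes each weight into a magnitude and a direction, which preserves the geometry of the pre-trained weights and makes fine-tuning more stable.

Label-smoothing loss

Uses a custom loss function that is particularly suited to small-batch or narrow-domain fine-tuning.

Generative-model compatibility

Can replace the text encoder of generative models such as Stable Diffusion and Flux.1.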

Model capabilities

Long-text image matching

Text encoding for generative models

Cross-modal retrieval

Zero-shot classification

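Use cases

AI-generated content

Enhanced text encoding for SDXL

Used as the text encoder for Stable Diffusion XL, it supports longer, more detailed prompts.

In the cat-photo example on this page, the cosine similarity for the correct caption is about 29% higher with 248-token input than with the 77-token truncated variant.

Cross-modal retrieval

E-commerce product search

Matches images to detailed product descriptions.

Reference accuracy of the fine-tuned model on ImageNet/ObjectNet: 0.89.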

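🚀 Long-CLIP fine-tune

This project is a fine-tune of Long-CLIP; the original model is BeichenZhang/LongCLIP-L. The fine-tune aims to improve accuracy on image-text tasks.

✨ Key features

Dataset: fine-tuned on SPRIGHT-T2I/spright_coco.
Accuracy: the fine-tuned model reaches 0.89 accuracy on ImageNet/ObjectNet, compared with about 0.81 for the original model.
Custom loss: uses a custom loss function with label smoothing, which performs well across dataset sizes.
Geometric Parametrization: applies the Geometric Parametrization (GmP) method to the model's weight representation.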

📦 Installation

The model card does not give installation steps. See the code in the author's GitHub repository, https://github.com/zer0int/Long-CLIP, for installation and fine-tuning.

💻 Usage examples

Basic usage

The following loads the model with HuggingFace Transformers:

from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

Advanced usage

Handling 77 tokens

# Truncate to 77 tokens: load despite the size mismatch with the
# 77-token text position embedding defined in the library
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)

# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉

Handling 248 tokens (recommended)

import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
dtype = torch.bfloat16  # assumed here; match the dtype of your pipeline

# Extend the text position embeddings from 77 to 248 before loading the weights
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248
clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# `pipe` is an already-loaded generation pipeline (not shown in the original snippet)
pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16

# Resulting cosine similarities for 248 tokens, padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
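
To make the comparison between the two runs concrete, the short sketch below works out the relative gain on the correct caption and the margin over the best wrong-class caption. The scores are copied from the two outputs above; the helper names `relative_gain` and `margin` are illustrative and not part of any library.

```python
# Scores copied from the two runs above.
# Order: photo of a cat, picture of a dog, cat, dog (ground truth: cat photo).
scores_77 = [0.16484, 0.0749, 0.1618, 0.0774]
scores_248 = [0.2128, 0.0978, 0.1957, 0.1133]


def relative_gain(new, old):
    """Fractional change of `new` relative to `old`."""
    return new / old - 1.0


def margin(scores):
    """Correct caption (index 0) minus the higher of the two dog captions."""
    return scores[0] - max(scores[1], scores[3])


gain_correct = relative_gain(scores_248[0], scores_77[0])

print(f"gain on correct caption: {gain_correct:.1%}")
print(f"margin with 77 tokens:  {margin(scores_77):.4f}")
print(f"margin with 248 tokens: {margin(scores_248):.4f}")
```

The gain on the correct caption comes out at roughly 29%, and the margin over the dog captions also widens. This is a single image, so it illustrates the effect rather than measuring it.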

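📚 Detailed documentation

Using Long-CLIP as a text encoder

To use Long-CLIP as the text encoder for Flux.1, SDXL, or Stable Diffusion, get the ComfyUI Long-CLIP nodes from https://github.com/SeaArtLab/ComfyUI-Long-CLIP. If you do not use Comfy, that repository is still a useful starting point for working out the integration and applying it to your own code.

Notes on loading with HuggingFace Transformers

Loading the model with HuggingFace Transformers runs into a mismatch with the 77-token limit defined in the library. There are two ways to handle it:

Option 1 (simple, but worse results): truncate to 77 tokens.
Option 2 (recommended): enable 248-token handling, as in the advanced usage example above.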
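
The practical difference between the two options is how much of a long prompt reaches the text encoder. The sketch below imitates the truncate-or-pad step on dummy token IDs. `fit_to_context` is a hypothetical helper, not a Transformers function; the BOS and EOS IDs are those of the CLIP vocabulary, and padding with the EOS ID is an assumption about the tokenizer configuration.

```python
BOS_ID = 49406  # <|startoftext|> in the CLIP vocabulary
EOS_ID = 49407  # <|endoftext|>; also used as the pad ID here (assumption)


def fit_to_context(token_ids, max_length):
    """Wrap token IDs in BOS/EOS, then truncate or right-pad to max_length.

    Returns the final ID list and the number of prompt tokens that survived.
    """
    body = token_ids[: max_length - 2]  # leave room for BOS and EOS
    ids = [BOS_ID] + body + [EOS_ID]
    ids += [EOS_ID] * (max_length - len(ids))  # pad short prompts
    return ids, len(body)


# A hypothetical 120-token prompt; dummy IDs stand in for real BPE tokens.
prompt = list(range(1000, 1120))

ids_77, kept_77 = fit_to_context(prompt, 77)
ids_248, kept_248 = fit_to_context(prompt, 248)

print(f"77-token context keeps {kept_77} of {len(prompt)} prompt tokens")
print(f"248-token context keeps {kept_248} of {len(prompt)} prompt tokens")
```

With a 77-token context the tail of the prompt is dropped before the encoder sees it, while the 248-token context keeps the whole prompt and pads the remainder.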

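Model updates

Update, August 12, 2024: a new BEST model is available, trained with a custom loss function with label smoothing. It gives a small improvement on diverse, large, high-quality datasets and a larger relative improvement in fine-tuning setups that overfit easily (small batch size, single GPU, narrow datasets such as 'sneakers'). The GmP-Smooth code for fine-tuning the model is available at https://github.com/zer0int/Long-CLIP.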
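
The exact loss used for the BEST model lives in the author's repository and is not reproduced here. As a rough illustration of the idea, the sketch below implements a generic symmetric contrastive loss with label smoothing in plain Python. The function names, the toy logits, and the smoothing value of 0.1 are all assumptions made for this example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def smoothed_cross_entropy(logits, smoothing):
    """Mean cross-entropy over rows, where row i's true match is column i.

    The target puts 1 - smoothing on the match and spreads `smoothing`
    evenly over the other columns.
    """
    n = len(logits)
    off_target = smoothing / (n - 1)
    total = 0.0
    for i, row in enumerate(logits):
        for j, logp in enumerate(log_softmax(row)):
            target = 1.0 - smoothing if j == i else off_target
            total -= target * logp
    return total / n


def contrastive_loss(logits, smoothing=0.1):
    """Average of the image-to-text and text-to-image losses."""
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (smoothed_cross_entropy(logits, smoothing)
                  + smoothed_cross_entropy(transposed, smoothing))


# Toy batch of 3 image-text pairs; matching pairs sit on the diagonal.
logits = [[9.0, 1.0, 0.5],
          [0.8, 8.5, 1.2],
          [0.3, 1.1, 9.2]]

print(f"hard labels:    {contrastive_loss(logits, smoothing=0.0):.4f}")
print(f"smoothed (0.1): {contrastive_loss(logits, smoothing=0.1):.4f}")
```

On confident predictions the smoothed loss is higher than the hard-label loss. That penalty on overconfidence is what makes label smoothing attractive in small, narrow fine-tuning runs that overfit easily.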

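🔧 Technical details

Geometric Parametrization (GmP)

This project applies Geometric Parametrization (GmP) to the model's MLP layers. A standard CLIP MLP layer is a plain linear transformation with a single weight matrix. GmP instead decomposes the weights into a radial component 'r' (the norm of the pre-trained weights) and an angular component 'theta' (the normalized direction), which preserves both the direction and the magnitude of the weight vectors.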

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
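
The decomposition can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's GeometricLinear module: it assumes the norm is taken per output row of the weight matrix and that the direction is re-normalized whenever the effective weight is rebuilt. The helper names are hypothetical, and an all-zero row is not handled.

```python
import math


def decompose(weight):
    """Split each row of a weight matrix into a norm r and a unit direction theta."""
    r = [math.sqrt(sum(w * w for w in row)) for row in weight]
    theta = [[w / r_i for w in row] for row, r_i in zip(weight, r)]
    return r, theta


def compose(r, theta):
    """Rebuild the effective weight, row by row: r_i * theta_i / ||theta_i||."""
    weight = []
    for r_i, row in zip(r, theta):
        norm = math.sqrt(sum(t * t for t in row))
        weight.append([r_i * t / norm for t in row])
    return weight


def linear(x, weight, bias):
    """y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy "pre-trained" layer: 2 output units, 3 input features.
W = [[3.0, 0.0, 4.0],
     [1.0, 2.0, 2.0]]
b = [0.5, -0.5]
x = [1.0, 1.0, 1.0]

r, theta = decompose(W)    # row norms are 5.0 and 3.0
W_gmp = compose(r, theta)  # matches W up to floating-point error

print("r =", r)
print("output with original weights:", linear(x, W, b))
print("output with GmP weights:     ", linear(x, W_gmp, b))
```

Right after conversion the layer computes the same function as before. The difference shows up during fine-tuning, where gradient updates act on magnitude ('r') and direction ('theta') as separate parameters.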

📄 License

The pre-trained CLIP model is provided by OpenAI under the MIT License.

Citation

@article{zhang2024longclip,
        title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
        author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
        journal={arXiv preprint arXiv:2403.15378},
        year={2024}
}