LongCLIP-GmP-ViT-L-14 open-source model: long-text input support with improved performance

Longclip GmP ViT L 14

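Developed by zer0int

A CLIP model fine-tuned from BeichenZhang/LongCLIP-L. It accepts long text input (248 tokens) and uses Geometric Parametrization (GmP) to improve performance.

Text-to-Image

Transformers

#long-text CLIP #248-token support #image-text matching

Downloads: 4,859

Released: 6/15/2024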

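Model overview

An improved CLIP model that lifts the usual 77-token limit and is tuned for long-text understanding. It can serve as the text encoder for generative models such as SDXL and Stable Diffusion.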

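Model features

Long-text support

Accepts 248 input tokens (standard CLIP accepts only 77), which markedly improves understanding of long text descriptions.

Geometric Parametrization (GmP)

Uses weight decomposition to preserve the geometric properties of the pre-trained knowledge and to improve fine-tuning stability.

Label-smoothing loss

Uses a custom loss function that is particularly suited to small-batch or narrow-domain fine-tuning.

Generative-model compatibility

Can replace the text encoder of generative models such as Stable Diffusion and Flux.1.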

Model capabilities

Long-text image matching

Text encoding for generative models

Cross-modal retrieval

Zero-shot classification

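Use cases

AI-generated content

SDXL text-encoding enhancement

Serves as the text encoder for Stable Diffusion XL, supporting more detailed long-text prompts.

With 248-token input, the cosine similarity for the ground-truth caption is about 29% higher than with the 77-token truncated version (0.2128 vs. 0.16484).

Cross-modal retrieval

E-commerce product search

Matches images to detailed product descriptions.

After fine-tuning, ImageNet/ObjectNet accuracy reaches 0.89.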

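🚀 Long-CLIP fine-tuning project

This project is a fine-tune of Long-CLIP; the original model is BeichenZhang/LongCLIP-L. The fine-tune aims to improve the model's performance on image-text tasks.

✨ Key features

Dataset: fine-tuned on the SPRIGHT-T2I/spright_coco dataset to improve generalization.
Performance: the fine-tuned model reaches 0.89 accuracy on ImageNet/ObjectNet, compared with about 0.81 for the original model.
Custom loss: uses a custom loss function with label smoothing, which performs well across datasets of different sizes.
Geometric parametrization: applies Geometric Parametrization (GmP) to change how the model's weights are represented.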

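📦 Installation

The documentation does not give specific installation steps. See the code in the author's GitHub repository, https://github.com/zer0int/Long-CLIP, for installation and fine-tuning.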

💻 Usage examples

Basic usage

The following example loads the model with HuggingFace Transformers:

from transformers import CLIPModel, CLIPProcessor

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

Advanced usage

Using 77 tokens

# Truncate to 77 tokens; ignore_mismatched_sizes=True lets loading proceed despite the size mismatch
model = CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)

# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
# tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉

Using 248 tokens (recommended)

import torch
from transformers import CLIPConfig, CLIPModel, CLIPProcessor

dtype = torch.bfloat16  # assumed value; matches the dtype set on the text encoder below
model_id = "zer0int/LongCLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248
clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

# `pipe` is assumed to be an already-constructed generation pipeline (not shown in the source)
pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16

# Resulting cosine similarities for 248 tokens, padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
# tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
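
To put the two result tensors side by side, the short sketch below recomputes the relative gain for the ground-truth caption and picks the best-scoring label in each run. The similarity values are copied from the outputs above; the function and variable names are illustrative only.

```python
# Cosine similarities copied from the two runs above (image ground truth: cat photo).
labels = ["photo of a cat", "picture of a dog", "cat", "dog"]
sims_77 = [0.16484, 0.0749, 0.1618, 0.0774]   # truncated to 77 tokens
sims_248 = [0.2128, 0.0978, 0.1957, 0.1133]   # padded to 248 tokens


def best_label(similarities, names):
    """Return the name whose similarity score is highest (zero-shot pick)."""
    best_index = max(range(len(similarities)), key=lambda i: similarities[i])
    return names[best_index]


def relative_gain(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


gain = relative_gain(sims_77[0], sims_248[0])
print(f"best label (77 tokens):  {best_label(sims_77, labels)}")
print(f"best label (248 tokens): {best_label(sims_248, labels)}")
print(f"gain on the ground-truth caption: {gain:.1%}")
```

Both runs rank the correct caption first; the 248-token setup raises its score by roughly 29%.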

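📚 Documentation

Using Long-CLIP as a text encoder

To use Long-CLIP as the text encoder for Flux.1, SDXL, or Stable Diffusion, get the ComfyUI Long-CLIP nodes from https://github.com/SeaArtLab/ComfyUI-Long-CLIP. If you do not use Comfy, the repository can still serve as a starting point for reverse engineering and for applying the approach in your own code.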

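Notes on loading with HuggingFace Transformers

Loading the model with HuggingFace Transformers runs into a mismatch with the 77-token limit defined in the library. There are two ways to handle it:

Option 1 (simple but worse): truncate to 77 tokens.
Option 2 (recommended): handle the full 248 tokens, as in the advanced usage example above.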
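
The practical difference between the two options is how many prompt tokens survive before encoding. The sketch below is a plain-Python illustration, not the Transformers tokenizer: `fit_to_context` is a hypothetical helper, and the start, end, and padding ids are placeholder values chosen for the example.

```python
# Placeholder special-token ids, chosen for illustration only.
START_ID, END_ID, PAD_ID = 1, 2, 0


def fit_to_context(token_ids, context_length):
    """Wrap token ids in start/end markers, then truncate or pad to context_length."""
    room = context_length - 2  # slots left after the start and end markers
    kept = token_ids[:room]
    dropped = len(token_ids) - len(kept)
    sequence = [START_ID] + kept + [END_ID]
    sequence += [PAD_ID] * (context_length - len(sequence))
    return sequence, dropped


# A made-up prompt of 120 tokens (ids 100..219).
prompt = list(range(100, 220))

short_seq, short_dropped = fit_to_context(prompt, 77)
long_seq, long_dropped = fit_to_context(prompt, 248)

print(len(short_seq), short_dropped)  # 77 slots filled, 45 prompt tokens lost
print(len(long_seq), long_dropped)    # 248 slots filled, nothing lost
```

With a 77-slot context the tail of a long prompt is discarded, while the 248-slot context keeps the whole prompt and pads the remainder.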

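Model updates

Update of August 12, 2024: a new BEST model is available, trained with a custom loss function with label smoothing. It gives a small improvement on diverse, large, high-quality datasets and a larger relative improvement in fine-tuning scenarios that are prone to overfitting (small batch sizes, a single GPU, or narrow datasets such as 'sneakers'). The GmP-Smooth code at https://github.com/zer0int/Long-CLIP can be used to fine-tune the model.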
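
Label smoothing replaces the one-hot target with a slightly softened distribution, which is one common way to curb overconfidence on small or narrow datasets. The following is a generic sketch of a label-smoothed cross-entropy over one row of image-text similarity logits. It illustrates the general technique only and is not the GmP-Smooth code from the repository; `smoothing=0.1` and the logits are arbitrary example values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def smoothed_cross_entropy(logits, target_index, smoothing=0.1):
    """Cross-entropy against a target softened by label smoothing.

    The true class gets weight 1 - smoothing; the remaining mass is
    spread evenly over the other classes.
    """
    n = len(logits)
    off_value = smoothing / (n - 1)
    log_probs = log_softmax(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        weight = 1.0 - smoothing if i == target_index else off_value
        loss -= weight * lp
    return loss


# One image scored against four captions; caption 0 is the matching one.
logits = [4.0, 1.0, 0.5, 0.2]
hard = smoothed_cross_entropy(logits, target_index=0, smoothing=0.0)
soft = smoothed_cross_entropy(logits, target_index=0, smoothing=0.1)
print(f"hard-label loss: {hard:.4f}")
print(f"smoothed loss:   {soft:.4f}")
```

For an already confident prediction like this one, the smoothed loss is higher than the hard-label loss, so the model is penalized for pushing the matching logit arbitrarily far above the rest.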

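🔧 Technical details

Geometric Parametrization (GmP)

This project applies Geometric Parametrization (GmP) to the model's MLP layers. A standard CLIP MLP layer uses plain linear transformations, whereas GmP decomposes the weights into a radial component 'r' (the norm of the pre-trained weights) and an angular component 'theta' (the normalized direction), preserving both the direction and the magnitude of the weight vectors.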

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
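
The decomposition shown above can be illustrated without a deep-learning framework. The sketch below assumes the norm is taken per output row of the weight matrix and that no row is all zeros; `decompose` and `recompose` are hypothetical helper names, and this is not the repository's `GeometricLinear` implementation.

```python
import math


def decompose(weight):
    """Split each row of a weight matrix into a norm r and a unit direction theta."""
    radii, directions = [], []
    for row in weight:
        r = math.sqrt(sum(w * w for w in row))
        radii.append(r)
        directions.append([w / r for w in row])
    return radii, directions


def recompose(radii, directions):
    """Rebuild the weight matrix as r * theta / ||theta|| for each row."""
    weight = []
    for r, theta in zip(radii, directions):
        norm = math.sqrt(sum(t * t for t in theta))
        weight.append([r * t / norm for t in theta])
    return weight


# Toy 2x3 "pre-trained" weight matrix.
w = [[3.0, 4.0, 0.0],
     [1.0, 2.0, 2.0]]

r, theta = decompose(w)
print(r)                    # row norms: [5.0, 3.0]
print(recompose(r, theta))  # matches w up to floating-point error
```

Because `recompose` renormalizes theta, rescaling theta leaves the rebuilt weights unchanged; in this sketch only 'r' controls magnitude and only the direction of 'theta' matters.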

📄 License

The pre-trained CLIP model is provided by OpenAI under the MIT License.

Citation

@article{zhang2024longclip,
        title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
        author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
        journal={arXiv preprint arXiv:2403.15378},
        year={2024}
}