CLIP-ViT-L-14-spectrum-icons-20k open-source model - for abstract image and text retrieval tasks

CLIP ViT L 14 Spectrum Icons 20k

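Developed by JianLiao

A vision-language model fine-tuned from CLIP ViT-L/14 and optimized for abstract image-text retrieval tasks

Text-to-Image

TensorBoard

Language: English | License: MIT | #zero-shot image classification #abstract visual retrieval #text-image alignment

Downloads: 1,576

Released: 1/5/2025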

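Model Overview

The model is fine-tuned on 23,000 abstract image-text pairs, which improves text-to-image and image-to-text retrieval performance. It is particularly suited to abstract visual features.

Model Features

Abstract visual feature understanding

Fine-tuning on a dedicated dataset strengthens the model's understanding of abstract icons and symbols.

Efficient retrieval

On bidirectional image-text retrieval, R@1 reaches about 70% and R@5 about 96%.

Domain adaptability

Improves performance in the target domain while preserving the base model's generalization ability.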

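Model Capabilities

Zero-shot image classification

Text-to-image retrieval

Image-to-text retrieval

Abstract visual feature matching

Use Cases

Information retrieval

Icon library search

Retrieve matching icon images from natural-language descriptions.

R@1 accuracy of about 70%.

Content management

Automatic image tagging

Generate descriptive text labels for abstract icons.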

🚀 CLIP-ViT-L-14-spectrum-icons-20k Model Card

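This model is a further fine-tune of a pretrained model. It aims to improve text-to-image and image-to-text retrieval, handles abstract visual features effectively, and improves retrieval-augmented generation (RAG) performance.

🚀 Quick Start

Install the required dependencies (for example, the open_clip_torch, torch, and Pillow packages), then load the fine-tuned model: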

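import torch
from PIL import Image
import open_clip

# create_model_and_transforms returns (model, train transform, eval transform).
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"
)
model.eval()

tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Example: score one image against candidate text descriptions
text_inputs = tokenizer(["a description of the image", "another description of the image"])
# preprocess expects a PIL image, not a file path
image = preprocess(Image.open("/path/to/image.png")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_inputs)
    # L2-normalize so the dot product is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scale by the learned temperature, then softmax over the candidate texts
    logits_per_image = model.logit_scale.exp() * image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1).numpy()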
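For text-to-image retrieval over an icon library, the usual pattern is to embed every image once, embed the query text, and rank the images by cosine similarity. The following is a minimal sketch of that ranking step only, written in plain Python so it runs without the model. The helper names `l2_normalize` and `rank_gallery` and the toy 3-dimensional vectors are hypothetical; real embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def rank_gallery(query_embedding, gallery_embeddings, top_k=3):
    """Return (index, cosine similarity) pairs for the top_k gallery items, best first."""
    q = l2_normalize(query_embedding)
    scored = []
    for idx, emb in enumerate(gallery_embeddings):
        g = l2_normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real CLIP embeddings (hypothetical data).
text_query = [1.0, 0.0, 0.0]
icon_gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
top_matches = rank_gallery(text_query, icon_gallery, top_k=2)
```

In a real library the gallery embeddings would typically be normalized once and stored, so that each query costs a single matrix-vector product.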

✨ Key Features

Direct Use

Zero-shot image classification.
Text-to-image and image-to-text retrieval.
Improved text-image alignment in abstract visual contexts.
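Zero-shot classification with a CLIP-style model reduces to a softmax over scaled image-to-prompt similarities. The sketch below shows only that final step in plain Python. The label prompts, the similarity values, and the default `logit_scale` of 100 are assumptions for illustration; the fine-tuned model's learned scale may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(similarities, labels, logit_scale=100.0):
    """Turn cosine similarities between one image and each label prompt into probabilities."""
    probs = softmax([logit_scale * s for s in similarities])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))

# Hypothetical label prompts and similarities for a single icon image.
labels = ["arrow icon", "gear icon", "heart icon"]
similarities = [0.21, 0.27, 0.19]
predicted, probabilities = zero_shot_classify(similarities, labels)
```

The large scale factor is what turns small differences in cosine similarity into a confident prediction.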

Downstream Use

Fine-tuning for domain-specific image-text retrieval tasks.
Integration into applications that need enhanced semantic search.

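📚 Detailed Documentation

Model Details

Model Description

This is a CLIP ViT-L/14 model fine-tuned from LAION's pretrained laion/CLIP-ViT-L-14-laion2B-s32B-b82K checkpoint. It was fine-tuned on a custom dataset (JianLiao/spectrum-icons) of 23,000 PNG-text caption pairs to improve text-to-image and image-to-text retrieval. Fine-tuning used the OpenCLIP library on NVIDIA GPUs, so that the model handles abstract visual features better and improves RAG performance.

The base model was originally trained on the LAION-2B dataset, using natural-language supervision to align visual and text embeddings. This fine-tuning adapts the model further to a specific domain while preserving its generalization ability.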

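Training Details

Training Data

The model was fine-tuned on 23,000 image-text caption pairs. The dataset contains diverse, abstract visual elements paired with detailed text descriptions, to strengthen the model's handling of abstract queries and retrieval tasks.

Training Procedure

Fine-tuning used the OpenCLIP library on a machine with 6 NVIDIA RTX 3090 GPUs. The key hyperparameters were:

Learning rate: 5e-6, with cosine decay.
Batch size: 64 per GPU, for a global effective batch size of 384.
Epochs: 40.
Precision: mixed precision (amp_bf16) for efficiency.
Data augmentation:
- Color jitter: (0.2, 0.2, 0.1, 0.0), applied with probability 0.7.
- Grayscale probability: 0.2.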
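To make the learning-rate and batch-size settings concrete, here is a minimal sketch of a cosine decay schedule and the effective batch size calculation. It is not OpenCLIP's own scheduler. The `warmup_steps` parameter, the decay floor of zero, and any `total_steps` value passed in are assumptions, since the card does not state warmup or step counts.

```python
import math

BASE_LR = 5e-6       # learning rate from the hyperparameter list
PER_GPU_BATCH = 64   # per-GPU batch size from the hyperparameter list
NUM_GPUS = 6         # GPU count from the hardware description

def global_batch_size(per_gpu_batch, num_gpus):
    """Effective batch size under data parallelism (no gradient accumulation assumed)."""
    return per_gpu_batch * num_gpus

def cosine_lr(step, total_steps, base_lr=BASE_LR, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then cosine decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

With these definitions, `global_batch_size(PER_GPU_BATCH, NUM_GPUS)` reproduces the stated global batch size of 384, and `cosine_lr` starts at the base rate and reaches half of it at the midpoint of training.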

Training used gradient checkpointing and distributed data parallelism (NCCL), with periodic zero-shot evaluation. Validation ran after every epoch.

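Evaluation

Testing Data, Factors, and Metrics

Testing Data

The model was evaluated on a validation set split from the 23,000 image-text pairs. Metrics were computed for both image-to-text and text-to-image retrieval.

Metrics

Recall@K:
- R@1, R@5, and R@10 for image-to-text and text-to-image retrieval.
Mean rank and median rank:
- The mean and median position of the correct match in the retrieved ranking.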
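The metrics above can all be derived from a query-by-candidate similarity matrix. The sketch below is one common way to compute them, not the evaluation code used for this model. It assumes the correct candidate for query i sits at index i and breaks ties optimistically; the function names and the toy matrix are hypothetical.

```python
from statistics import mean, median

def retrieval_ranks(similarity):
    """1-based rank of the correct match for each query.

    similarity[i][j] is the score of query i against candidate j;
    the correct candidate for query i is assumed to be at index i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of candidates scored strictly higher than the true match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks

def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Recall@K (in percent), mean rank, and median rank for one retrieval direction."""
    ranks = retrieval_ranks(similarity)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["mean_rank"] = mean(ranks)
    metrics["median_rank"] = median(ranks)
    return metrics

# Hypothetical 3x3 matrix: rows are images, columns are texts.
toy_similarity = [
    [0.9, 0.1, 0.2],  # correct text ranked 1st
    [0.3, 0.8, 0.1],  # correct text ranked 1st
    [0.7, 0.6, 0.5],  # correct text ranked 3rd
]
image_to_text = retrieval_metrics(toy_similarity, ks=(1, 2))
# Transposing the matrix swaps the roles of queries and candidates.
text_to_image = retrieval_metrics([list(col) for col in zip(*toy_similarity)], ks=(1, 2))
```

Computing both directions from the same matrix is why the two sets of reported numbers are close but not identical.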

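Results

Image-to-text retrieval:
- R@1: about 70.0%
- R@5: about 96.0%
- R@10: about 97.8%
- Mean rank: about 2.24
- Median rank: about 1.0
Text-to-image retrieval:
- R@1: about 70.3%
- R@5: about 96.4%
- R@10: about 98.1%
- Mean rank: about 2.17
- Median rank: about 1.0

These results indicate strong alignment between the visual and text embeddings, with solid performance on both retrieval tasks.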

Acknowledgements

The pretrained base model was developed by LAION and trained on the LAION-2B dataset.

Citation

Citations in BibTeX format:

@inproceedings{cherti2023reproducible,
  title={Reproducible scaling laws for contrastive language-image learning},
  author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2818--2829},
  year={2023}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}