ProLIP-ViT-B-16-DC-1B-12_8B open-source model: linking images and language with large-scale pre-training

Prolip ViT B 16 DC 1B 12 8B

Developed by SanghyukChun

A ViT-B/16 model trained with Probabilistic Language-Image Pre-training (ProLIP) on the DataComp 1B dataset

Zero-shot image classification

Safetensors

License: MIT #Zero-shot image classification #Probabilistic vision-language model #Large-scale pre-training

Downloads: 460

Released: 10/18/2024

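Model overview

This is a vision-language model trained with Probabilistic Language-Image Pre-training (ProLIP). It handles image classification and cross-modal retrieval, and is particularly strong in zero-shot settings.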

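Model features

Probabilistic modeling

Models image and text features as probability distributions, which makes it possible to quantify prediction uncertainty

Large-scale pre-training

Pre-trained on the DataComp 1B dataset, with 12.8 billion samples seen during training

Zero-shot capability

Performs well on new tasks without fine-tuning; supports zero-shot image classification and retrieval

Uncertainty awareness

Outputs uncertainty estimates for both image and text features, which can help make predictions more reliable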
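One way uncertainty estimates can feed into reliability is selective prediction: act only on inputs whose uncertainty is low and defer the rest. The sketch below is a generic illustration in plain Python, not part of ProLIP; the function name `select_confident`, the threshold, and the sample values are all hypothetical.

```python
def select_confident(predictions, uncertainties, max_uncertainty):
    """Split predictions into kept and deferred by an uncertainty threshold.

    A prediction is kept when its uncertainty is at or below the threshold.
    """
    kept, deferred = [], []
    for pred, unc in zip(predictions, uncertainties):
        if unc <= max_uncertainty:
            kept.append(pred)
        else:
            deferred.append(pred)
    return kept, deferred


# Hypothetical labels with hypothetical per-input uncertainty values.
predictions = ["cat", "dog", "car"]
uncertainties = [0.2, 1.5, 0.4]

kept, deferred = select_confident(predictions, uncertainties, max_uncertainty=0.5)
print("kept:", kept)
print("deferred:", deferred)
```

A suitable threshold depends on the data and would normally be chosen on a held-out set.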

Model capabilities

Zero-shot image classification

Cross-modal retrieval

Uncertainty estimation

Multimodal feature extraction

Use cases

Image understanding

Zero-shot image classification

Classifies new images without task-specific training

Reaches 74.6% top-1 accuracy on ImageNet-1k
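Zero-shot classification with an image-text model usually means scoring an image against one text prompt per class and picking the highest-scoring class. The sketch below shows only that final step in plain Python, assuming the per-class logits are already computed. The class names and logit values are hypothetical, and the evaluation protocol behind the reported ImageNet number may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, class_names):
    """Return the highest-scoring class name and all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs


class_names = ["cat", "dog", "car"]  # hypothetical label set
logits = [2.0, 1.0, 0.1]             # hypothetical image-to-text logits

label, probs = classify(logits, class_names)
print(label, [round(p, 3) for p in probs])
```

Softmax does not change which class wins; it only turns the logits into scores that sum to one.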

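Cross-modal retrieval

Image-text retrieval

Retrieves relevant images for a text query, or relevant text for an image

Zero-shot retrieval performance: 59.6%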
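Retrieval over a similarity matrix comes down to ranking candidates per query. The sketch below ranks candidates and computes Recall@K, a common retrieval metric, in plain Python. The scores and gold indices are hypothetical, and the source does not state which metric the 59.6% figure uses.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def recall_at_k(score_matrix, gold, k):
    """Fraction of queries whose gold item appears in the top-k results."""
    hits = sum(1 for row, g in zip(score_matrix, gold) if g in top_k(row, k))
    return hits / len(gold)


# Rows are text queries, columns are candidate images (hypothetical scores).
score_matrix = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
gold = [0, 1, 0]  # index of the correct image for each query

print("Recall@1:", recall_at_k(score_matrix, gold, 1))
print("Recall@2:", recall_at_k(score_matrix, gold, 2))
```

Transposing the matrix gives the image-to-text direction with the same two functions.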

Robustness evaluation

Distribution-shift evaluation

Evaluates model robustness on ImageNet distribution-shift data

Reaches 63.0% accuracy

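🚀 Official implementation of the ViT-B/16 ProLIP model pre-trained on DataComp 1B

This project is the official implementation of Probabilistic Language-Image Pre-training (ProLIP) for a ViT-B/16 model on the DataComp 1B dataset. The pre-trained weights perform well on zero-shot tasks such as image classification and retrieval, and can serve as a starting point for research and applications in this area.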

🚀 Quick start

Environment setup

Make sure the required libraries are installed, then import them:

import requests
from PIL import Image

import torch
from prolip.model import ProLIPHF
from transformers import CLIPProcessor
from prolip.tokenizer import HFTokenizer

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

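Model loading

# Image preprocessing reuses the CLIP ViT-B/16 processor.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
# ProLIP weights for this model card.
model = ProLIPHF.from_pretrained("SanghyukChun/ProLIP-ViT-B-16-DC-1B-12_8B")
# Text tokenizer with a context length of 64 tokens.
tokenizer = HFTokenizer("timm/ViT-B-16-SigLIP", context_length=64, clean="canonicalize")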

Example code

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt", padding=True)
texts = ["A couple of cats laying on top of a pink blanket.", "A man walks through a flooded road during a rainstorm", "photo"]
texts = tokenizer(texts)

outputs = model(image=inputs["pixel_values"], text=texts)

l2_logit = outputs["image_features"]["mean"] @ outputs["text_features"]["mean"].T
i_unc = torch.exp(outputs["image_features"]["std"]).sum(dim=-1)
t_unc = torch.exp(outputs["text_features"]["std"]).sum(dim=-1)
csd_logit = l2_logit - 0.5 * t_unc
csd_logit2 = l2_logit.T - 0.5 * i_unc
print("Mean-only image-to-text logits (by L2 distance):", l2_logit)
print("Uncertainty-aware image-to-text logits (by CSD):", csd_logit)
print("Uncertainty-aware text-to-image logits (by CSD):", csd_logit2.T)
print("Image uncertainty: ", i_unc)
print("Text uncertainty: ", t_unc)
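To make the arithmetic in the sample above easier to follow, here is a minimal sketch in plain Python that reproduces the `l2_logit - 0.5 * t_unc` step on toy lists. The feature values are hypothetical rather than model outputs, and the helper name `csd_style_logits` is invented for illustration. As in the sample, the uncertainty term is the sum of `exp(...)` over the `"std"` output.

```python
import math


def csd_style_logits(image_means, text_means, text_std):
    """Mean dot-product similarity minus half the per-text uncertainty.

    Mirrors `l2_logit - 0.5 * t_unc` from the sample code, using lists
    instead of tensors. Rows are images, columns are texts.
    """
    # Per-text uncertainty: sum of exp(.) over the feature dimensions.
    t_unc = [sum(math.exp(s) for s in row) for row in text_std]
    logits = []
    for img in image_means:
        row = []
        for txt, unc in zip(text_means, t_unc):
            dot = sum(a * b for a, b in zip(img, txt))
            row.append(dot - 0.5 * unc)
        logits.append(row)
    return logits


# Toy 2-D features (hypothetical values).
image_means = [[1.0, 0.0]]
text_means = [[1.0, 0.0], [1.0, 0.0]]
# Both texts share the same mean; the second has a larger uncertainty term.
text_std = [[-2.0, -2.0], [0.0, 0.0]]

logits = csd_style_logits(image_means, text_means, text_std)
print(logits)
```

With identical means, the text with the larger uncertainty term gets the lower logit. The sample code applies the same idea in the text-to-image direction, subtracting the image uncertainty instead.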

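✨ Key features

Pre-trained model: these weights are a ViT-B/16 model pre-trained with Probabilistic Language-Image Pre-training (ProLIP).
Pre-training dataset: DataComp 1B, with 12.8 billion seen samples.
Strong across tasks: performs well on zero-shot image classification, zero-shot ImageNet distribution shifts, zero-shot VTAB tasks, and zero-shot retrieval.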

📚 Documentation

Project overview

Paper: Probabilistic Language-Image Pre-Training
GitHub repository: https://github.com/naver-ai/prolip
More models: related models are available on Hugging Face.

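Performance overview

Task	Result
Zero-shot ImageNet-1k top-1 accuracy	74.6%
Zero-shot ImageNet distribution shifts	63.0%
Zero-shot VTAB performance	63.7%
Zero-shot retrieval performance	59.6%
Average zero-shot performance on 38 tasks	63.3%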

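📄 License

This project is released under the MIT license.

📚 Citation

If you use the code or models from this project, please cite the following papers: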

@inproceedings{chun2025prolip,
    title={Probabilistic Language-Image Pre-Training},
    author={Chun, Sanghyuk and Kim, Wonjae and Park, Song and Yun, Sangdoo},
    year={2025},
    booktitle={International Conference on Learning Representations (ICLR)},
}

@inproceedings{chun2025longprolip,
    title={LongProLIP: A Probabilistic Vision-Language Model with Long Context Text},
    author={Chun, Sanghyuk and Yun, Sangdoo},
    year={2025},
    booktitle={ICLR Workshop on Quantify Uncertainty and Hallucination in Foundation Models},
}