aimv2-large-patch14-224-lit Open-Source Vision Model - A Practical Choice with Strong Multimodal Understanding

Aimv2 Large Patch14 224 Lit

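Developed by Apple

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. It performs strongly on a range of multimodal understanding benchmarks.

Image-to-Text #MultimodalAutoregressive #ZeroShotClassification #OpenVocabularyDetection

Downloads: 222

Release date: 11/20/2024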

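Model Overview

AIMv2 is pre-trained with a multimodal autoregressive objective and delivers strong performance on tasks such as image classification and object detection.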

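Model Features

Multimodal autoregressive pre-training

Pre-trained with an autoregressive objective over both modalities to improve multimodal understanding

Strong benchmark results

Outperforms OpenAI CLIP and SigLIP on most multimodal understanding benchmarks

Strong recognition performance

The 3B version reaches 89.5% accuracy on ImageNet with a frozen trunk

Broad applicability

Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension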

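Model Capabilities

Zero-shot image classification

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension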

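Use Cases

Computer vision

Image classification

Classify and recognize the content of an image

89.5% accuracy on ImageNet (reported for the AIMv2-3B model with a frozen trunk)

Object detection

Detect specific objects in an image

Outperforms DINOv2

Multimodal applications

Image-text matching

Understand the relationship between an image and a text description

Outperforms CLIP and SigLIP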
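
Image-text matching of the kind described above is commonly scored by comparing an image embedding with several text embeddings in a shared space. The following is a minimal sketch of that idea, assuming a CLIP-style dual encoder (an assumption; the page does not describe the model's internals). The function names are hypothetical and the toy vectors are made up for illustration, not real model outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_emb: Sequence[float],
                 text_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the text embedding most similar to the image."""
    scores = [cosine_similarity(image_emb, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-dimensional embeddings (illustrative values only).
captions = ["a dog", "a cat", "a horse"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

idx = best_caption(image_emb, text_embs)
print(captions[idx])  # a cat
```

The caption whose embedding points in nearly the same direction as the image embedding wins, regardless of vector length.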

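🚀 Transformers - Zero-Shot Image Classification Model

This project introduces the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward, and it trains and scales effectively. The models perform well on multimodal understanding benchmarks, open-vocabulary object detection, and referring expression comprehension.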

🚀 Quick Start

Model Information

Property	Details
Library	transformers
License	apple-amlr
Task type	Zero-shot image classification
Tags	vision, mlx, pytorch
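
The model name itself carries two configuration values. Assuming it follows the usual Vision Transformer naming convention (an assumption; the page does not spell this out), `patch14` would mean 14x14-pixel patches and `224` a 224x224-pixel input. Under that reading, the short sketch below works out how many patch tokens the encoder would see per image. The helper name `patch_grid` is hypothetical.

```python
from typing import Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(224, 14)
print(per_side, total)  # 16 256
```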

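Model Highlights

Outperforms OpenAI CLIP and SigLIP on most multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
AIMv2-3B reaches 89.5% recognition accuracy on ImageNet with a frozen trunk.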
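
"Frozen trunk" in the last highlight means the backbone weights stay fixed while only a lightweight head is trained on its features. The sketch below illustrates that general idea with NumPy on toy data. It is a generic linear probe under stated assumptions, not the evaluation protocol behind the 89.5% figure, which this page does not describe. The random projection `W_frozen` is a hypothetical stand-in for a pretrained backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained trunk: fixed weights, never updated.
W_frozen = rng.normal(size=(8, 4))
W_before = W_frozen.copy()


def trunk(x: np.ndarray) -> np.ndarray:
    """Map raw inputs to features using the frozen weights."""
    return np.tanh(x @ W_frozen)


# Toy inputs; labels are chosen so that the frozen features can express them.
x = rng.normal(size=(200, 8))
feats = trunk(x)
y = (feats[:, 0] > 0).astype(float)

# Trainable linear head: the only parameters that receive gradient updates.
w = np.zeros(4)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
    grad = p - y                                # gradient of the logistic loss
    w -= lr * (feats.T @ grad) / len(y)
    b -= lr * grad.mean()

accuracy = ((feats @ w + b > 0) == (y > 0.5)).mean()
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

Because the trunk is never updated, its features can be computed once and reused across every training step of the head.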

[Figure: AIMv2 Overview]

💻 Usage Examples

Basic Usage

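import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Download a sample COCO image and define the candidate captions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["Picture of a dog.", "Picture of a cat.", "Picture of a horse."]

# Load the processor and the model from the Hugging Face Hub.
# trust_remote_code=True allows transformers to run the model code shipped in the repository.
processor = AutoProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224-lit",
)
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224-lit",
    trust_remote_code=True,
)

# Preprocess the image and tokenize the captions into PyTorch tensors.
inputs = processor(
    images=image,
    text=text,
    add_special_tokens=True,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
outputs = model(**inputs)
# Softmax across the captions yields one probability per caption for the image.
probs = outputs.logits_per_image.softmax(dim=-1)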
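
The last step of the snippet above turns one row of image-to-text logits into probabilities. As a minimal sketch of what that softmax does and how the result could be read, the following pure-Python example ranks captions by probability. The logit values are made up for illustration and are not real model outputs; `softmax` and `rank_labels` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits: Sequence[float],
                labels: Sequence[str]) -> List[Tuple[str, float]]:
    """Pair each label with its probability, most likely first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["Picture of a dog.", "Picture of a cat.", "Picture of a horse."]
example_logits = [1.0, 4.0, 0.5]  # made-up values, not real model outputs

ranked = rank_labels(example_logits, labels)
for label, prob in ranked:
    print(f"{label} {prob:.3f}")
```

With the real model, `outputs.logits_per_image[0].tolist()` would take the place of `example_logits`.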

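Advanced Usage

A JAX version is under construction.

📄 License

This project is released under the apple-amlr license.

📚 Documentation

Citation

If you find our work useful, please consider citing our paper: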

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}

Paper link: [AIMv2 Paper] (https://arxiv.org/abs/2411.14402)