aimv2-large-patch14-native: an open-source vision model with strong multimodal understanding performance

Aimv2 Large Patch14 Native

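Developed by Apple

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. It performs strongly on a range of multimodal understanding benchmarks.

Image classification #Multimodal autoregressive pre-training #High-accuracy image feature extraction #Open-vocabulary understanding

Downloads: 788

Released: 11/21/2024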

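Model Overview

AIMv2 is pre-trained with a multimodal autoregressive objective and performs strongly on image feature extraction and multimodal understanding tasks.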

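Model Features

Strong multimodal understanding

Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks

Strong recognition performance

AIMv2-3B reaches 89.5% accuracy on ImageNet with a frozen trunk

Open-vocabulary understanding

Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension

Efficient pre-training method

Uses a simple, straightforward multimodal autoregressive pre-training objective that scales effectively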

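Model Capabilities

Image feature extraction

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension

Large-scale visual representation learning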

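Use Cases

Computer vision

Image classification

Use the pre-trained features for image classification tasks

89.5% accuracy on ImageNet (reported for AIMv2-3B with a frozen trunk)

Object detection

Object detection in open-vocabulary settings

Outperforms DINOv2

Multimodal applications

Vision-language understanding

Joint representation learning for images and text

Outperforms OpenAI CLIP and SigLIP on most multimodal understanding benchmarks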

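🚀 AIMv2 Vision Model Library

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. Training is simple and straightforward and scales effectively. The models perform well on multimodal understanding benchmarks, open-vocabulary object detection, and other tasks.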

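🚀 Quick Start

We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to train and scale effectively. Some AIMv2 highlights include:

Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% accuracy on ImageNet using a frozen trunk.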
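To make "frozen trunk" concrete: in this style of evaluation the backbone's weights stay fixed and only a small classifier on top of its features is trained. The sketch below illustrates that idea with a binary logistic-regression head fitted on fixed feature vectors. The probing head, the toy 2-D features, and the function names `train_linear_probe` and `predict` are all assumptions made for illustration; the source does not describe the head or protocol behind the 89.5% figure.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression head on fixed (frozen) features.

    Only the head parameters (w, b) are updated; the features themselves
    are never modified, which is what "frozen trunk" refers to.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            g = p - y                        # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical stand-ins for features produced by a frozen backbone.
features = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

w, b = train_linear_probe(features, labels)
preds = [predict(w, b, x) for x in features]
print(preds)
```

In practice the feature vectors would come from the model rather than being written by hand, and the head would be trained on a held-out split and a full label set.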

[Figure: AIMv2 overview]

💻 Usage Examples

Basic Usage

PyTorch

import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-native",
)
# trust_remote_code=True runs the model code shipped in the Hub repository.
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-native",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
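The "patch14" in the model name indicates a patch size of 14 pixels, so the number of patch tokens depends on the image size. The sketch below estimates that count. It assumes the image reaches the model at the given height and width and that pixels which do not fill a whole patch are dropped; the actual preprocessing of this checkpoint may resize or crop differently. `patch_grid` is a hypothetical helper, not part of the library.

```python
PATCH_SIZE = 14  # taken from "patch14" in the model name

def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols, total) of non-overlapping patches for an image.

    Assumption: trailing pixels that do not fill a whole patch are dropped.
    """
    if height < patch_size or width < patch_size:
        raise ValueError("image is smaller than one patch")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols

print(patch_grid(224, 224))  # (16, 16, 256)
print(patch_grid(480, 640))  # (34, 45, 1530)
```

Under these assumptions a 480 x 640 image yields roughly six times as many patch tokens as a 224 x 224 one, which is worth keeping in mind for memory and latency.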

JAX

import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-native",
)
# trust_remote_code=True runs the model code shipped in the Hub repository.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-native",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)
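Both examples stop at `outputs`. For image feature extraction, a common next step is to reduce the per-patch features to a single vector per image and compare images by cosine similarity. The sketch below shows that step on plain Python lists. It assumes the model yields one feature vector per patch (the exact output fields depend on the repository's model code), and the helper names `mean_pool` and `cosine_similarity` and the tiny 2-D features are hypothetical.

```python
import math

def mean_pool(patch_features):
    """Average per-patch feature vectors into one image-level vector."""
    n = len(patch_features)
    dim = len(patch_features[0])
    return [sum(row[i] for row in patch_features) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-patch features for two images (2 patches, 2 dims each).
image_a = [[1.0, 0.0], [3.0, 2.0]]
image_b = [[2.0, 1.0], [2.0, 1.0]]

emb_a = mean_pool(image_a)
emb_b = mean_pool(image_b)
similarity = cosine_similarity(emb_a, emb_b)
print(emb_a, emb_b, similarity)
```

Mean pooling is only one possible reduction; other pooling schemes may suit a given downstream task better.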

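📄 License

This project is released under the apple-amlr license.

📚 Documentation

[AIMv2 Paper] [BibTeX]

📚 Citation

If you find our work useful, please consider citing it as follows: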

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}