aimv2-large-patch14-224-distilled: open-source vision model

Aimv2 Large Patch14 224 Distilled

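Developed by Apple

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. It performs strongly on multimodal understanding benchmarks.

Image classification #Multimodal autoregressive pre-training #Open-vocabulary object detection #High-accuracy image recognition

Downloads: 236

Released: 11/4/2024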

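Model overview

AIMv2 is an efficient vision model pre-trained with a multimodal autoregressive objective. It is suited to tasks such as image feature extraction and outperforms comparable models on several benchmarks.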

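Model features

Multimodal pre-training

Pre-trained on multimodal data with an autoregressive objective, which improves the model's understanding ability

High performance

Outperforms CLIP and SigLIP on most multimodal understanding benchmarks, and DINOv2 on open-vocabulary object detection and referring expression comprehension

Efficient scaling

The pre-training method is simple and scales efficiently to larger model sizes

High accuracy

AIMv2-3B reaches 89.5% accuracy on ImageNet with a frozen trunk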

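Model capabilities

Image feature extraction

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension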

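Use cases

Computer vision

Image classification

For high-accuracy image classification tasks

AIMv2-3B reaches 89.5% accuracy on ImageNet

Object detection

Open-vocabulary object detection

Outperforms DINOv2

Multimodal applications

Vision-language understanding

Understanding the relationship between images and text

Performs strongly on multimodal understanding benchmarks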

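🚀 AIMv2 vision models

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. The pre-training recipe is simple and straightforward, so the models train and scale effectively. They perform well on multimodal understanding benchmarks, open-vocabulary object detection, and referring expression comprehension, and show strong recognition performance.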

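🚀 Quick start

Model information

Property	Details
Library	transformers
License	apple-amlr
Evaluation metric	Accuracy
Task type	Image feature extraction
Tags	vision, image-feature-extraction, mlx, pytorch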
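
The checkpoint name appears to encode the patch size (14) and the input resolution (224). As a rough illustration, assuming that naming convention and one token per non-overlapping patch with no extra class token, the sketch below derives the patch grid the encoder would see. The helper `parse_checkpoint_name` is hypothetical and is not part of transformers.

```python
import re


def parse_checkpoint_name(name: str) -> dict:
    """Extract patch size and input resolution from a checkpoint name.

    Assumes names follow the pattern '...patch<P>-<R>...', e.g.
    'aimv2-large-patch14-224-distilled'. This convention is an assumption.
    """
    match = re.search(r"patch(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    patch_size = int(match.group(1))
    image_size = int(match.group(2))
    if image_size % patch_size != 0:
        raise ValueError("image size is not a multiple of the patch size")
    side = image_size // patch_size  # patches along one image edge
    return {
        "patch_size": patch_size,
        "image_size": image_size,
        "grid": (side, side),
        "num_patches": side * side,
    }


info = parse_checkpoint_name("apple/aimv2-large-patch14-224-distilled")
print(info)
```

Under these assumptions, this checkpoint yields a 16 x 16 grid, that is, 256 patch tokens per image.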

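Model introduction

[AIMv2 Paper] [BibTeX]

We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward, and the models train and scale effectively. Some AIMv2 highlights include:

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% accuracy on ImageNet using a frozen trunk.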

💻 Usage examples

Basic usage - PyTorch

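import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor that matches the checkpoint.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224-distilled",
)
# trust_remote_code=True allows custom model code from the repository to run.
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224-distilled",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)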
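
For feature extraction, the per-patch features are often reduced to a single image embedding and compared with cosine similarity. The sketch below illustrates that idea with plain Python lists standing in for tensors, so it runs without the model. It assumes the model returns one feature vector per patch (commonly exposed as `outputs.last_hidden_state` in transformers models, which is an assumption for this checkpoint). Mean pooling is one common choice here, not necessarily the pooling the authors recommend, and `mean_pool` and `cosine_similarity` are hypothetical helper names.

```python
import math


def mean_pool(tokens):
    """Average a list of per-patch feature vectors into one embedding."""
    if not tokens:
        raise ValueError("expected at least one patch token")
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for two images: 3 patch tokens each, feature dimension 2.
# Real features would come from the model output (assumed, see above).
image_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
image_b = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]

emb_a = mean_pool(image_a)
emb_b = mean_pool(image_b)

sim_same = cosine_similarity(emb_a, emb_a)  # identical embeddings
sim_diff = cosine_similarity(emb_a, emb_b)  # orthogonal embeddings
print(sim_same, sim_diff)
```

With real tensors the same two steps would typically be done with batched tensor operations rather than Python loops.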

Basic usage - JAX

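import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor that matches the checkpoint.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224-distilled",
)
# trust_remote_code=True allows custom model code from the repository to run.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224-distilled",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)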

📄 License

This project is released under the apple-amlr license.

📚 Citation

If you find our work useful, please consider citing it as follows:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}