aimv2-large-patch14-336-distilled: an open-source vision model for multimodal understanding

Aimv2 Large Patch14 336 Distilled

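Developed by Apple

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. It performs strongly on multimodal understanding benchmarks.

Image classification #Multimodal autoregressive pre-training #Open-vocabulary visual understanding #High-accuracy image feature extraction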

Downloads: 37

Released: 11/18/2024

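Model overview

AIMv2 is pre-trained with a multimodal autoregressive objective and performs strongly on image feature extraction and multimodal understanding tasks.

Model features

Multimodal autoregressive pre-training

Pre-training with an autoregressive objective improves the model's multimodal understanding.

Strong benchmark performance

Outperforms widely used models such as CLIP and SigLIP on many multimodal understanding benchmarks.

Strong recognition

The 3B version reaches 89.5% accuracy on ImageNet with a frozen backbone.

Multi-framework support

Supports both PyTorch and JAX.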

Model capabilities

Image feature extraction

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension

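Use cases

Computer vision

Image classification

Suited to high-accuracy image classification.

The 3B version reaches 89.5% accuracy on ImageNet with a frozen backbone.

Object detection

Open-vocabulary object detection applications.

Outperforms DINOv2 on open-vocabulary object detection.

Multimodal applications

Vision-language understanding

Tasks that require joint understanding of images and text.

Outperforms widely used models such as CLIP on many multimodal understanding benchmarks.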

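🚀 transformers

This repository provides the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective, for use with the transformers library. AIMv2 performs strongly on many benchmarks.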

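🚀 Quick start

[AIMv2 Paper] [BibTeX]

We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and can be trained and scaled effectively. Key highlights:

Outperforms OAI CLIP and SigLIP on many multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
AIMv2-3B achieves 89.5% recognition performance on ImageNet using a frozen trunk.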
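
A "frozen trunk" result means the backbone weights stay fixed and only a lightweight head is trained on the features it produces. The sketch below illustrates that idea with a binary logistic-regression probe on synthetic 2-D vectors standing in for backbone features. It is a generic illustration only: the toy data, the function names, and the hyperparameters are all assumptions, and it is not the evaluation protocol behind the 89.5% figure, which this page does not describe.

```python
import math
import random


def make_toy_features(seed=0, per_class=40):
    """Synthetic stand-in for frozen backbone features: two Gaussian blobs."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(per_class):
            features.append([rng.gauss(center, 0.3), rng.gauss(center, 0.3)])
            labels.append(label)
    return features, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent.

    Only the probe's weights and bias are updated; the features are
    treated as fixed, which is what "frozen" means here.
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def accuracy(weights, bias, features, labels):
    """Fraction of feature vectors classified correctly."""
    correct = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return correct / len(labels)
```

In practice the toy vectors would be replaced by features computed once from the pre-trained model and cached, so the backbone is never updated during probe training.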

💻 Usage examples

Basic usage

PyTorch

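import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the model weights.
# trust_remote_code=True executes the model code shipped in the repository.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
)
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)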
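
The forward pass above returns per-token features rather than a single image vector. As a rough guide to what to expect, the sketch below works out the token count implied by the model name (336-pixel input, 14-pixel patches) and shows mean pooling over tokens in plain Python. Two assumptions are made here and should be verified against the real output: that the model emits exactly one token per patch with no extra class token, and that the features are exposed as `outputs.last_hidden_state`. The helper names are hypothetical.

```python
def patch_grid(image_size=336, patch_size=14):
    """Return (patches per side, total patches) for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


def mean_pool(tokens):
    """Average a sequence of equal-length token vectors into one vector."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    count = len(tokens)
    # zip(*tokens) iterates over feature dimensions instead of tokens.
    return [sum(column) / count for column in zip(*tokens)]


side, total = patch_grid()
print(side, total)  # 24 576
```

With PyTorch tensors the same pooling would normally be written as a mean over the token dimension, for example `outputs.last_hidden_state.mean(dim=1)`, again assuming that attribute is present.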

JAX

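import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the Flax model weights.
# trust_remote_code=True executes the model code shipped in the repository.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
)
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-336-distilled",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)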
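
Once each image has been reduced to a single feature vector (from either the PyTorch or the JAX example), a common way to use those vectors is to compare them with cosine similarity, for instance to rank a small gallery against a query image. The following is a minimal, framework-free sketch; the function names and the idea of a gallery dictionary are illustrative assumptions, not part of the model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery keys sorted from most to least similar to the query.

    gallery maps a name (for example a file path) to its feature vector.
    """
    return sorted(
        gallery,
        key=lambda name: cosine_similarity(query, gallery[name]),
        reverse=True,
    )
```

Because cosine similarity ignores vector length, it compares the direction of two feature vectors only, which is usually what is wanted when feature magnitudes vary between images.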

📄 License

apple-amlr

📚 Citation

If you find this work useful, please cite it as follows:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}