aimv2-3B-patch14-448 open-source vision model: efficient visual understanding through multimodal pre-training

Aimv2 3B Patch14 448

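Developed by Apple

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. It performs strongly across multiple visual understanding benchmarks.

Image classification #multimodal autoregressive pre-training #high-accuracy image classification #open-vocabulary detection

Downloads: 161

Release date: 10/29/2024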

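Model overview

The AIMv2 vision models are pre-trained with a multimodal autoregressive objective. They offer strong image feature extraction and classification, and outperform comparable models on multiple benchmarks.

Model features

Multimodal autoregressive pre-training

Pre-training with a multimodal autoregressive objective effectively improves model performance.

Outstanding classification performance

Outperforms models such as OpenAI CLIP, SigLIP, and DINOv2 on multiple benchmarks.

Large parameter count

At 3B parameters, the model has strong feature extraction capability.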

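Model capabilities

Image feature extraction

Image classification

Multimodal understanding

Use cases

Computer vision

Image classification

High-accuracy image classification on datasets such as ImageNet.

ImageNet-1k accuracy: 89.5%

Fine-grained classification

Strong performance on fine-grained classification tasks such as stanford-cars.

stanford-cars accuracy: 96.7%

Medical imaging

Pathology image analysis

Classification on medical imaging datasets such as camelyon17.

camelyon17 accuracy: 93.4%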

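🚀 Transformers library

AIMv2 is a family of vision models specialized for image feature extraction and usable through the transformers library. Pre-trained with a multimodal autoregressive objective, AIMv2 performs strongly on many benchmarks.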

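🚀 Quick start

This section gives an overview of the AIMv2 models and shows how to use them.

[AIMv2 Paper] [BibTeX]

We introduce the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and scales effectively. Its main highlights are:

Outperforms OpenAI CLIP and SigLIP on many multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
AIMv2-3B achieves 89.5% recognition accuracy on ImageNet using a frozen trunk.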

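✨ Key features

High recognition performance: achieves strong accuracy on many datasets.
Multimodal understanding: because it is pre-trained with a multimodal autoregressive objective, it is also strong on multimodal understanding tasks.
Easy to use: the model can be loaded through the transformers library.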

📦 Installation

Using this model requires the transformers library. Install it with the following command.

pip install transformers
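
The usage examples on this page also import requests and PIL, and the PyTorch variant asks for PyTorch tensors, so more than transformers has to be importable. As a minimal sketch, the check below lists which import names are missing from the current environment. The helper name `missing_packages` is hypothetical and only for illustration; the list of names is read off the examples' import lines, not from an official requirements file.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names used by the PyTorch example on this page (assumed list).
    missing = missing_packages(["transformers", "PIL", "requests", "torch"])
    print("missing:", missing if missing else "none")
```

Note that PIL is the import name of the Pillow package, so a missing PIL is fixed by installing Pillow.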

💻 Usage examples

Basic usage

PyTorch

import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-3B-patch14-448",
)
model = AutoModel.from_pretrained(
    "apple/aimv2-3B-patch14-448",
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

JAX

import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-3B-patch14-448",
)
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-3B-patch14-448",
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)
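
The examples above stop at `outputs`. For feature extraction, a common next step is to pool per-token features into one vector per image and compare images by cosine similarity. The sketch below shows that arithmetic on plain Python lists so it runs without downloading the model. It assumes, without confirmation from this page, that the model output exposes per-token features (for example a `last_hidden_state` of shape batch x tokens x dim); the helper names and the toy vectors are hypothetical.

```python
import math


def mean_pool(token_features):
    """Average a list of per-token feature vectors into one image-level vector."""
    count = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[i] for tok in token_features) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for two images' token features (2 tokens x 3 dims each).
image_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_b = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

similarity = cosine_similarity(mean_pool(image_a), mean_pool(image_b))
```

With real tensors the same pooling would normally be done by the framework (a mean over the token axis) rather than in Python loops; mean pooling is one reasonable choice here, not necessarily the pooling used in the reported evaluations.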

📚 Documentation

Metrics

Metric	Description
accuracy	Classification accuracy
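
As a reminder of what the accuracy metric measures, here is a minimal sketch of top-1 accuracy: the fraction of samples whose predicted label equals the true label. The function name and the toy labels are hypothetical, and this is not the evaluation code behind the numbers reported on this page.

```python
def top1_accuracy(predictions, labels):
    """Fraction of positions where the predicted label matches the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(1 for pred, true in zip(predictions, labels) if pred == true)
    return correct / len(labels)


# Toy example: three of four predictions are correct.
example_accuracy = top1_accuracy([0, 1, 2, 2], [0, 1, 1, 2])
```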

Model information

Property	Details
Library name	transformers
Model name	aimv2-3B-patch14-448
License	apple-amlr
Pipeline tag	image-feature-extraction
Tags	vision, image-feature-extraction, mlx, pytorch
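
The model name can be read as a patch size of 14 pixels and an input resolution of 448 pixels. That reading follows a common naming convention for vision transformers and is an assumption here, not something this page states. Under it, the number of image patches works out as in the sketch below; the function name is hypothetical, and any extra non-patch tokens the model might use are not counted.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Assumed reading of "patch14-448": 448 px input, 14 px patches.
per_side, total_patches = patch_grid(448, 14)
```

Under this assumption a 448 x 448 image is split into a 32 x 32 grid, that is 1024 patches.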

Datasets and evaluation results

Task	Dataset	Accuracy
Classification	imagenet-1k	89.5%
Classification	inaturalist-18	85.9%
Classification	cifar10	99.5%
Classification	cifar100	94.5%
Classification	food101	97.4%
Classification	dtd	89.0%
Classification	oxford-pets	97.4%
Classification	stanford-cars	96.7%
Classification	camelyon17	93.4%
Classification	patch-camelyon	89.9%
Classification	rxrx1	9.5%
Classification	eurosat	98.9%
Classification	fmow	66.1%
Classification	domainnet-infographic	74.8%

📄 License

This project is provided under the apple-amlr license.

📚 Citation

If you find our work useful, please cite it as follows.

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}