Open-source AIMv2-large-patch14-224 Visual Model - Excellent Performance in Multiple Practical Visual Tasks

AIMv2 Large Patch14 224

Developed by Apple

AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, excelling in various vision tasks.

Release Time: 10/29/2024

Model Overview

AIMv2 employs multimodal autoregressive pretraining, featuring robust image feature extraction capabilities suitable for diverse visual classification tasks.

Model Features

Multimodal Autoregressive Pretraining

Pretrained with a multimodal autoregressive objective.

Outstanding Classification Performance

Reaches 86.6% accuracy on ImageNet-1k and above 95% on several other benchmarks (CIFAR-10, Food-101, Oxford-Pets, Stanford Cars, EuroSAT).

Strong Scalability

The pretraining method is simple and scales effectively.

Model Capabilities

Image Feature Extraction

Image Classification

Multimodal Understanding

Use Cases

Computer Vision

General Image Classification

Classification on general image datasets such as ImageNet-1k.

ImageNet-1k accuracy: 86.6%

Fine-Grained Classification

Fine-grained classification tasks such as Stanford Cars.

Stanford Cars accuracy: 96.3%

Medical Image Analysis

Medical image datasets such as Camelyon17.

Camelyon17 accuracy: 93.7%

🚀 Transformers Library - AIMv2 Model

The AIMv2 family of vision models is pre-trained with a multimodal autoregressive objective and performs strongly on a range of vision tasks.

🚀 Quick Start

The transformers library provides easy-to-use interfaces for the AIMv2 model. You can quickly start using it in both PyTorch and JAX environments.

✨ Features

Multimodal Autoregressive Pre-training: AIMv2 is pre-trained with a multimodal autoregressive objective, which is simple and effective for training and scaling.
High Performance:
- Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
- Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
- Achieves strong recognition performance, e.g., AIMv2-3B reaches 89.5% on ImageNet using a frozen trunk.
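
"Frozen trunk" means the encoder weights stay fixed and only a small classifier is trained on top of its features. The sketch below illustrates that idea with a closed-form ridge-regression probe in NumPy on synthetic features. This is a stand-in technique chosen for brevity, not the evaluation protocol behind the 89.5% figure, and the names `fit_linear_probe` and `predict` are hypothetical helpers.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression classifier on fixed (frozen) features."""
    x = np.asarray(features, dtype=np.float64)
    # Append a constant column so the probe also learns a bias term.
    x = np.hstack([x, np.ones((x.shape[0], 1))])
    # One-hot encode the integer class labels.
    y = np.eye(num_classes)[np.asarray(labels)]
    # Closed-form ridge solution: W = (X^T X + l2 * I)^-1 X^T Y
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def predict(weights, features):
    """Return the highest-scoring class index for each feature row."""
    x = np.asarray(features, dtype=np.float64)
    x = np.hstack([x, np.ones((x.shape[0], 1))])
    return (x @ weights).argmax(axis=1)


# Synthetic 2-D "features" standing in for encoder outputs (assumption).
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = [0, 0, 1, 1]
probe = fit_linear_probe(train_x, train_y, num_classes=2)
print(predict(probe, [[0.8, 0.2], [0.2, 0.8]]).tolist())  # [0, 1]
```

Only the probe weights are fitted here; nothing in the feature extractor changes, which is the point of a frozen-trunk evaluation.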

📦 Model Information

Property	Details
Library Name	transformers
Model Type	aimv2-large-patch14-224
License	apple-amlr
Pipeline Tag	image-feature-extraction
Tags	vision, image-feature-extraction, mlx, pytorch
Metrics	accuracy
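
The model name suggests a 14-pixel patch size and a 224-pixel input resolution. Assuming a standard ViT-style split into non-overlapping square patches (an assumption, since the card does not spell this out), the size of the patch grid can be worked out as follows. `patch_grid` is a hypothetical helper for illustration.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


side, num_patches = patch_grid(224, 14)
print(side, num_patches)  # 16 256
```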

Model Performance

The following table shows the accuracy of the aimv2-large-patch14-224 model on different classification datasets:

Dataset	Accuracy
imagenet-1k	86.6%
inaturalist-18	76.0%
cifar10	99.1%
cifar100	92.2%
food101	95.7%
dtd	87.9%
oxford-pets	96.3%
stanford-cars	96.3%
camelyon17	93.7%
patch-camelyon	89.3%
rxrx1	5.6%
eurosat	98.4%
fmow	60.7%
domainnet-infographic	69.0%
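
The card lists the metric only as "accuracy". Assuming this means top-1 accuracy (the fraction of images whose predicted class equals the label), a minimal sketch of the computation looks like this. `top1_accuracy` is a hypothetical helper, and the predictions below are made-up values.

```python
def top1_accuracy(predictions, labels):
    """Fraction of predictions that exactly match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Three of the four made-up predictions match their labels.
print(f"{top1_accuracy([0, 2, 1, 1], [0, 2, 2, 1]):.1%}")  # 75.0%
```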

💻 Usage Examples

Basic Usage

PyTorch

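import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the model from the Hugging Face Hub
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224",
)
# trust_remote_code=True allows the model code shipped with the checkpoint to run
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)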

JAX

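import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the Flax model from the Hugging Face Hub
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224",
)
# trust_remote_code=True allows the model code shipped with the checkpoint to run
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)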
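
For retrieval or similarity search, per-patch outputs are often collapsed into one vector per image. The sketch below mean-pools and L2-normalizes an array of shape (batch, tokens, dim). It assumes the model output exposes such an array as `last_hidden_state`, which the card does not state. `pool_features` is a hypothetical helper, and the random array stands in for real model output so the snippet runs without downloading weights.

```python
import numpy as np


def pool_features(last_hidden_state):
    """Mean-pool over tokens and L2-normalize: (batch, tokens, dim) -> (batch, dim)."""
    tokens = np.asarray(last_hidden_state, dtype=np.float32)
    pooled = tokens.mean(axis=1)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    # Guard against division by zero for all-zero inputs.
    return pooled / np.maximum(norms, 1e-12)


# Placeholder input: 2 images, 256 tokens, feature width 8 (arbitrary values).
dummy = np.random.default_rng(0).normal(size=(2, 256, 8))
features = pool_features(dummy)
print(features.shape)  # (2, 8)
```

With the PyTorch example, the input would likely be `outputs.last_hidden_state.detach().numpy()`; with JAX, `np.asarray(outputs.last_hidden_state)` should suffice, again assuming that attribute exists on the returned object.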

📄 License

This project is licensed under the apple-amlr license.

📚 Documentation

For more details about the AIMv2 model, please refer to the AIMv2 paper (https://arxiv.org/abs/2411.14402).

📖 Citation

If you find our work useful, please consider citing us as:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}
