Aimv2-huge-patch14-336 Open-Source Vision Model - Multimodal Pretraining to Boost Visual Understanding!

Aimv2 Huge Patch14 336

Developed by Apple

AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, achieving outstanding performance across multiple visual understanding benchmarks.

Image Classification #Multimodal Autoregressive Pretraining #High-Precision Image Classification #Open-Vocabulary Object Detection

Downloads 188

Release Time: 10/29/2024

Model Overview

AIMv2 is a vision model pretrained with a multimodal autoregressive objective, suitable for image classification and feature extraction tasks.

Model Features

Multimodal Autoregressive Pretraining

Pretrains with a multimodal autoregressive objective, which the authors describe as simple to train and scale.

Exceptional Benchmark Performance

Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.

Powerful Recognition Capabilities

Reaches 88.2% accuracy on ImageNet-1k; the larger AIMv2-3B reaches 89.5% on ImageNet with a frozen trunk.

Model Capabilities

Image classification

Image feature extraction

Multimodal understanding

Use Cases

Computer Vision

Image Classification

Classifies general-purpose images; results are reported on multiple datasets.

Achieves 88.2% accuracy on ImageNet-1k

Fine-Grained Classification

Performs fine-grained classification for domain-specific images.

Achieves 96.4% accuracy on Stanford Cars

Medical Imaging

Pathological Image Analysis

Used for classification and analysis of medical images.

Achieves 93.3% accuracy on Camelyon17

🚀 AIMv2 Vision Models

The AIMv2 family of vision models is pre-trained with a multimodal autoregressive objective. It is simple to train and scales effectively, performing strongly on a range of multimodal understanding benchmarks.

📋 Model Information

Property	Details
Library Name	transformers
License	apple-amlr
Metrics	accuracy
Pipeline Tag	image-feature-extraction
Tags	vision, image-feature-extraction, mlx, pytorch

📊 Model Performance

The aimv2-huge-patch14-336 model achieves the following accuracy on different classification tasks:

Dataset	Accuracy (%)
imagenet-1k	88.2
inaturalist-18	81.0
cifar10	99.3
cifar100	93.6
food101	96.6
dtd	88.8
oxford-pets	96.8
stanford-cars	96.4
camelyon17	93.3
patch-camelyon	89.4
rxrx1	7.2
eurosat	98.7
fmow	63.9
domainnet-infographic	73.4

🚀 Quick Start

Introduction

[AIMv2 Paper] [BibTeX]

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward, and it scales effectively. Some AIMv2 highlights include:

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

💻 Usage Examples

Basic Usage

PyTorch

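import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessor and the model.
# trust_remote_code=True allows the model's custom code from the Hub to run.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-huge-patch14-336",
)
model = AutoModel.from_pretrained(
    "apple/aimv2-huge-patch14-336",
    trust_remote_code=True,
)

# Preprocess the image and run a forward pass without tracking gradients.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)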
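
The model name encodes a 14-pixel patch size and a 336-pixel input resolution. As a rough sanity check on the output of the example above, the sketch below works out how many patch tokens that implies. It assumes square inputs, non-overlapping patches, and one token per patch with no extra class token; the helper name `patch_grid` is ours for illustration and is not part of `transformers`.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Assumes non-overlapping square patches that tile the image exactly.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values read off the model name: patch14, 336.
per_side, total = patch_grid(336, 14)
print(per_side, total)  # 24 576
```

If those assumptions hold, and if the output exposes a `last_hidden_state` attribute as `transformers` base model outputs usually do, then `outputs.last_hidden_state.shape[1]` should be 576. A different value would suggest the checkpoint adds extra tokens or resizes inputs differently.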

JAX

import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download the same sample image used in the PyTorch example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessor and the Flax version of the model.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-huge-patch14-336",
)
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-huge-patch14-336",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)
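
Both examples stop at `outputs`. For feature-extraction use, a common next step is to collapse the per-patch features into one vector per image and compare images by cosine similarity. The sketch below does that with NumPy on dummy data. Mean pooling is our choice for illustration and not necessarily what the AIMv2 authors use, and the `(batch, tokens, dim)` layout is an assumption about the model output. `pool_features` and `cosine_similarity` are hypothetical helper names.

```python
import numpy as np


def pool_features(tokens: np.ndarray) -> np.ndarray:
    """Mean-pool (batch, tokens, dim) features into L2-normalized (batch, dim)."""
    pooled = tokens.mean(axis=1)
    norms = np.linalg.norm(pooled, axis=-1, keepdims=True)
    # Clip the norm to avoid dividing by zero for an all-zero vector.
    return pooled / np.clip(norms, 1e-12, None)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two sets of unit-norm vectors."""
    return a @ b.T


# Dummy stand-in for model output: 2 images, 576 patch tokens, 8-dim features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 576, 8))
embeddings = pool_features(tokens)
similarity = cosine_similarity(embeddings, embeddings)
print(similarity.shape)  # (2, 2)
```

With the PyTorch example, something like `outputs.last_hidden_state.detach().numpy()` could supply `tokens`, assuming that attribute is present on the returned object.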

📄 License

The project is licensed under apple-amlr.

📚 Citation

If you find our work useful, please consider citing us as:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}
