AIMV2-Huge-Patch14-224 Open-Source Vision Model - Multi-Modal Pre-Training for Excellent Benchmark Performance

Aimv2 Huge Patch14 224

Developed by Apple

The AIMv2 series is a family of vision models pretrained with a multimodal autoregressive objective. The models perform strongly across multiple benchmarks.

Image Classification #Multimodal Autoregressive Pretraining #High-precision Image Classification #Open-vocabulary Object Detection

Release Time : 10/29/2024

Model Overview

AIMv2 is a vision model pretrained with a multimodal autoregressive objective. It is well suited to image classification and feature extraction.

Model Features

Multimodal Autoregressive Pretraining

Pretrained with a multimodal autoregressive objective rather than a purely image-based one

Outstanding Benchmark Performance

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks, and DINOv2 on open-vocabulary object detection and referring expression comprehension

Large-scale Scalability

The pretraining method is simple to implement and scales effectively

Model Capabilities

Image classification

Image feature extraction

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension

Use Cases

Computer Vision

Image Classification

High-precision image classification on datasets like ImageNet

Achieves 87.5% accuracy on ImageNet-1k

Fine-grained Classification

Fine-grained image classification for specific domains

Achieves 96.4% accuracy on stanford-cars

Medical Image Analysis

Medical image classification and analysis

Achieves 93.3% accuracy on camelyon17

Multimodal Applications

Open-vocabulary Object Detection

Detects objects from categories that were not explicitly labeled in the training set

Outperforms DINOv2 on this task

Referring Expression Comprehension

Understands natural language referring expressions and locates corresponding regions in images

Outperforms DINOv2 on this task

🚀 AIMv2 Vision Models

This project introduces the AIMv2 family of vision models, pre-trained with a multimodal autoregressive objective. The models are straightforward to train and scale effectively, and they perform strongly on a range of multimodal understanding and recognition tasks.

✨ Features

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

📦 Model Information

Property	Details
Library Name	transformers
License	apple-amlr
Metrics	accuracy
Pipeline Tag	image-feature-extraction
Tags	vision, image-feature-extraction, mlx, pytorch
Model Name	aimv2-huge-patch14-224
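
The model name encodes the input geometry: `patch14` and `224` conventionally mean 14×14-pixel patches on a 224×224 input. Assuming that convention holds here and the image is split into non-overlapping square patches, the number of patch tokens can be worked out as in the sketch below. The helper `patch_grid` is hypothetical and not part of `transformers`.

```python
from __future__ import annotations


def patch_grid(image_size: int = 224, patch_size: int = 14) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Assumes non-overlapping square patches, as the name "patch14-224" suggests.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid()
print(per_side, total)  # 16 256
```

Under these assumptions, one 224×224 image yields a 16×16 grid, or 256 patch positions.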

📊 Model Results

Task	Dataset	Accuracy (%)
Classification	imagenet-1k	87.5
Classification	inaturalist-18	77.9
Classification	cifar10	99.3
Classification	cifar100	93.5
Classification	food101	96.3
Classification	dtd	88.2
Classification	oxford-pets	96.6
Classification	stanford-cars	96.4
Classification	camelyon17	93.3
Classification	patch-camelyon	89.3
Classification	rxrx1	5.8
Classification	eurosat	98.5
Classification	fmow	62.2
Classification	domainnet-infographic	70.4
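
The table reports a single accuracy figure per dataset. Assuming these are top-1 accuracies expressed as percentages (the source only says "accuracy"), the sketch below shows how such a figure is computed from predicted and true class indices. The function name and the toy data are made up for illustration and do not come from any benchmark.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of examples whose predicted class equals the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy data: three of the four predictions match their labels.
print(top1_accuracy([0, 2, 1, 1], [0, 2, 2, 1]))  # 75.0
```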

💻 Usage Examples

Basic Usage

PyTorch

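import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the model weights from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-huge-patch14-224",
)
# trust_remote_code=True runs the model code shipped in the repository.
model = AutoModel.from_pretrained(
    "apple/aimv2-huge-patch14-224",
    trust_remote_code=True,
)

# Preprocess the image and run a forward pass without tracking gradients.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)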
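
The `outputs` object above is expected to hold per-patch features rather than a single image vector. One common way to get an image-level embedding is to average the patch features and then compare images by cosine similarity. The sketch below shows that arithmetic on plain Python lists so that it runs without downloading the model. The helper names are hypothetical, and mean pooling is an assumption here, not necessarily the pooling used for the reported results. With the PyTorch example, the equivalent would be something like `outputs.last_hidden_state.mean(dim=1)`, assuming the output exposes `last_hidden_state`.

```python
import math
from typing import List, Sequence


def mean_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Average a (num_tokens, dim) list of token features into one vector."""
    if len(tokens) == 0:
        raise ValueError("need at least one token")
    n = len(tokens)
    # zip(*tokens) iterates over feature dimensions (columns).
    return [sum(column) / n for column in zip(*tokens)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy features: 2 tokens with 3 dimensions each (made up for illustration).
image_a = mean_pool([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]])  # [2.0, 0.0, 2.0]
image_b = mean_pool([[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]])  # [1.0, 0.0, 1.0]
print(cosine_similarity(image_a, image_b))
```

The two toy embeddings point in the same direction, so their similarity is close to 1.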

JAX

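import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the Flax model weights from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-huge-patch14-224",
)
# trust_remote_code=True runs the model code shipped in the repository.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-huge-patch14-224",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)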

📄 License

The project is released under the apple-amlr license.

📚 Documentation

[AIMv2 Paper] [BibTeX]

📖 Citation

If you find our work useful, please consider citing us as:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}
