AIMv2-Huge-Patch14-448 Open-Source Vision Model - Multimodal Pretraining, Good Performance in Benchmark Tests

AIMv2 Huge Patch14 448

Developed by Apple

AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, demonstrating excellent performance across multiple benchmarks.

Image Classification #Multimodal Autoregressive Pretraining #High-Precision Image Classification #Open-Vocabulary Understanding

Downloads 1,672

Release Time: 10/29/2024

Model Overview

AIMv2 is a vision model pretrained with a multimodal autoregressive objective. This checkpoint is published for image feature extraction, and its features perform well on image classification benchmarks.

Model Features

Multimodal Autoregressive Pretraining

Pretrained with a multimodal autoregressive objective; according to the authors, the approach is simple to train and scales effectively

Outstanding Benchmark Performance

Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks, and DINOv2 on open-vocabulary object detection and referring expression comprehension

Powerful Recognition Capabilities

The largest model in the family, AIMv2-3B, achieves 89.5% accuracy on ImageNet with a frozen trunk; this checkpoint (aimv2-huge-patch14-448) reaches 88.6% on ImageNet-1k

Model Capabilities

Image feature extraction

Image classification

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension

Use Cases

Computer Vision

Image Classification

Classify and recognize images

Achieves 88.6% accuracy on ImageNet-1k

Natural Image Recognition

Identify plant and animal species in natural-world photographs

Achieves 82.8% accuracy on iNaturalist-18

Fine-Grained Classification

Perform fine-grained object classification

Achieves 96.5% accuracy on Stanford Cars

Medical Imaging

Pathological Image Analysis

Analyze medical pathological images

Achieves 93.4% accuracy on Camelyon17

🚀 Transformers Library for Image Feature Extraction

This checkpoint belongs to the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective and is loaded through the transformers library. These models are simple to train, scale effectively, and perform well on a range of multimodal understanding benchmarks.

🚀 Quick Start

💻 Usage Examples

Basic Usage

The following examples demonstrate how to use the AIMv2 model for image feature extraction in both PyTorch and JAX.

PyTorch

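import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the model weights from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-huge-patch14-448",
)
# trust_remote_code=True allows the model code shipped in the repository to run.
model = AutoModel.from_pretrained(
    "apple/aimv2-huge-patch14-448",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)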
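
The forward pass returns per-token features rather than a single vector per image. The sketch below shows one common way to turn them into an image embedding: mean-pool over tokens, then L2-normalize so that a dot product gives cosine similarity. It assumes the model output exposes a `last_hidden_state` of shape (batch, tokens, hidden_dim); the function name `pool_image_embedding` and the random stand-in array are illustrative, and mean pooling is a generic choice here, not necessarily the pooling used to produce the reported benchmark numbers.

```python
import numpy as np


def pool_image_embedding(last_hidden_state: np.ndarray) -> np.ndarray:
    """Mean-pool token features and L2-normalize: one vector per image."""
    # last_hidden_state has shape (batch, num_tokens, hidden_dim).
    pooled = last_hidden_state.mean(axis=1)
    norms = np.linalg.norm(pooled, axis=-1, keepdims=True)
    # Clip the norm to avoid division by zero for an all-zero vector.
    return pooled / np.clip(norms, 1e-12, None)


# Stand-in for outputs.last_hidden_state converted to NumPy (assumed layout).
# The sizes are illustrative only, not the real model dimensions.
rng = np.random.default_rng(0)
fake_hidden = rng.normal(size=(2, 1024, 8)).astype(np.float32)

embeddings = pool_image_embedding(fake_hidden)
# Rows are unit length, so this matrix holds pairwise cosine similarities.
similarity = embeddings @ embeddings.T
```

With the PyTorch example above, the input would be something like `outputs.last_hidden_state.detach().numpy()`, assuming that attribute is present on the returned object.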

JAX

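import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the Flax model weights from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-huge-patch14-448",
)
# trust_remote_code=True allows the model code shipped in the repository to run.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-huge-patch14-448",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)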
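
The model name encodes the patch size (14) and the input resolution (448), which together determine how many patch tokens to expect per image. The following is a small arithmetic sketch, assuming square inputs split into non-overlapping 14x14 patches; the helper `patch_grid` is hypothetical, and whether the model adds any extra tokens on top of the patches is not stated here.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Values taken from the model name: patch14, 448.
side, num_patches = patch_grid(448, 14)
print(side, num_patches)  # 32 1024
```

Under these assumptions, each 448x448 image yields a 32x32 grid, or 1,024 patch tokens.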

✨ Features

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Exhibits strong recognition performance, with AIMv2-3B achieving 89.5% on ImageNet using a frozen trunk.

📚 Documentation

Model Information

Property	Details
Library Name	transformers
Model Type	aimv2-huge-patch14-448
Pipeline Tag	image-feature-extraction
Tags	vision, image-feature-extraction, mlx, pytorch
Metrics	accuracy
License	apple-amlr

Performance Metrics

The aimv2-huge-patch14-448 model has been evaluated on multiple classification datasets, with the following accuracy:

Dataset	Accuracy (%)
imagenet-1k	88.6
inaturalist-18	82.8
cifar10	99.4
cifar100	93.6
food101	97.0
dtd	88.9
oxford-pets	96.8
stanford-cars	96.5
camelyon17	93.4
patch-camelyon	89.6
rxrx1	7.8
eurosat	98.7
fmow	64.8
domainnet-infographic	74.5
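
Accuracy varies widely across these datasets, so it can help to rank them before deciding where the features are a good fit. The snippet below is a minimal sketch that copies the values from the table into a dictionary and lists the three lowest-scoring datasets; the variable names are illustrative.

```python
# Accuracy values (%) copied from the table above.
accuracy = {
    "imagenet-1k": 88.6,
    "inaturalist-18": 82.8,
    "cifar10": 99.4,
    "cifar100": 93.6,
    "food101": 97.0,
    "dtd": 88.9,
    "oxford-pets": 96.8,
    "stanford-cars": 96.5,
    "camelyon17": 93.4,
    "patch-camelyon": 89.6,
    "rxrx1": 7.8,
    "eurosat": 98.7,
    "fmow": 64.8,
    "domainnet-infographic": 74.5,
}

# Sort ascending by accuracy and keep the three hardest datasets.
ranked = sorted(accuracy.items(), key=lambda item: item[1])
lowest_three = [name for name, _ in ranked[:3]]

for name, score in ranked[:3]:
    print(f"{name}: {score:.1f}")
```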

Introduction

[AIMv2 Paper] [BibTeX]

We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective. AIMv2 pre-training is simple and straightforward to implement, and it scales effectively.

📄 License

The model is released under the apple-amlr license.

📄 Citation

If you find our work useful, please consider citing us as:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}
