aimv2-large-patch14-448 is a freely downloadable vision model with strong results across multiple benchmarks.

Aimv2 Large Patch14 448

Developed by Apple

AIMv2 is a series of vision models pretrained with a multimodal autoregressive objective that perform strongly across multiple benchmarks.

Tags: Image Classification, Multimodal Autoregressive Pretraining, High-Precision Image Classification, Open-Vocabulary Understanding

Downloads 2,210

Release Time: 10/29/2024

Model Overview

AIMv2 is pretrained with a multimodal autoregressive objective and performs strongly on vision tasks such as image classification and object detection.

Model Features

Multimodal Autoregressive Pretraining

Pairs the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens during pretraining.

Outstanding Performance

Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks, and DINOv2 on open-vocabulary object detection and referring expression comprehension.

Large-Scale Scalability

The pretraining method is simple and direct, which allows training to be scaled up effectively.

Model Capabilities

Image feature extraction

Image classification

Multimodal understanding

Open-vocabulary object detection

Referring expression comprehension

Use Cases

Computer Vision

Image Classification

Performs image classification tasks on datasets like ImageNet

Achieves 87.9% accuracy on ImageNet-1k

Fine-Grained Classification

Handles domain-specific fine-grained image classification tasks

Achieves 96.6% accuracy on Stanford Cars

Medical Image Analysis

Processes medical image classification tasks

Achieves 94.1% accuracy on Camelyon17

Remote Sensing Image Processing

Satellite Image Classification

Handles satellite and aerial image classification tasks

Achieves 98.6% accuracy on EuroSAT

🚀 AIMv2 Vision Models

The AIMv2 family of vision models is pre-trained with a multimodal autoregressive objective, offering high-performance solutions for various vision tasks.

🚀 Quick Start

The AIMv2 models are designed to be used with the transformers library. Their multimodal autoregressive pre-training makes them effective for a wide range of vision tasks.

✨ Features

Multimodal Performance: Outperforms OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
Open-Vocabulary Tasks: Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
Strong Recognition: AIMv2-3B achieves 89.5% on ImageNet using a frozen trunk.

📦 Installation

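The models are loaded through the transformers library, which you can install using pip. The usage examples also import requests and PIL (Pillow), plus PyTorch or JAX/Flax depending on the backend: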

pip install transformers
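
Before running the examples, it can help to confirm that the packages they import are actually present. The following is a minimal sketch using only the standard library. The helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions (the list mirrors the imports of the PyTorch usage example, plus `torch` itself).

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names used by the PyTorch example (assumed list; adjust as needed).
REQUIRED = ["transformers", "torch", "PIL", "requests"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

For the JAX example, `torch` in the list would be swapped for `jax` and `flax`.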

💻 Usage Examples

Basic Usage

PyTorch

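import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the model weights from the Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-448",
)
# trust_remote_code=True allows the model's custom code from the Hub to run.
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-448",
    trust_remote_code=True,
)

# Preprocess the image and run a forward pass without tracking gradients.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)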
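
The model name encodes its patch size (14) and input resolution (448), which together determine how many patch tokens to expect in the output. Here is a small sketch of that arithmetic. The helper `num_patch_tokens` is hypothetical, and it assumes the encoder emits exactly one token per patch with no extra class token, which is worth checking against the actual output shape.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# aimv2-large-patch14-448: 448 / 14 = 32 patches per side, 32 * 32 = 1024 in total.
print(num_patch_tokens(448, 14))  # 1024
```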

JAX

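import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the Flax model weights from the Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-448",
)
# trust_remote_code=True allows the model's custom code from the Hub to run.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-448",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)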
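
For feature extraction, per-patch outputs are often reduced to one vector per image and then compared with cosine similarity. The sketch below shows one common way to do this with NumPy on dummy data. Mean pooling is an assumption here, not necessarily the pooling behind the reported benchmark numbers, and `mean_pool` and `cosine_similarity` are illustrative helper names.

```python
import numpy as np


def mean_pool(patch_features: np.ndarray) -> np.ndarray:
    """Average over the patch axis: (batch, patches, dim) -> (batch, dim)."""
    return patch_features.mean(axis=1)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_norm = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a_norm * b_norm).sum(axis=-1)


# Dummy stand-in for model output; the shape is an assumption for illustration.
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 1024, 16))
embeddings = mean_pool(features)  # shape (2, 16)
similarity = cosine_similarity(embeddings, embeddings)  # each image against itself
print(embeddings.shape, similarity)
```

Assuming the model output exposes `last_hidden_state`, as is typical for transformers models, the array passed to `mean_pool` would be `outputs.last_hidden_state.detach().numpy()` in PyTorch or `np.asarray(outputs.last_hidden_state)` in JAX.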

📚 Documentation

Model Information

Property	Details
Library Name	transformers
Model Type	image-feature-extraction
License	apple-amlr
Metrics	accuracy
Tags	vision, image-feature-extraction, mlx, pytorch

Performance Metrics

The aimv2-large-patch14-448 model has the following performance on different classification tasks:

Dataset	Accuracy (%)
imagenet-1k	87.9
inaturalist-18	81.3
cifar10	99.1
cifar100	92.4
food101	96.6
dtd	88.9
oxford-pets	96.5
stanford-cars	96.6
camelyon17	94.1
patch-camelyon	89.6
rxrx1	7.4
eurosat	98.6
fmow	62.8
domainnet-infographic	72.7
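
To compare the reported numbers programmatically, the table can be transcribed into a dictionary. The snippet below is a small sketch that copies the values above verbatim and ranks the datasets. The names `ACCURACY` and `ranked` are illustrative.

```python
# Accuracy (%) of aimv2-large-patch14-448, transcribed from the table above.
ACCURACY = {
    "imagenet-1k": 87.9,
    "inaturalist-18": 81.3,
    "cifar10": 99.1,
    "cifar100": 92.4,
    "food101": 96.6,
    "dtd": 88.9,
    "oxford-pets": 96.5,
    "stanford-cars": 96.6,
    "camelyon17": 94.1,
    "patch-camelyon": 89.6,
    "rxrx1": 7.4,
    "eurosat": 98.6,
    "fmow": 62.8,
    "domainnet-infographic": 72.7,
}


def ranked(scores):
    """Return (dataset, accuracy) pairs sorted from highest to lowest accuracy."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ordering = ranked(ACCURACY)
best, worst = ordering[0], ordering[-1]
print("highest:", best)  # ('cifar10', 99.1)
print("lowest:", worst)  # ('rxrx1', 7.4)
```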

📄 License

This project is licensed under the apple-amlr license.

📖 Citation

If you find our work useful, please consider citing us as:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}
