aimv2-3B-patch14-336: An Open-Source Vision Model with Strong Multimodal Understanding

Aimv2 3B Patch14 336

Developed by Apple

AIMv2 is a series of vision models pretrained with multimodal autoregressive objectives, achieving excellent performance across multiple multimodal understanding benchmarks.

Image Classification, Multimodal Autoregressive Pretraining, High-Precision Image Classification, Open-Vocabulary Object Detection

Downloads 23

Release date: 10/29/2024

Model Overview

AIMv2 is an efficient vision model pretrained with multimodal autoregressive objectives, excelling in tasks such as image classification and object detection.

Model Features

Multimodal Autoregressive Pretraining

Pretrained with a multimodal autoregressive objective, which improves performance on multimodal understanding benchmarks

High Performance

Outperforms OpenAI CLIP and SigLIP on most multimodal understanding benchmarks, and DINOv2 on open-vocabulary object detection and referring expression comprehension

Large-Scale Scalability

The pretraining method is simple and straightforward to scale

Model Capabilities

Image feature extraction

Image classification

Open-vocabulary object detection

Referring expression comprehension

Use Cases

Computer Vision

Image Classification

High-precision image classification on datasets like ImageNet

ImageNet-1k accuracy 89.2%

Fine-Grained Classification

Classification on domain-specific datasets such as stanford-cars

stanford-cars accuracy 96.6%

Medical Imaging

Pathology Image Analysis

Analysis on medical imaging datasets like camelyon17

camelyon17 accuracy 93.2%

🚀 AIMv2 Vision Model

The AIMv2 family of vision models is pre-trained with a multimodal autoregressive objective, offering high-performance solutions for various vision tasks.

🚀 Quick Start

This README provides an introduction to the AIMv2 family of vision models, including their performance metrics, usage examples, and citation information.

✨ Features

High performance on benchmarks: Outperforms OpenAI CLIP and SigLIP on most multimodal understanding benchmarks, and DINOv2 on open-vocabulary object detection and referring expression comprehension.
Strong recognition ability: The AIMv2-3B model achieves 89.5% accuracy on ImageNet using a frozen trunk.
Multimodal pre-training: Trained with a multimodal autoregressive objective, which is simple and easy to scale.

📦 Model Information

Property	Details
Library Name	transformers
License	apple-amlr
Metrics	accuracy
Pipeline Tag	image-feature-extraction
Tags	vision, image-feature-extraction, mlx, pytorch

Model Performance

The aimv2-3B-patch14-336 model has the following performance on different classification tasks:

Task	Dataset	Accuracy (%)
Classification	imagenet-1k	89.2
Classification	inaturalist-18	84.4
Classification	cifar10	99.5
Classification	cifar100	94.4
Classification	food101	97.2
Classification	dtd	89.3
Classification	oxford-pets	97.2
Classification	stanford-cars	96.6
Classification	camelyon17	93.2
Classification	patch-camelyon	89.3
Classification	rxrx1	8.8
Classification	eurosat	99.0
Classification	fmow	65.7
Classification	domainnet-infographic	74.0

💻 Usage Examples

Basic Usage - PyTorch

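import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the model from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-3B-patch14-336",
)
# trust_remote_code=True lets transformers run the model code shipped in the repository.
model = AutoModel.from_pretrained(
    "apple/aimv2-3B-patch14-336",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)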
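
To sanity-check the shape of `outputs`, it can help to estimate how many patch tokens the encoder should produce. The sketch below assumes that the model name encodes a 14-pixel patch size and a 336-pixel square input, and that the encoder emits one token per patch with no extra class token. Neither assumption is confirmed by this page, and `num_patch_tokens` is a hypothetical helper, not part of `transformers`.

```python
def num_patch_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Return the number of non-overlapping square patches in a square image.

    Defaults are read off the model name "patch14-336" (an assumption).
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 336 // 14 = 24
    return patches_per_side * patches_per_side   # 24 * 24 = 576


print(num_patch_tokens())  # 576
```

If the assumptions hold, the sequence dimension of the returned features for a single image should match this number; a mismatch would suggest extra tokens or a different input resolution.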

Basic Usage - JAX

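import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the Flax version of the model from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-3B-patch14-336",
)
# trust_remote_code=True lets transformers run the model code shipped in the repository.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-3B-patch14-336",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)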
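
Since the model is tagged for image feature extraction, a common next step is to reduce the per-patch features to one vector per image and compare images by cosine similarity. The following is a minimal, framework-free sketch of that idea using plain Python lists as stand-ins for model outputs. The helpers `mean_pool` and `cosine_similarity` are hypothetical names introduced here for illustration, and mean pooling is one common choice rather than a method prescribed by the AIMv2 authors.

```python
import math
from typing import List, Sequence


def mean_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of token vectors into a single feature vector."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    dim = len(tokens[0])
    if any(len(tok) != dim for tok in tokens):
        raise ValueError("all token vectors must have the same length")
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy stand-ins for the patch features of two images (2 tokens, 2 dimensions each).
image_a = mean_pool([[1.0, 2.0], [3.0, 4.0]])
image_b = mean_pool([[2.0, 3.0], [2.0, 3.0]])
similarity = cosine_similarity(image_a, image_b)
```

In practice the same two steps would be applied to the framework's arrays (for example, averaging over the token axis of the model output) rather than to Python lists.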

📄 License

This project is licensed under the apple-amlr license.

📚 Documentation

For more information about the AIMv2 models, please refer to the AIMv2 paper: https://arxiv.org/abs/2411.14402

📖 Citation

If you find our work useful, please consider citing us as:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}
