aimv2-3B-patch14-448 open-source vision model: multimodal pre-training for efficient visual understanding

Aimv2 3B Patch14 448

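Developed by Apple

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. The models perform strongly across a range of visual understanding benchmarks.

Image classification #multimodal autoregressive pre-training #high-accuracy image classification #open-vocabulary detection

Downloads: 161

Released: 10/29/2024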

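Model overview

The AIMv2 family of vision models is pre-trained with a multimodal autoregressive objective. The models provide strong image feature extraction and classification, and outperform comparable models on several benchmarks.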

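Model features

Multimodal autoregressive pre-training

Pre-training with a multimodal autoregressive objective improves model performance.

Strong classification performance

Outperforms OpenAI CLIP, SigLIP, and DINOv2 on several benchmarks.

Large parameter count

At 3B parameters, the model has substantial feature extraction capacity.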

Model capabilities

Image feature extraction

Image classification

Multimodal understanding

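Use cases

Computer vision

Image classification

High-accuracy image classification on datasets such as ImageNet.

ImageNet-1k accuracy: 89.5%

Fine-grained classification

Strong results on fine-grained classification tasks such as stanford-cars.

stanford-cars accuracy: 96.7%

Medical imaging

Pathology image analysis

Classification on medical imaging datasets such as camelyon17.

camelyon17 accuracy: 93.4%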

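🚀 AIMv2 image feature extraction model

AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. Training and scaling are simple, direct, and efficient. The models perform well on most multimodal understanding benchmarks and also do well on tasks such as open-vocabulary object detection and referring expression comprehension.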
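
The text above names a multimodal autoregressive objective without spelling it out. As a rough illustration only, the sketch below combines a regression loss on predicted image patches with a negative log-likelihood on predicted text tokens. This form is an assumption based on our reading of the cited paper, not something stated on this page, and the names `image_loss`, `text_loss`, `combined_loss`, and the weight `alpha` are hypothetical.

```python
import math


def image_loss(pred_patches, true_patches):
    """Mean over patches of the squared error between predicted and target patch vectors."""
    total = 0.0
    for pred, true in zip(pred_patches, true_patches):
        total += sum((p - t) ** 2 for p, t in zip(pred, true))
    return total / len(true_patches)


def text_loss(token_probs):
    """Average negative log-likelihood assigned to the correct next text tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def combined_loss(pred_patches, true_patches, token_probs, alpha=0.5):
    # alpha is a hypothetical weight chosen for illustration only.
    return text_loss(token_probs) + alpha * image_loss(pred_patches, true_patches)


# Toy data: two 3-dimensional "patches" and three text-token probabilities.
pred = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
true = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
probs = [0.5, 0.25, 1.0]

loss = combined_loss(pred, true, probs)
print(f"combined toy loss: {loss:.4f}")
```

In a real model, both the patch predictions and the token probabilities would come from a decoder that only sees earlier positions in the sequence. The toy values here are fixed by hand.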

🚀 Quick start

Model information

Property	Details
Library	transformers
License	apple-amlr
Metric	Accuracy
Task type	Image feature extraction
Tags	vision, image-feature-extraction, mlx, pytorch

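Model performance

Task	Dataset	Accuracy	Verified
Classification	imagenet-1k	89.5%	No
Classification	inaturalist-18	85.9%	No
Classification	cifar10	99.5%	No
Classification	cifar100	94.5%	No
Classification	food101	97.4%	No
Classification	dtd	89.0%	No
Classification	oxford-pets	97.4%	No
Classification	stanford-cars	96.7%	No
Classification	camelyon17	93.4%	No
Classification	patch-camelyon	89.9%	No
Classification	rxrx1	9.5%	No
Classification	eurosat	98.9%	No
Classification	fmow	66.1%	No
Classification	domainnet-infographic	74.8%	No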
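
The accuracy figures span a wide range, so a quick summary can help when deciding whether the model suits a given domain. The sketch below copies the numbers from the table above and ranks them. The variable names and the 90% threshold are arbitrary choices for illustration.

```python
# Accuracy figures (percent) copied from the performance table above.
accuracy = {
    "imagenet-1k": 89.5,
    "inaturalist-18": 85.9,
    "cifar10": 99.5,
    "cifar100": 94.5,
    "food101": 97.4,
    "dtd": 89.0,
    "oxford-pets": 97.4,
    "stanford-cars": 96.7,
    "camelyon17": 93.4,
    "patch-camelyon": 89.9,
    "rxrx1": 9.5,
    "eurosat": 98.9,
    "fmow": 66.1,
    "domainnet-infographic": 74.8,
}

# Sort from lowest to highest accuracy.
ranked = sorted(accuracy.items(), key=lambda item: item[1])
weakest, strongest = ranked[0], ranked[-1]
mean_acc = sum(accuracy.values()) / len(accuracy)

# Datasets under an arbitrary 90% threshold, lowest first.
below_90 = [name for name, acc in ranked if acc < 90.0]

print(f"weakest:   {weakest[0]} ({weakest[1]}%)")
print(f"strongest: {strongest[0]} ({strongest[1]}%)")
print(f"mean over {len(accuracy)} datasets: {mean_acc:.1f}%")
print("below 90%:", ", ".join(below_90))
```

The unweighted mean is pulled down heavily by rxrx1, so it is better read alongside the per-dataset ranking than on its own.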

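Model highlights

Outperforms OAI CLIP and SigLIP on most multimodal understanding benchmarks.
Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
AIMv2-3B reaches 89.5% accuracy on ImageNet using a frozen trunk.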
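
"Frozen trunk" means the backbone weights stay fixed and only a lightweight classifier is fitted on top of the extracted features. This page does not say which classifier head produced the 89.5% figure, so the sketch below swaps in a deliberately simple nearest-centroid classifier to show the general idea. The feature vectors are toy stand-ins, and `fit_centroids` and `predict` are hypothetical helper names.

```python
def fit_centroids(features, labels):
    """Average the (frozen) feature vectors of each class into one centroid."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))

    return min(centroids, key=lambda label: distance(centroids[label]))


# Toy stand-ins for features a frozen backbone would produce.
train_x = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, -0.2]]
train_y = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(train_x, train_y)
preds = [predict(centroids, v) for v in [[0.7, 0.0], [-0.9, 0.1]]]
print(preds)
```

Only the centroids are learned here. The feature extractor itself is never updated, which is the property the frozen-trunk setting relies on.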

Model overview figure

[Figure: AIMv2 Overview]

💻 Usage examples

Basic usage - PyTorch

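import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the model weights from the Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-3B-patch14-448",
)
# trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
model = AutoModel.from_pretrained(
    "apple/aimv2-3B-patch14-448",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)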
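
A common next step after the forward pass is to reduce the per-patch features to one vector per image and compare images by cosine similarity. The sketch below shows that step on toy lists so that it runs without downloading the model. With the real model, the token vectors would presumably come from something like `outputs.last_hidden_state[0].tolist()`, which is an assumption about the output object and is not confirmed on this page. `mean_pool` and `cosine_similarity` are hypothetical helper names.

```python
import math


def mean_pool(tokens):
    """Average a list of per-patch feature vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins: each "image" has 2 patch tokens of dimension 3.
image_a = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
image_b = [[2.0, 2.0, 0.0], [2.0, 2.0, 0.0]]

emb_a = mean_pool(image_a)
emb_b = mean_pool(image_b)
similarity = cosine_similarity(emb_a, emb_b)
print(f"cosine similarity: {similarity:.3f}")
```

Mean pooling is only one option. It is used here because it needs no extra parameters, not because this page recommends it.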

Basic usage - JAX

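import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image preprocessor and the Flax model weights from the Hub.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-3B-patch14-448",
)
# trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-3B-patch14-448",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run a forward pass.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)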
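
Both examples load the `patch14-448` checkpoint. Assuming the name encodes a 14-pixel patch size and a 448-pixel square input, which is the usual convention but is not stated on this page, the number of patch tokens per image can be worked out as below. Whether the model adds any extra tokens on top of these is not covered here. `patch_grid` is a hypothetical helper.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumed from the checkpoint name: 448-pixel input, 14-pixel patches.
side, num_patches = patch_grid(448, 14)
print(f"{side} x {side} grid, {num_patches} patch tokens")
```

The token count grows with the square of the resolution, so memory and compute per image rise quickly at larger input sizes.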

📄 License

This project is released under the apple-amlr license.

📚 Documentation

Citation

[AIMv2 Paper] [BibTeX]

If you find our work useful, please consider citing our paper:

@misc{fini2024multimodalautoregressivepretraininglarge,
  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
  url         = {https://arxiv.org/abs/2411.14402},
  eprint      = {2411.14402},
  eprintclass = {cs.CV},
  eprinttype  = {arXiv},
  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
  year        = {2024},
}