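🚀 AIMv2 Image Feature Extraction Model
AIMv2 is a family of vision models pre-trained with a multimodal autoregressive objective. Pre-training is simple and straightforward, and it scales effectively. The models perform well on most multimodal understanding benchmarks, and also on tasks such as open-vocabulary object detection and referring expression comprehension.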
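🚀 Quick Start
Model Information
| Property | Details |
| --- | --- |
| Library name | transformers |
| Model type | Image feature extraction |
| License | apple-amlr |
| Evaluation metric | Accuracy |
| Tags | vision, image-feature-extraction, mlx, pytorch |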
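Model Performance
| Task | Dataset | Accuracy |
| --- | --- | --- |
| Classification | imagenet-1k | 86.6% |
| Classification | inaturalist-18 | 76.0% |
| Classification | cifar10 | 99.1% |
| Classification | cifar100 | 92.2% |
| Classification | food101 | 95.7% |
| Classification | dtd | 87.9% |
| Classification | oxford-pets | 96.3% |
| Classification | stanford-cars | 96.3% |
| Classification | camelyon17 | 93.7% |
| Classification | patch-camelyon | 89.3% |
| Classification | rxrx1 | 5.6% |
| Classification | eurosat | 98.4% |
| Classification | fmow | 60.7% |
| Classification | domainnet-infographic | 69.0% |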
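The metric reported for these classification datasets is accuracy. As a minimal sketch, assuming it means top-1 accuracy (the source only says "accuracy"), the computation looks like the following. The function name `top1_accuracy` and the toy scores are hypothetical and are not taken from the AIMv2 evaluation code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the fraction of samples whose highest-scoring class equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy example with three classes; the scores are made up for illustration.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
toy_labels = [1, 0, 2, 0]
accuracy = top1_accuracy(toy_scores, toy_labels)
print(f"top-1 accuracy: {accuracy:.1%}")  # 3 of 4 correct -> 75.0%
```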
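Highlights
- Outperforms OpenAI CLIP and SigLIP on most multimodal understanding benchmarks.
- Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
- AIMv2-3B reaches 89.5% accuracy on ImageNet with a frozen trunk.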
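A frozen trunk means the encoder weights are not updated; only a lightweight head is fit on top of its features. The sketch below illustrates that idea with a nearest-centroid classifier. This is a deliberately simple stand-in and not the probing protocol behind the 89.5% figure. The 2-D vectors are made-up placeholders for encoder features, and the names `fit_centroids` and `predict` are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Average the frozen features of each class into one centroid per class."""
    grouped = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Placeholder features: in practice these would come from the frozen encoder.
train_features = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, -0.2]]
train_labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(train_features, train_labels)
prediction = predict(centroids, [0.7, 0.0])
print(prediction)
```

Because the encoder is never updated, its features can be computed once and cached, so only the small head needs to be fit.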
Model Overview Figure
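💻 Usage Examples
Basic Usage - PyTorch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing pipeline that matches the checkpoint.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224",
)
# trust_remote_code=True allows transformers to run the model code shipped with the checkpoint.
model = AutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224",
    trust_remote_code=True,
)

# Preprocess the image into PyTorch tensors and run the model.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)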
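The model identifier `apple/aimv2-large-patch14-224` suggests 14-pixel patches on 224-pixel inputs. The sketch below works out how many patch tokens that implies and shows one common way to reduce per-token features to a single image vector. It assumes the output exposes `last_hidden_state` with shape (batch, num_tokens, hidden_dim) and no extra tokens, which the source does not confirm. The helper names are hypothetical, and plain lists stand in for tensors so the snippet runs without downloading the model.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches covering a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def mean_pool(tokens):
    """Average per-token feature vectors into a single image-level vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


# 224 / 14 = 16 patches per side, so 16 * 16 = 256 tokens (assuming no extra tokens).
tokens_224 = num_patch_tokens(224, 14)

# Stand-in for one image's token features: 3 tokens with 2 dimensions each.
dummy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(dummy_tokens)
print(tokens_224, pooled)
```

With real PyTorch outputs, the same reduction would be `outputs.last_hidden_state.mean(dim=1)`, again assuming that attribute and shape.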
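Advanced Usage - JAX
import requests
from PIL import Image
from transformers import AutoImageProcessor, FlaxAutoModel

# Download a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing pipeline that matches the checkpoint.
processor = AutoImageProcessor.from_pretrained(
    "apple/aimv2-large-patch14-224",
)
# FlaxAutoModel loads the Flax/JAX implementation of the same checkpoint.
model = FlaxAutoModel.from_pretrained(
    "apple/aimv2-large-patch14-224",
    trust_remote_code=True,
)

# Preprocess the image into JAX arrays and run the model.
inputs = processor(images=image, return_tensors="jax")
outputs = model(**inputs)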
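Extracted features are often compared with cosine similarity, for example to rank images by visual closeness. The following is a minimal, framework-independent sketch. The vectors are made-up stand-ins for pooled image features, and `cosine_similarity` is a hypothetical helper, not part of transformers or the AIMv2 code.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```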
📚 Detailed Documentation
@misc{fini2024multimodalautoregressivepretraininglarge,
author = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
url = {https://arxiv.org/abs/2411.14402},
eprint = {2411.14402},
eprintclass = {cs.CV},
eprinttype = {arXiv},
title = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
year = {2024},
}
If you find our work useful, please consider citing our paper.