# Perceiver IO for vision (fixed Fourier position embeddings)
A Perceiver IO model pre-trained on ImageNet, designed for vision tasks with fixed Fourier position embeddings.
## Quick Start
The Perceiver IO model presented here is pre-trained on ImageNet (14 million images, 1,000 classes) at a resolution of 224x224. It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in this repository.
Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.
## Features
- Modality-agnostic: Perceiver IO is a transformer encoder model applicable to any modality, such as text, images, audio, and video.
- Efficient self-attention: It applies the self-attention mechanism to a relatively small set of latent vectors, making time and memory requirements independent of the input size.
- Flexible decoding: Decoder queries flexibly decode the final hidden states of the latents to produce outputs of arbitrary size and semantics.
## Installation
No specific installation steps are provided in the original README.
## Usage Examples
### Basic Usage
Here is how to use this model in PyTorch for image classification:
```python
from transformers import PerceiverImageProcessor, PerceiverForImageClassificationFourier
from PIL import Image
import requests
import torch

processor = PerceiverImageProcessor.from_pretrained("deepmind/vision-perceiver-fourier")
model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image and run a forward pass (no gradients needed for inference)
inputs = processor(image, return_tensors="pt").pixel_values
with torch.no_grad():
    outputs = model(inputs)
logits = outputs.logits
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
# Predicted class: tabby, tabby cat
```
## Documentation
### Model description
Perceiver IO is a transformer encoder model that can handle various modalities. The core concept is to apply the self-attention mechanism to a limited set of latent vectors (e.g., 256 or 512) and to use the inputs only for cross-attention with the latents. This ensures that the time and memory requirements of the self-attention mechanism are independent of the input size.

To decode, the authors use decoder queries, which can flexibly transform the final hidden states of the latents into outputs of any size and semantics. For image classification, the output is a tensor of logits with shape `(batch_size, num_labels)`.
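The asymmetry described above can be illustrated with a toy cross-attention step. This is a minimal NumPy sketch, not the model's actual implementation: the dimensions are illustrative (the real input for a 224x224 image is 224*224 = 50,176 pixel positions), and the single-head, unprojected attention here stands in for the model's full multi-head attention.

```python
import numpy as np

# Illustrative sizes only (assumptions, not the released model's configuration):
M, N, d = 1024, 256, 64   # M input positions, N latents, toy channel dim

rng = np.random.default_rng(0)
inputs = rng.standard_normal((M, d))    # stand-in for flattened pixel features
latents = rng.standard_normal((N, d))   # stand-in for the learned latent array

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: queries come from the latents, keys/values from the inputs.
# The attention matrix is N x M (linear in the input size); the subsequent
# self-attention among the latents is only N x N, independent of M.
attn = softmax(latents @ inputs.T / np.sqrt(d))   # shape (N, M)
updated_latents = attn @ inputs                   # shape (N, d)
print(updated_latents.shape)  # (256, 64)
```

Because only the latents attend to each other, growing the input (more pixels) grows the cost linearly rather than quadratically.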
*Figure: Perceiver IO architecture.*
Since the self-attention mechanism's time and memory requirements are independent of the input size, the Perceiver IO authors can train the model directly on raw pixel values, unlike ViT, which uses patches. This specific model adds fixed Fourier 2D position embeddings to the pixel values.
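The fixed Fourier position features can be sketched as follows. This follows the general recipe of sine/cosine features over normalized 2D coordinates; the number of frequency bands and the maximum frequency here are illustrative assumptions, not the released checkpoint's exact configuration.

```python
import numpy as np

def fourier_features(h, w, num_bands=4, max_freq=10.0):
    """Fixed 2D Fourier position features (illustrative parameters)."""
    # Positions normalized to [-1, 1] along each spatial axis
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    grid = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (h, w, 2)
    # Linearly spaced frequencies per axis
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)
    angles = grid[..., None] * freqs * np.pi                      # (h, w, 2, num_bands)
    # sin/cos pair per axis and band, plus the raw normalized position itself
    feats = np.concatenate([np.sin(angles), np.cos(angles), grid[..., None]], axis=-1)
    return feats.reshape(h, w, -1)  # (h, w, 2 * (2*num_bands + 1))

pos = fourier_features(224, 224)
print(pos.shape)  # (224, 224, 18)
```

These position features are concatenated to the pixel values before the cross-attention step, so the model can recover spatial structure from otherwise unordered inputs.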
Through pre-training, the model learns an internal representation of images that can be used to extract features for downstream tasks. For example, if you have a labeled image dataset, you can train a standard classifier by replacing the classification decoder.
### Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine-tuned versions for tasks that may interest you.
### Training data
This model was pre-trained on ImageNet, a dataset with 14 million images and 1,000 classes.
### Training procedure
#### Preprocessing
Images are center-cropped, resized to a resolution of 224x224, and normalized across the RGB channels. Data augmentation was used during pre-training, as detailed in Appendix H of the paper.
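The crop-and-normalize step can be sketched in plain NumPy. This is a simplified illustration: the normalization statistics below (mean 0.5, std 0.5 per channel) are the defaults used by Hugging Face image processors for this model family, but in practice you should rely on `PerceiverImageProcessor.from_pretrained(...)` rather than hand-rolled constants, and resizing (interpolation) is omitted here.

```python
import numpy as np

def center_crop(img, size):
    """Crop a (H, W, C) array to (size, size, C) around its center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Hypothetical 256x256 RGB image with pixel values in [0, 255]
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (256, 256, 3)).astype(np.float32)

cropped = center_crop(img, 224)
x = cropped / 255.0                       # scale to [0, 1]
mean = np.array([0.5, 0.5, 0.5])          # per-channel stats (assumed defaults)
std = np.array([0.5, 0.5, 0.5])
normalized = (x - mean) / std             # values now in [-1, 1]
print(normalized.shape)  # (224, 224, 3)
```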
#### Pretraining
Hyperparameter details can be found in Appendix H of the paper.
### Evaluation results
This model achieves a top-1 accuracy of 79.0 on ImageNet-1k, and 84.5 when pre-trained on a large-scale dataset (JFT-300M, an internal dataset of Google).
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2107-14795,
  author    = {Andrew Jaegle and
               Sebastian Borgeaud and
               Jean{-}Baptiste Alayrac and
               Carl Doersch and
               Catalin Ionescu and
               David Ding and
               Skanda Koppula and
               Daniel Zoran and
               Andrew Brock and
               Evan Shelhamer and
               Olivier J. H{\'{e}}naff and
               Matthew M. Botvinick and
               Andrew Zisserman and
               Oriol Vinyals and
               Jo{\~{a}}o Carreira},
  title     = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&} Outputs},
  journal   = {CoRR},
  volume    = {abs/2107.14795},
  year      = {2021},
  url       = {https://arxiv.org/abs/2107.14795},
  eprinttype = {arXiv},
  eprint    = {2107.14795},
  timestamp = {Tue, 03 Aug 2021 14:53:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
## License
This model is licensed under the Apache-2.0 license.