# Perceiver IO for vision (fixed Fourier position embeddings)
A Perceiver IO model pre-trained on ImageNet, designed for vision tasks with fixed Fourier position embeddings.
## Quick Start
The Perceiver IO model presented here is pre-trained on ImageNet (14 million images, 1,000 classes) at a resolution of 224x224. It was introduced in the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Jaegle et al. and first released in this repository.
Disclaimer: The team releasing Perceiver IO did not write a model card for this model, so this model card has been written by the Hugging Face team.
## Features
- Modality-agnostic: Perceiver IO is a transformer encoder model applicable to any modality, such as text, images, audio, and video.
- Efficient self-attention: It applies the self-attention mechanism to a relatively small set of latent vectors, making time and memory requirements independent of the input size.
- Flexible decoding: Decoder queries flexibly decode the final hidden states of the latents to produce outputs of arbitrary size and semantics.
## Installation
No specific installation steps are provided in the original README.
## Usage Examples
### Basic Usage
Here is how to use this model in PyTorch for image classification:
```python
from transformers import PerceiverImageProcessor, PerceiverForImageClassificationFourier
from PIL import Image
import requests
import torch

processor = PerceiverImageProcessor.from_pretrained("deepmind/vision-perceiver-fourier")
model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess the image and run a forward pass (no gradients needed for inference)
inputs = processor(image, return_tensors="pt").pixel_values
with torch.no_grad():
    outputs = model(inputs)
logits = outputs.logits
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
# Predicted class: tabby, tabby cat
```
## Documentation
### Model description
Perceiver IO is a transformer encoder model that can handle various modalities. The core concept is to apply the self-attention mechanism to a limited set of latent vectors (e.g., 256 or 512) and to use the inputs only for cross-attention with the latents. This ensures that the time and memory requirements of the self-attention mechanism are independent of the input size.

To decode, the authors use decoder queries, which can flexibly transform the final hidden states of the latents into outputs of any size and semantics. For image classification, the output is a tensor of logits with shape `(batch_size, num_labels)`.
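The asymmetry described above can be illustrated with a toy cross-attention step. This is a minimal NumPy sketch, not the model's actual implementation: the dimensions are illustrative (the real input for a 224x224 image is 224*224 = 50,176 pixel positions), and the single-head, unprojected attention here stands in for the model's full multi-head attention.

```python
import numpy as np

# Illustrative sizes only (assumptions, not the released model's configuration):
M, N, d = 1024, 256, 64   # M input positions, N latents, toy channel dim

rng = np.random.default_rng(0)
inputs = rng.standard_normal((M, d))    # stand-in for flattened pixel features
latents = rng.standard_normal((N, d))   # stand-in for the learned latent array

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: queries come from the latents, keys/values from the inputs.
# The attention matrix is N x M (linear in the input size); the subsequent
# self-attention among the latents is only N x N, independent of M.
attn = softmax(latents @ inputs.T / np.sqrt(d))   # shape (N, M)
updated_latents = attn @ inputs                   # shape (N, d)
print(updated_latents.shape)  # (256, 64)
```

Because only the latents attend to each other, growing the input (more pixels) grows the cost linearly rather than quadratically.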
*Figure: Perceiver IO architecture.*
Since the self-attention mechanism's time and memory requirements are independent of the input size, the Perceiver IO authors can train the model directly on raw pixel values, unlike ViT, which uses patches. This specific model adds fixed Fourier 2D position embeddings to the pixel values.
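The fixed Fourier position features can be sketched as follows. This follows the general recipe of sine/cosine features over normalized 2D coordinates; the number of frequency bands and the maximum frequency here are illustrative assumptions, not the released checkpoint's exact configuration.

```python
import numpy as np

def fourier_features(h, w, num_bands=4, max_freq=10.0):
    """Fixed 2D Fourier position features (illustrative parameters)."""
    # Positions normalized to [-1, 1] along each spatial axis
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    grid = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (h, w, 2)
    # Linearly spaced frequencies per axis
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)
    angles = grid[..., None] * freqs * np.pi                      # (h, w, 2, num_bands)
    # sin/cos pair per axis and band, plus the raw normalized position itself
    feats = np.concatenate([np.sin(angles), np.cos(angles), grid[..., None]], axis=-1)
    return feats.reshape(h, w, -1)  # (h, w, 2 * (2*num_bands + 1))

pos = fourier_features(224, 224)
print(pos.shape)  # (224, 224, 18)
```

These position features are concatenated to the pixel values before the cross-attention step, so the model can recover spatial structure from otherwise unordered inputs.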
Through pre-training, the model learns an internal representation of images that can be used to extract features for downstream tasks. For example, if you have a labeled image dataset, you can train a standard classifier by replacing the classification decoder.
### Intended uses & limitations
You can use the raw model for image classification. Check the model hub for fine-tuned versions for tasks that may interest you.
### Training data
This model was pre-trained on ImageNet, a dataset with 14 million images and 1,000 classes.
### Training procedure
#### Preprocessing
Images are center-cropped, resized to a resolution of 224x224, and normalized across the RGB channels. Data augmentation was used during pre-training, as detailed in Appendix H of the paper.
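The crop-and-normalize step can be sketched in plain NumPy. This is a simplified illustration: the normalization statistics below (mean 0.5, std 0.5 per channel) are the defaults used by Hugging Face image processors for this model family, but in practice you should rely on `PerceiverImageProcessor.from_pretrained(...)` rather than hand-rolled constants, and resizing (interpolation) is omitted here.

```python
import numpy as np

def center_crop(img, size):
    """Crop a (H, W, C) array to (size, size, C) around its center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# Hypothetical 256x256 RGB image with pixel values in [0, 255]
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (256, 256, 3)).astype(np.float32)

cropped = center_crop(img, 224)
x = cropped / 255.0                       # scale to [0, 1]
mean = np.array([0.5, 0.5, 0.5])          # per-channel stats (assumed defaults)
std = np.array([0.5, 0.5, 0.5])
normalized = (x - mean) / std             # values now in [-1, 1]
print(normalized.shape)  # (224, 224, 3)
```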
#### Pretraining
Hyperparameter details can be found in Appendix H of the paper.
### Evaluation results
This model achieves a top-1 accuracy of 79.0 on ImageNet-1k, and 84.5 when pre-trained on a large-scale dataset (JFT-300M, an internal dataset of Google).
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2107-14795,
  author    = {Andrew Jaegle and
               Sebastian Borgeaud and
               Jean{-}Baptiste Alayrac and
               Carl Doersch and
               Catalin Ionescu and
               David Ding and
               Skanda Koppula and
               Daniel Zoran and
               Andrew Brock and
               Evan Shelhamer and
               Olivier J. H{\'{e}}naff and
               Matthew M. Botvinick and
               Andrew Zisserman and
               Oriol Vinyals and
               Jo{\~{a}}o Carreira},
  title     = {Perceiver {IO:} {A} General Architecture for Structured Inputs {\&} Outputs},
  journal   = {CoRR},
  volume    = {abs/2107.14795},
  year      = {2021},
  url       = {https://arxiv.org/abs/2107.14795},
  eprinttype = {arXiv},
  eprint    = {2107.14795},
  timestamp = {Tue, 03 Aug 2021 14:53:34 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2107-14795.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
## License
This model is licensed under the Apache-2.0 license.