RADIO Open-Source Visual Feature Extraction Model - Convert Images to Embedding Vectors for Free and Support Downstream Tasks


RADIO

Developed by nvidia

A visual feature extraction model developed by NVIDIA that converts images into embedding vectors for downstream tasks

Transformers

#Multi-resolution visual feature extraction #Cross-domain universal embedding #Dynamic block size adaptation

Downloads 5,166

Release Time : 12/11/2023

Model Overview

An image feature extraction model based on Vision Transformer architecture, supporting flexible input resolutions, with generated embeddings suitable for computer vision tasks such as image classification and semantic segmentation

Model Features

Flexible input resolution

Supports input resolutions up to 2048x2048 (in 16-pixel increments), adapting to various application scenarios

Dual output features

Simultaneously outputs global features (summary) and local spatial features (spatial_features) to meet different task requirements

Large-scale pre-training

Pre-trained on the DataComp dataset of 12.8 billion internet images gathered from Common Crawl, giving it strong general-purpose feature extraction capabilities

Model Capabilities

Image feature extraction

Image classification

Semantic segmentation

Visual embedding generation

Use Cases

Computer Vision

Image classification

Using RADIO-extracted image embeddings as input for downstream classifiers

Semantic segmentation

Utilizing RADIO's spatial features for dense prediction tasks

🚀 AM-RADIO: Reduce All Domains Into One

This model performs visual feature extraction, generating image embeddings for downstream image classification tasks. It's for research and development only.

🚀 Quick Start

To pull the model from HuggingFace, you first need to log in:

huggingface-cli login

Then you can pull the model from a Python script:

from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/RADIO", trust_remote_code=True)

Alternatively, you can specify an access token:

access_token = "<YOUR ACCESS TOKEN>"
model = AutoModel.from_pretrained("nvidia/RADIO", trust_remote_code=True, token=access_token)
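Hard-coding a token in a script makes it easy to leak. The following is a minimal sketch of one alternative, assuming the token is stored in an environment variable (`HF_TOKEN` here). The helper name `auth_kwargs` is hypothetical and is not part of the transformers API.

```python
import os

def auth_kwargs(env_var="HF_TOKEN"):
    """Build keyword arguments for from_pretrained from an environment variable.

    Returns an empty dict when the variable is unset or empty, so that the
    credentials cached by `huggingface-cli login` are used instead.
    (Hypothetical helper for illustration only.)
    """
    token = os.environ.get(env_var)
    return {"token": token} if token else {}

# Usage (this downloads the model, so it is left commented out here):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("nvidia/RADIO", trust_remote_code=True, **auth_kwargs())
```

With this pattern the same script works both on a machine where you have logged in interactively and in an environment where only the variable is set.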

✨ Features

Performs visual feature extraction, generating image embeddings for downstream models to classify images.
Flexible in input dimension, supporting inputs with width and height in the range $[14, 1008]$ as long as both axes are divisible by 14.

📦 Installation

No separate installation step is documented beyond pulling the model from HuggingFace as shown in Quick Start. The examples on this page additionally use the transformers and einops Python packages.

💻 Usage Examples

Basic Usage

RADIO will return a tuple with two tensors. The summary is similar to the cls_token in ViT and is meant to represent the general concept of the entire image. It has shape $(B,C)$ with $B$ being the batch dimension, and $C$ being some number of channels. The spatial_features represent more localized content which should be suitable for dense tasks such as semantic segmentation, or for integration into an LLM. It has shape $(B,T,D)$ with $T$ being the flattened spatial tokens, and $D$ being the channels for spatial features. Note that $C \neq D$ in general.

Converting to a spatial tensor format can be done using the downsampling (patch) size of the model, combined with the input tensor shape. For 'radio_v1', the patch size is 14.

from einops import rearrange
patch_size = 14  # patch size for 'radio_v1'
# x is the input image tensor of shape (B, 3, H, W) that was passed to the model
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=x.shape[-2] // patch_size, w=x.shape[-1] // patch_size)

The resulting tensor will have shape $(B,D,H,W)$, as is typically seen with computer vision models.
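To make the shape bookkeeping concrete, here is a small sketch that computes the flattened and spatial shapes for a given input, assuming a patch size of 14 and that the spatial tokens cover exactly the patch grid. The function name `spatial_grid_shape` is hypothetical, and `feature_dim=1280` is only a placeholder for $D$, not a documented value.

```python
def spatial_grid_shape(input_shape, feature_dim, patch_size=14):
    """Return ((B, T, D), (B, D, H', W')) for an input of shape (B, 3, H, W).

    H' and W' are the input height and width divided by the patch size,
    and T = H' * W' is the number of flattened spatial tokens.
    """
    batch, _, height, width = input_shape
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    flat = (batch, grid_h * grid_w, feature_dim)
    grid = (batch, feature_dim, grid_h, grid_w)
    return flat, grid

# feature_dim=1280 is a placeholder value chosen for illustration.
flat, grid = spatial_grid_shape((2, 3, 518, 518), feature_dim=1280)
print(flat)  # (2, 1369, 1280)
print(grid)  # (2, 1280, 37, 37)
```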

Advanced Usage

We have trained this model to be flexible in input dimension. It supports inputs with both width and height in the range $[14, 1008]$ as long as both axes are divisible by 14. We have found that summarization tokens work best at $H=W=378$ (although the range $[192, 448]$ works well). For spatial tasks, we used $H=W=518$ to perform linear probing for semantic segmentation, and it may perform better for higher-resolution tasks. Going up to $1008$, the model may need additional fine-tuning at that resolution for best results. It is not required that $H=W$, although we have not specifically trained or tested the model in this setting.
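Since arbitrary image sizes rarely satisfy the divisibility constraint, one option is to snap each side to the nearest supported value before resizing. The sketch below assumes the constraints stated above (multiples of 14 within $[14, 1008]$). `snap_resolution` is a hypothetical helper, not part of the RADIO code.

```python
def snap_resolution(size, patch_size=14, min_size=14, max_size=1008):
    """Round size to the nearest multiple of patch_size, clamped to [min_size, max_size].

    Assumes min_size and max_size are themselves multiples of patch_size.
    """
    snapped = int(size / patch_size + 0.5) * patch_size  # round half up
    return max(min_size, min(max_size, snapped))

print(snap_resolution(384))   # 378
print(snap_resolution(512))   # 518
print(snap_resolution(1200))  # 1008
```

The resulting sizes can then be passed to whatever image-resizing step precedes the model. Note that 384 and 512 snap to 378 and 518, the two resolutions highlighted above.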

📚 Documentation

Model Overview

Authors: Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov

This model performs visual feature extraction. For instance, RADIO generates image embeddings that can be used by a downstream model to classify images. This model is for research and development only. NVIDIA Research

References

Model Architecture

Property	Details
Architecture Type	Neural Network
Network Architecture	Vision Transformer

Input

Property	Details
Input Type(s)	Image
Input Format(s)	Red, Green, Blue (RGB)
Input Parameters	Two Dimensional (2D)
Other Properties Related to Input	Image resolutions up to 2048x2048 in increments of 16 pixels

Output

Property	Details
Output Type(s)	Embeddings
Output Format	Tensor
Output Parameters	2D
Other Properties Related to Output	Downstream model required to leverage image features

Software Integration

Runtime Engine(s):

TAO - 24.10

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace
NVIDIA Pascal
NVIDIA Turing
NVIDIA Volta

Supported Operating System(s):

Linux
Linux 4 Tegra
QNX
Windows

Pretrained Models

Refer to model_results.csv for model versions and their metrics. Link: https://huggingface.co/collections/nvidia/radio-669f77f1dd6b153f007dd1c6

Training, Testing, and Evaluation Datasets

Training Dataset

Property	Details
Link	https://www.datacomp.ai/
Data Collection Method	Automated
Labeling Method	Not Applicable (no labels are needed)
Properties	12.8 billion diverse images gathered from the Internet using Common Crawl

Evaluation Dataset

Property	Details
Link	ImageNet
Data Collection Method	Automated
Labeling Method	Human
Properties	This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images.

Inference

Property	Details
Engine	PyTorch
Test Hardware	A100

Citing RADIO

If you find this repository useful, please consider giving a star and citation:

@InProceedings{Ranzinger_2024_CVPR,
    author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
    title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12490-12500}
}

@misc{ranzinger2024phisdistributionbalancinglabelfree,
      title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation}, 
      author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
      year={2024},
      eprint={2410.01680},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01680}, 
}

Ethical Considerations (For NVIDIA Models Only)

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

📄 License

RADIO code and weights are released under the NSCLv1 License.
