RADIO-B Open-Source Visual Foundation Model - Unifying Visual Information Representation for Multiple Visual Tasks

RADIO-B

Developed by NVIDIA

RADIO is a vision foundation model developed by NVIDIA Research, capable of unifying visual information across different domains for various vision tasks.

Image Segmentation

Transformers

#Multimodal Visual Representation #Dense Semantic Segmentation #Cross-domain Unified Modeling


Release Time: 7/23/2024

Model Overview

RADIO is a vision foundation model that generates both holistic conceptual representations and localized content representations of images, suitable for dense tasks like semantic segmentation or integration with large language models.

Model Features

Unified Representation

Capable of unifying visual information across different domains, achieving cross-domain consistency.

Dual Output

Simultaneously outputs holistic conceptual representations and localized content representations of images, suitable for various downstream tasks.

Efficient Downsampling

Splits the input image into 14x14-pixel patches, so the spatial features are downsampled by a factor of 14 along each dimension.

Model Capabilities

Holistic Image Conceptual Representation

Localized Content Representation

Semantic Segmentation

Vision-Language Model Integration

Use Cases

Computer Vision

Semantic Segmentation

Utilizes the model's spatial features for pixel-level classification

Vision-Language Integration

Combines image representations with large language models for multimodal understanding
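For the semantic segmentation use case listed above, one common way to get pixel-level classes from patch-level features is a linear classifier applied to every patch, followed by upsampling. The sketch below uses NumPy with random stand-in features; `segment_from_patches`, the weight matrix, and the class count are hypothetical and are not part of the RADIO API.

```python
import numpy as np

def segment_from_patches(spatial_features, weight, bias, patch_size):
    """Classify each patch feature, then upsample the labels to pixel resolution.

    spatial_features: (H_p, W_p, D) patch features for one image
    weight, bias:     (D, num_classes) and (num_classes,), a hypothetical linear probe
    """
    logits = spatial_features @ weight + bias   # (H_p, W_p, num_classes)
    patch_labels = logits.argmax(axis=-1)       # (H_p, W_p)
    # Nearest-neighbour upsampling: every pixel inherits the label of its patch.
    rows = np.repeat(patch_labels, patch_size, axis=0)
    return np.repeat(rows, patch_size, axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 3, 8))   # stand-in for a 2x3 grid of 8-dim features
w = rng.normal(size=(8, 4))          # 4 hypothetical classes
b = np.zeros(4)
mask = segment_from_patches(feats, w, b, patch_size=14)
print(mask.shape)  # (28, 42)
```

In practice the linear probe would be trained on labelled masks, and bilinear upsampling of the logits is a common alternative to repeating labels.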

🚀 AM-RADIO: Reduce All Domains Into One

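AM-RADIO is an agglomerative vision foundation model: it distills several teacher foundation models (such as CLIP, DINOv2, and SAM) into a single student model, reducing all domains into one. The result is a single backbone that can serve a range of vision tasks.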

Authors: Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov

Affiliation: NVIDIA Research

🚀 Quick Start

HuggingFace Hub

You can pull the model from a Python script:

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

hf_repo = "nvidia/RADIO-B"

image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()

image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()

# Inference only, so skip gradient tracking.
with torch.no_grad():
    summary, spatial_features = model(pixel_values)

💻 Usage Examples

Basic Usage

RADIO returns a tuple of two tensors. The summary is similar to the cls_token in ViT and is meant to represent the general concept of the entire image. It has shape $(B,C)$, where $B$ is the batch dimension and $C$ is the number of summary channels. The spatial_features tensor represents more localized content, which should be suitable for dense tasks such as semantic segmentation or for integration into an LLM. It has shape $(B,T,D)$, where $T$ is the number of flattened spatial tokens and $D$ is the number of channels for spatial features. Note that $C \neq D$ in general.
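As a quick sanity check on the layout described above, the two shapes can be validated before they are passed downstream. This is a minimal sketch in plain Python; `describe_radio_outputs` is a hypothetical helper, and the channel and token counts in the example call are made-up placeholders, not the real dimensions of RADIO-B.

```python
def describe_radio_outputs(summary_shape, spatial_shape):
    """Check the (B, C) / (B, T, D) layout and return the named dimensions."""
    summary_shape = tuple(summary_shape)
    spatial_shape = tuple(spatial_shape)
    if len(summary_shape) != 2:
        raise ValueError(f"summary should be (B, C), got {summary_shape}")
    if len(spatial_shape) != 3:
        raise ValueError(f"spatial features should be (B, T, D), got {spatial_shape}")
    if summary_shape[0] != spatial_shape[0]:
        raise ValueError("batch dimensions of the two outputs differ")
    batch, c = summary_shape
    _, t, d = spatial_shape
    return {"B": batch, "C": c, "T": t, "D": d}

# Placeholder shapes for illustration only.
dims = describe_radio_outputs((2, 1024), (2, 256, 768))
print(dims)
```

With real model outputs, the call would take the `.shape` attribute of each tensor, since a `torch.Size` converts cleanly to a tuple.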

To convert the spatial features into a spatial tensor layout, divide the height and width of the input tensor by the model's downsampling size (its patch size) to get the token grid. For 'radio_v1', the patch size is 14.

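from einops import rearrange
patch_size = 14  # patch size stated above for 'radio_v1'
# pixel_values is the (B, 3, H, W) input batch that was passed to the model
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=pixel_values.shape[-2] // patch_size, w=pixel_values.shape[-1] // patch_size)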

The resulting tensor will have shape $(B,D,H,W)$, as is typically seen with computer vision models.
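The same rearrangement can be written without einops as a reshape followed by an axis permutation. The sketch below uses NumPy on a tiny integer array so the token-to-grid mapping is easy to inspect; `tokens_to_grid` is a hypothetical helper, and it assumes the tokens are stored in row-major order, as the `(h w)` pattern above implies.

```python
import numpy as np

def tokens_to_grid(spatial_features, image_hw, patch_size):
    """(B, T, D) -> (B, D, H_p, W_p), assuming row-major token order."""
    b, t, d = spatial_features.shape
    h = image_hw[0] // patch_size
    w = image_hw[1] // patch_size
    if h * w != t:
        raise ValueError(f"expected {h * w} tokens for a {h}x{w} grid, got {t}")
    # Split T into (h, w), then move the channel axis in front of the grid axes.
    return spatial_features.reshape(b, h, w, d).transpose(0, 3, 1, 2)

tokens = np.arange(1 * 6 * 2).reshape(1, 6, 2)   # B=1, T=6, D=2
grid = tokens_to_grid(tokens, image_hw=(28, 42), patch_size=14)
print(grid.shape)  # (1, 2, 2, 3)
```

Token `k` lands at row `k // w`, column `k % w` of the grid. For a PyTorch tensor the equivalent is `reshape(b, h, w, d).permute(0, 3, 1, 2)`. The token-count check is worth keeping, since it fails loudly if the patch size assumed here does not match the model that produced the features.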

📚 Documentation

RADIOv2.5 Notes

See the RADIOv2.5 technical report.

📄 License

RADIO code and weights are released under the NSCLv1 License.

Citing RADIO

If you find this repository useful, please consider starring it and citing:

@InProceedings{Ranzinger_2024_CVPR,
    author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
    title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12490-12500}
}

@misc{ranzinger2024phisdistributionbalancinglabelfree,
      title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation}, 
      author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
      year={2024},
      eprint={2410.01680},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01680}, 
}
