C-RADIO: Visual Feature Extraction Model
This model performs visual feature extraction. For example, RADIO generates image embeddings that a downstream model can use for image classification.
Quick Start
Installation
Install the required libraries with pip:
pip install torch transformers pillow einops
Usage Example
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
hf_repo = "nvidia/C-RADIO"
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()
image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()
summary, features = model(pixel_values)
Advanced Usage
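from einops import rearrange
# Assumption: a patch size of 16, matching the 16-pixel resolution increments listed under Input.
patch_size = 16
spatial_features = rearrange(features, 'b (h w) d -> b d h w', h=pixel_values.shape[-2] // patch_size, w=pixel_values.shape[-1] // patch_size)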
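The 'b (h w) d -> b d h w' pattern treats the spatial tokens as a patch grid flattened in row-major order. The following pure-Python sketch shows that bookkeeping without loading the model. The helper names patch_grid and token_to_cell are hypothetical, and the patch size of 16 is an assumption taken from the 16-pixel resolution increments listed in this document, not a value read from the model.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols) of the patch grid for an image of the given size.

    patch_size=16 is an assumption for illustration, not read from the model.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return height // patch_size, width // patch_size


def token_to_cell(index, cols):
    """Map a flattened token index to (row, col), assuming row-major order."""
    return divmod(index, cols)


rows, cols = patch_grid(224, 448)
print(rows, cols, rows * cols)  # 14 28 392: grid size and number of spatial tokens
print(token_to_cell(30, cols))  # (1, 2): token 30 sits in row 1, column 2
```

The product rows * cols should equal the token dimension of the spatial features for the same input size, which is a quick sanity check before calling rearrange.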
Features
- Feature Extraction: RADIO can generate image embeddings for downstream models to perform image classification.
- Flexible Output: It returns a tuple containing two tensors, summary and spatial_features, suitable for different tasks.
Documentation
Model Overview
This model performs visual feature extraction. For instance, RADIO generates image embeddings that can be used by a downstream model to classify images.
Model Architecture
| Property | Details |
| --- | --- |
| Architecture Type | Neural Network |
| Network Architecture | Vision Transformer |
Input
| Property | Details |
| --- | --- |
| Input Type(s) | Image |
| Input Format(s) | Red, Green, Blue (RGB) pixel values in the [0, 1] range |
| Input Parameters | Two-Dimensional (2D) |
| Other Properties Related to Input | Image resolutions up to 2048x2048 in increments of 16 pixels |
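The resolution constraint can be enforced before preprocessing. The sketch below rounds each side down to a multiple of 16 and clamps it to a maximum. The helper name snap_resolution is hypothetical, and the 2048-pixel limit per side is an assumption based on the figures in this section.

```python
def snap_resolution(height, width, step=16, max_side=2048):
    """Round each side down to a multiple of `step`, clamped to [step, max_side].

    step and max_side are assumptions drawn from the input description,
    not values queried from the model.
    """
    def snap(side):
        side = min(side, max_side)
        return max(step, (side // step) * step)

    return snap(height), snap(width)


print(snap_resolution(1080, 1920))  # (1072, 1920)
print(snap_resolution(3000, 500))   # (2048, 496)
```

Rounding down rather than to the nearest multiple means the target size never exceeds the source image, so a resize to the snapped size only ever shrinks or crops.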
Output
| Property | Details |
| --- | --- |
| Output Type(s) | Embeddings |
| Output Format | Tensor |
| Output Parameters | 2D |
| Other Properties Related to Output | Downstream model required to leverage image features |
Usage
RADIO returns a tuple of two tensors. The summary is similar to the cls_token in ViT and is meant to represent the general concept of the entire image. It has shape (B, C), with B being the batch dimension and C being some number of channels. The spatial_features represent more localized content, which should be suitable for dense tasks such as semantic segmentation or for integration into an LLM.
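As one illustration of a downstream consumer of the summary tensor, the sketch below classifies a single summary vector by cosine similarity to per-class centroids (a nearest-centroid classifier). This is a stand-in technique chosen for brevity, not the method used to evaluate the model. The vectors are made up, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(summary_vector, class_centroids):
    """Return the label whose centroid is most similar to the summary vector."""
    return max(
        class_centroids,
        key=lambda label: cosine_similarity(summary_vector, class_centroids[label]),
    )


# Made-up 4-dimensional vectors standing in for rows of the (B, C) summary tensor.
centroids = {"cat": [0.9, 0.1, 0.0, 0.2], "dog": [0.1, 0.8, 0.3, 0.0]}
print(classify([0.7, 0.2, 0.1, 0.1], centroids))  # cat
```

In practice the centroids would be averages of summary vectors computed from labeled images, and a trained linear layer on top of the summary is a common alternative.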
Software Integration
| Property | Details |
| --- | --- |
| Runtime Engine(s) | TAO - 24.10 |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, NVIDIA Volta |
| Supported Operating System(s) | Linux, Linux 4 Tegra, QNX, Windows |
Training, Testing, and Evaluation Datasets
Training Dataset
| Property | Details |
| --- | --- |
| Dataset Name | NV-CC-Img-Text-Dataset |
| Data Collection Method | Automated |
| Labeling Method | Not Applicable (no labels are needed) |
| Properties | 700 Million Images |
Evaluation Dataset
| Property | Details |
| --- | --- |
| Link | ImageNet |
| Data Collection Method | Automated |
| Labeling Method | Human |
| Properties | This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images. |
Inference
| Property | Details |
| --- | --- |
| Engine | PyTorch |
| Test Hardware | A100 |
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Users should evaluate the model for safety and quality for a specific use case and build additional guardrails as appropriate.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Technical Details
This model is based on the Vision Transformer architecture. It takes RGB images as input and outputs image embeddings as tensors. Input pixel values must lie in the [0, 1] range, and image resolutions can go up to 2048x2048 in increments of 16 pixels. Downstream models can use the output embeddings for tasks such as image classification.
License
This model is governed by the NVIDIA Open Model License Agreement.
References