C-RADIO: Visual Feature Extraction Model for Generating Image Embeddings

C-RADIO

Developed by NVIDIA

A visual feature extraction model developed by NVIDIA for generating image embeddings, supporting downstream tasks such as image classification.

Transformers

Open Source License: Other #Visual Feature Extraction #Multi-Hardware Compatibility #High-Resolution Processing

Downloads 398

Release Time: 5/29/2024

Model Overview

C-RADIO is a vision transformer model focused on extracting features from images to generate embeddings for downstream tasks.

Model Features

Efficient Visual Feature Extraction

Capable of extracting both global and local features from images, suitable for various computer vision tasks.

High-Resolution Support

Supports image inputs up to 2048x2048 resolution in 16-pixel increments.

Multi-Hardware Compatibility

Compatible with multiple NVIDIA hardware architectures including Ampere, Blackwell, and Jetson.

Model Capabilities

Image Feature Extraction

Generating Image Embeddings

Supporting Downstream Vision Tasks

Use Cases

Computer Vision

Image Classification

Using the model's extracted image embeddings for image classification tasks.

Semantic Segmentation

Utilizing the model's spatial features for dense prediction tasks like semantic segmentation.
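
As a rough illustration of the dense-prediction use case, the sketch below expands a grid of per-patch class labels into a per-pixel label map by nearest-neighbor upsampling. It assumes a patch size of 16 (inferred from the 16-pixel input increments) and a hypothetical downstream head that has already assigned one label to each patch; the function name is illustrative and is not part of the model's API. Real segmentation pipelines typically interpolate logits or use a learned decoder instead.

```python
def upsample_patch_labels(patch_labels, patch_size=16):
    """Expand a rows x cols grid of per-patch labels to a per-pixel label map."""
    pixel_map = []
    for row in patch_labels:
        # Repeat each label patch_size times horizontally...
        expanded_row = [label for label in row for _ in range(patch_size)]
        # ...then repeat the expanded row patch_size times vertically.
        pixel_map.extend(list(expanded_row) for _ in range(patch_size))
    return pixel_map


# A 2 x 2 patch grid, as would come from a 32 x 32 pixel input (assumed patch size 16).
patch_labels = [[0, 1], [2, 1]]
pixels = upsample_patch_labels(patch_labels, patch_size=16)
print(len(pixels), len(pixels[0]))
print(pixels[0][15], pixels[0][16], pixels[16][0])
```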

🚀 C-RADIO: Visual Feature Extraction Model

This model specializes in visual feature extraction. For example, RADIO can generate image embeddings that a downstream model can use for image classification.

🚀 Quick Start

Installation

Install the required libraries with pip:

pip install torch transformers pillow einops

Usage Example

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

hf_repo = "nvidia/C-RADIO"

# Load the preprocessor and the model (trust_remote_code runs the custom model code from the repo)
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()

# Preprocess one RGB image into a batched tensor on the GPU
image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()

# Inference only, so skip gradient tracking
with torch.no_grad():
    summary, spatial_features = model(pixel_values)

Advanced Usage

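from einops import rearrange
patch_size = 16  # assumption: matches the 16-pixel input increment; verify against the model config
# spatial_features: second output of model(pixel_values), shape (B, h*w, D) -> (B, D, h, w)
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=pixel_values.shape[-2] // patch_size, w=pixel_values.shape[-1] // patch_size)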
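
To make the reshape concrete, here is a minimal pure-Python sketch of the index arithmetic behind the 'b (h w) d -> b d h w' pattern. It assumes a patch size of 16 (inferred from the 16-pixel input increments, not confirmed for this checkpoint) and row-major token order, which is what the (h w) grouping in einops implies. The helper names are hypothetical and only illustrate the mapping from a flat token index to the image region it covers.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return height // patch_size, width // patch_size


def token_to_pixel_box(token_index, height, width, patch_size=16):
    """Map a flat token index to the (top, left, bottom, right) pixel box it covers."""
    rows, cols = patch_grid(height, width, patch_size)
    if not 0 <= token_index < rows * cols:
        raise IndexError("token index out of range")
    # In '(h w)', h is the outer axis, so index = row * cols + col.
    row, col = divmod(token_index, cols)
    top, left = row * patch_size, col * patch_size
    return top, left, top + patch_size, left + patch_size


rows, cols = patch_grid(224, 224)
print(rows, cols, rows * cols)
print(token_to_pixel_box(15, 224, 224))
```

Under these assumptions, a 224x224 input yields a 14x14 grid, so the flat token axis has 196 entries before the reshape.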

✨ Features

Feature Extraction: RADIO can generate image embeddings for downstream models to perform image classification.
Flexible Output: It returns a tuple containing two tensors, summary and spatial_features, suitable for different tasks.

📚 Documentation

Model Overview

This model performs visual feature extraction. For instance, RADIO generates image embeddings that can be used by a downstream model to classify images.

Model Architecture

Property	Details
Architecture Type	Neural Network
Network Architecture	Vision Transformer

Input

Property	Details
Input Type(s)	Image
Input Format(s)	Red, Green, Blue (RGB) pixel values in [0, 1] range.
Input Parameters	Two Dimensional (2D)
Other Properties Related to Input	Image resolutions up to 2048x2048 in increments of 16 pixels
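
The resolution constraint in the table can be sketched as a small helper that rounds each image side down to a multiple of 16 before preprocessing. This is only an illustration: the function names are hypothetical, the cap is supplied by the caller rather than taken from the model, and the value 2048 in the usage lines is an example.

```python
def snap_side(size, multiple=16, limit=None):
    """Round one image side down to a multiple of `multiple`, optionally capped at `limit`."""
    if size < 1:
        raise ValueError("size must be positive")
    if limit is not None:
        size = min(size, limit)
    snapped = (size // multiple) * multiple
    # Very small sides would round to 0, so fall back to a single multiple.
    return max(snapped, multiple)


def snap_resolution(height, width, multiple=16, limit=None):
    """Apply snap_side to both sides independently."""
    return snap_side(height, multiple, limit), snap_side(width, multiple, limit)


print(snap_resolution(1000, 750))
print(snap_resolution(3000, 500, limit=2048))
```

Each side is snapped independently, so the aspect ratio can shift slightly; resize with the aspect ratio in mind first if that matters for the task.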

Output

Property	Details
Output Type(s)	Embeddings
Output Format	Tensor
Output Parameters	2D
Other Properties Related to Output	Downstream model required to leverage image features

Usage

RADIO returns a tuple of two tensors. The summary is similar to the cls_token in ViT and represents the general concept of the entire image. It has shape (B, C), where B is the batch dimension and C is the number of channels. The spatial_features represent more localized content and have shape (B, T, D), where T is the number of flattened patch tokens and D is the feature dimension. They are suited to dense tasks such as semantic segmentation, or to integration into an LLM.
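
One simple way a downstream model could use the summary tensor for classification is a nearest-centroid classifier over cosine similarity. The sketch below is an assumption about a possible downstream setup, not a method prescribed by the model card. It uses toy 3-dimensional vectors in place of real summary embeddings, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine_similarity(embedding, centroids[label]))


# Toy stand-ins for summary vectors; real ones would come from the model's first output.
train = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.3]]
labels = ["cat", "cat", "dog", "dog"]
centroids = class_centroids(train, labels)
print(classify([0.8, 0.3, 0.0], centroids))
```

In practice a trained linear layer on top of the summary vector is a common alternative. The centroid approach is shown here because it needs no training loop.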

Software Integration

Property	Details
Runtime Engine(s)	TAO - 24.10
Supported Hardware Microarchitecture Compatibility	NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, NVIDIA Volta
[Preferred/Supported] Operating System(s)	Linux, Linux 4 Tegra, QNX, Windows

Training, Testing, and Evaluation Datasets

Training Dataset

Property	Details
Dataset Name	NV-CC-Img-Text-Dataset
Data Collection Method	Automated
Labeling Method	Not Applicable (no labels are needed)
Properties	700 Million Images

Evaluation Dataset

Property	Details
Link	ImageNet
Data Collection Method	Automated
Labeling Method	Human
Properties	This dataset spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images.

Inference

Property	Details
Engine	PyTorch
Test Hardware	A100

Ethical Considerations (For NVIDIA Models Only)

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Users should evaluate the model for safety and quality for a specific use case and build additional guardrails as appropriate.

Please report security vulnerabilities or NVIDIA AI Concerns here.

🔧 Technical Details

This model is based on the Vision Transformer architecture. It takes RGB images as input and outputs image embeddings as tensors. Input pixel values should be in the [0, 1] range, and resolutions up to 2048x2048 are supported in increments of 16 pixels. Downstream models can use the output embeddings for tasks such as image classification.

📄 License

This model is governed by the NVIDIA Open Model License Agreement.
