# RADIO Image Feature Extraction Model
This model is designed for visual feature extraction, generating image embeddings that can be utilized by downstream models for tasks like image classification.
## 🚀 Quick Start
This model can generate image embeddings for downstream applications. Here's a basic example of how to use it:
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

hf_repo = "nvidia/C-RADIOv2-g"

# Load the preprocessor and model; trust_remote_code is required because
# the model implementation is distributed with the repository.
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()

# Preprocess an RGB image into a batched tensor and move it to the GPU.
image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()

# Forward pass: returns a pooled summary embedding and per-patch features.
summary, features = model(pixel_values)
```
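The forward pass returns two tensors: `summary`, a pooled per-image embedding, and `features`, per-patch embeddings (see the Explainability section below). A quick way to inspect them; the layouts noted here follow from the reshape in Advanced Usage and are otherwise an assumption:

```python
# summary:  (batch, embed_dim)              pooled image embedding
# features: (batch, num_patches, embed_dim) flattened per-patch embeddings
print(summary.shape, features.shape)
```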
## ✨ Features
- Multiple Sizes: C-RADIOv2 models are available in multiple sizes, including Base (90M parameters), Large (320M parameters), Huge (653M parameters), and Gigantic (1.1B parameters); a loading sketch follows this list.
- Enhanced Training: C-RADIOv2 was trained for 1M steps (400k more than v1), using inverse frequency sampling for data balancing and PHI Standardization for teacher distribution balancing.
- Commercial Use: This model is ready for commercial and non-commercial use.
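The Quick Start example loads the Gigantic checkpoint. Assuming the four sizes share the same loading interface (the repository ids below come from the Model Version(s) section), a smaller variant can be selected by swapping the repository id:

```python
from transformers import AutoModel, CLIPImageProcessor

# Size name -> Hugging Face repository id, per the Model Version(s) section.
C_RADIO_V2_REPOS = {
    "base": "nvidia/C-RADIOv2-B",      # 90M parameters
    "large": "nvidia/C-RADIOv2-L",     # 320M parameters
    "huge": "nvidia/C-RADIOv2-H",      # 653M parameters
    "gigantic": "nvidia/C-RADIOv2-g",  # 1.1B parameters
}

hf_repo = C_RADIO_V2_REPOS["base"]
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
```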
## 📦 Installation

To use this model, you need to install the `transformers` library:

```bash
pip install transformers
```
## 💻 Usage Examples

### Basic Usage
Basic usage is identical to the Quick Start example above: load the image processor and model, preprocess an RGB image, and run a forward pass to obtain the `summary` and `features` tensors.
### Advanced Usage

The `features` tensor holds flattened patch tokens of shape `(B, H*W, D)`. Continuing the Quick Start example, it can be reshaped into a `(B, D, H, W)` spatial feature map:

```python
from einops import rearrange

# C-RADIOv2 uses 16-pixel patches, matching the 16-pixel resolution increments noted below.
spatial_features = rearrange(features, 'b (h w) d -> b d h w',
                             h=pixel_values.shape[-2] // 16, w=pixel_values.shape[-1] // 16)
```
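Once in `(B, D, H, W)` layout, the map can be consumed like any convolutional feature map. A minimal sketch of dense processing; the upsampling step is an illustration, not part of the model:

```python
import torch.nn.functional as F

# Bilinearly upsample patch features to the input resolution,
# e.g. before a semantic segmentation or depth estimation head.
dense_features = F.interpolate(spatial_features, size=pixel_values.shape[-2:],
                               mode='bilinear', align_corners=False)
```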
## 📚 Documentation

### Model Overview
[Github] [CVPR 2025] [CVPR 2024]
#### Description
This model performs visual feature extraction. For instance, RADIO generates image embeddings that can be used by a downstream model to classify images.
#### Deployment Geography
Global.
#### Use Case

The embeddings generated by this model are expected to be used by a downstream application, such as image-level understanding (image classification, curation, etc.), dense processing (semantic segmentation, depth estimation, etc.), or integration into a Vision-Language Model.
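For instance, a minimal linear-probe sketch for image classification on top of the summary embedding, continuing the Quick Start example (the head and class count are illustrative assumptions, not part of this model):

```python
import torch

num_classes = 1000             # hypothetical label space
embed_dim = summary.shape[-1]  # embedding width of the loaded checkpoint

# A single linear head over frozen RADIO embeddings; training loop omitted.
classifier = torch.nn.Linear(embed_dim, num_classes).cuda()
logits = classifier(summary)
probs = logits.softmax(dim=-1)
```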
#### Release Date

Hugging Face: 03/26/2025, via the RADIO Collection of Models.
### Model Architecture

| Property | Details |
|----------|---------|
| Model Type | Vision Transformer |
| Architecture Type | Neural Network |
| Network Architecture | Vision Transformer |
### Input

| Property | Details |
|----------|---------|
| Input Type(s) | Image |
| Input Format(s) | Red, Green, Blue (RGB) |
| Input Parameters | Two-Dimensional (2D) |
| Other Properties Related to Input | Image resolutions up to 2048x2048, in increments of 16 pixels |
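Because supported resolutions advance in 16-pixel increments (the tested range is 256 to 2048 pixels, per the Explainability section), inputs may need snapping to a supported size before preprocessing. A minimal sketch; the rounding policy is an assumption, not something this card prescribes:

```python
from PIL import Image

def resize_to_supported(image: Image.Image, multiple: int = 16) -> Image.Image:
    # Snap each side down to a multiple of 16, clamped to the tested 256-2048 range.
    w = max(256, min(2048, (image.width // multiple) * multiple))
    h = max(256, min(2048, (image.height // multiple) * multiple))
    return image.resize((w, h), Image.BILINEAR)
```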
### Output

| Property | Details |
|----------|---------|
| Output Type(s) | Embeddings |
| Output Format | Tensor |
| Output Parameters | 2D |
| Other Properties Related to Output | Downstream model required to leverage image features |
### Software Integration

| Property | Details |
|----------|---------|
| Runtime Engine(s) | TAO - 24.10 |
| Supported Hardware Microarchitecture Compatibility | NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, NVIDIA Volta |
| [Preferred/Supported] Operating System(s) | Linux, Linux 4 Tegra, QNX, Windows |
### Model Version(s)

- C-RADIOv2-B (90M parameters)
- C-RADIOv2-L (320M parameters)
- C-RADIOv2-H (653M parameters)
- C-RADIOv2-g (1.1B parameters)

Links:

- https://huggingface.co/nvidia/C-RADIOv2-B
- https://huggingface.co/nvidia/C-RADIOv2-L
- https://huggingface.co/nvidia/C-RADIOv2-H
- https://huggingface.co/nvidia/C-RADIOv2-g
### Training and Evaluation Datasets

#### Training Dataset

- Dataset Name: NV-CC-Img-Text-Dataset
- Data Collection Method: Automated
- Labeling Method: Not Applicable (no labels are needed)
- Properties: 700 million images

#### Evaluation Dataset

- Link: ImageNet
- Data Collection Method: Automated
- Labeling Method: Human
- Properties: The dataset spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images.
### Inference

| Property | Details |
|----------|---------|
| Engine | PyTorch |
| Test Hardware | A100 |
### Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with the terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below.
Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
#### Bias

| Field | Response |
|-------|----------|
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |
#### Explainability

| Field | Response |
|-------|----------|
| Intended Application & Domain | Visual Feature Extraction |
| Model Type | Vision Transformer |
| Intended Users | Developers of downstream vision applications |
| Output | Image embeddings |
| Describe how the model works | The model takes an image as input, processes it through multiple transformer blocks, and outputs summary and patch embeddings. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | Not Applicable |
| Technical Limitations | This model generates image embeddings that can be used by a downstream model to, for example, classify images. The downstream model must be trained to leverage the visual embeddings. |
| Verified to have met prescribed NVIDIA quality standards | Yes |
| Performance Metrics | Image classification accuracy, semantic segmentation mean intersection-over-union (mIoU). |
| Potential Known Risks | This model has only been tested on input resolutions ranging from 256 to 2048 pixels, in increments of 16 pixels. Additionally, the generated embeddings might fail to disambiguate differences that appear evident to humans (e.g., two images showing different breeds of dogs might produce very similar embeddings). Domain-specific evaluation is required for the target application. |
| Licensing | [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf) |
#### Privacy

| Field | Response |
|-------|----------|
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None |
| How often is dataset reviewed? | Before Every Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
#### Safety

| Field | Response |
|-------|----------|
| Model Application(s) | Generation of visual embeddings |
| Describe the life-critical impact (if present) | Not Applicable |
| Use Case Restrictions | Abide by the NVIDIA Open Model License Agreement |
| Model and dataset restrictions | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |
## 📄 License

Use of this model is governed by the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).