RADIO-B open-source vision foundation model: unified visual representations for multiple vision tasks

RADIO B

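Developed by NVIDIA

RADIO is a vision foundation model developed by NVIDIA Research. It represents visual information from different domains in a unified way and is suitable for a variety of vision tasks.

Image Segmentation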

Transformers

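#MultimodalVisualRepresentation #DenseSemanticSegmentation #CrossDomainUnifiedModeling

Downloads: 999

Release date: 7/23/2024

Model Overview

RADIO is a vision foundation model that produces both a global summary representation of an image and local spatial features. The spatial features suit dense tasks such as semantic segmentation, or integration with a large language model.

Model Features

Unified representation

Represents visual information from different domains in a single model, in line with the "Reduce All Domains Into One" goal in its name.

Dual outputs

Returns an image-level summary representation and local spatial features at the same time, which suits a range of downstream tasks.

Efficient downsampling

Extracts spatial features by splitting the image into 14x14 patches.

Model Capabilities

Image-level summary representation

Local content representation

Semantic segmentation

Vision-language model integration

Use Cases

Computer vision

Semantic segmentation

Use the spatial features output by the model for pixel-level classification.

Vision-language integration

Combine image representations with a large language model for multimodal understanding.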
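
For the semantic segmentation use case above, a rough illustration of how patch-level predictions become a pixel-level label map may help. Suppose some classifier head (not specified in this card) has already assigned a class id to every patch. The sketch below expands that grid to pixel resolution by nearest-neighbour repetition. This is a deliberately simplified assumption: practical pipelines typically upsample logits with interpolation instead. `upsample_labels` is a hypothetical helper, not part of RADIO.

```python
def upsample_labels(patch_labels, patch_size):
    """Expand an h x w grid of class ids to an (h*patch_size) x (w*patch_size) map."""
    pixel_rows = []
    for row in patch_labels:
        # Repeat every label patch_size times along the width...
        expanded = [label for label in row for _ in range(patch_size)]
        # ...then repeat the expanded row patch_size times along the height.
        pixel_rows.extend(list(expanded) for _ in range(patch_size))
    return pixel_rows


patch_labels = [[0, 1],
                [2, 2]]  # 2 x 2 patches with made-up class ids
pixels = upsample_labels(patch_labels, patch_size=2)
for row in pixels:
    print(row)
# [0, 0, 1, 1]
# [0, 0, 1, 1]
# [2, 2, 2, 2]
# [2, 2, 2, 2]
```

Each patch label simply covers the block of pixels that the patch was cut from, so the output has one class id per input pixel.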

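🚀 AM-RADIO: Reduce All Domains Into One

AM-RADIO is an agglomerative vision foundation model: it merges knowledge from different visual domains into a single model, so that one backbone can serve many computer vision tasks.

🚀 Quick Start

You can pull the model from the HuggingFace Hub and use it directly. The following Python example shows how: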

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

hf_repo = "nvidia/RADIO-B"

# Load the preprocessor and the model; trust_remote_code=True runs the model code shipped in the repo.
image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()  # inference mode, on the GPU

image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()

# summary: (B, C) image-level vector; spatial_features: (B, T, D) per-patch tokens
with torch.no_grad():
    summary, spatial_features = model(pixel_values)

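💻 Usage Examples

Basic usage

RADIO returns a tuple of two tensors. The first, summary, is similar to the cls_token in ViT and represents the general concept of the whole image. Its shape is $(B,C)$, where $B$ is the batch dimension and $C$ is the number of channels. The second, spatial_features, represents more localized content and is suited to dense tasks such as semantic segmentation, or to integration into a large language model. Its shape is $(B,T,D)$, where $T$ is the number of flattened spatial tokens and $D$ is the number of channels of the spatial features. In general, $C \neq D$.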
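
The shape relationships above can be checked with a little arithmetic. The sketch below uses a hypothetical helper (`expected_output_shapes` is not part of RADIO or transformers) to compute the shapes of the two outputs for a given input size. It assumes the input height and width are exact multiples of the patch size and that there is one spatial token per patch. The channel sizes in the usage lines are placeholders, not RADIO-B's actual dimensions.

```python
def expected_output_shapes(batch, height, width, patch_size, summary_dim, spatial_dim):
    """Return the (summary, spatial_features) shapes as plain tuples.

    Hypothetical helper for illustration only; assumes one spatial token per
    non-overlapping patch_size x patch_size patch.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    tokens = (height // patch_size) * (width // patch_size)  # T
    summary_shape = (batch, summary_dim)          # (B, C)
    spatial_shape = (batch, tokens, spatial_dim)  # (B, T, D)
    return summary_shape, spatial_shape


# Placeholder channel sizes (C=1024, D=512), chosen only to show that C != D is allowed.
summary_shape, spatial_shape = expected_output_shapes(
    batch=2, height=224, width=224, patch_size=14, summary_dim=1024, spatial_dim=512
)
print(summary_shape)  # (2, 1024)
print(spatial_shape)  # (2, 256, 512)
```

With a 224x224 input and a patch size of 14, the grid is 16x16, which gives the 256 tokens shown in the second shape.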

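Advanced usage

To convert spatial_features into a spatial tensor layout, combine the model's downsampling size (the patch size) with the shape of the input tensor. For 'radio_v1', the patch size is 14.

from einops import rearrange
# spatial_features is the second tensor returned by model(pixel_values)
patch_size = 14  # value stated above for 'radio_v1'
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=pixel_values.shape[-2] // patch_size, w=pixel_values.shape[-1] // patch_size)

The resulting tensor has shape $(B,D,H,W)$, where $H$ and $W$ are the numbers of patch rows and columns. This is the layout commonly used by computer vision models.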
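
To make the 'b (h w) d -> b d h w' pattern concrete: einops treats the flattened token axis as row-major, so token t sits at row t // w and column t % w. A minimal pure-Python sketch of that index arithmetic follows. `grid_size` and `token_to_grid` are hypothetical names used only for illustration, and the divisibility check reflects the assumption that the token count must equal h * w for the rearrange to succeed.

```python
def grid_size(height, width, patch_size):
    """Number of patch rows and columns for an input of height x width."""
    if height % patch_size or width % patch_size:
        raise ValueError("input size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


def token_to_grid(token_index, grid_w):
    """Map a flattened token index to (row, col), row-major as in '(h w)'."""
    return divmod(token_index, grid_w)


h, w = grid_size(224, 448, 14)  # 16 rows, 32 columns
print(h * w)                    # 512 tokens in total
print(token_to_grid(0, w))      # (0, 0): top-left patch
print(token_to_grid(33, w))     # (1, 1): second row, second column
```

The same arithmetic explains why the patch size has to match the checkpoint in use: a wrong value changes h and w, and the flattened axis can no longer be split evenly.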

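📄 License

The RADIO code and weights are released under the NSCLv1 license.

📚 Citation

If you find this repository useful, please consider giving it a star and citing: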

@InProceedings{Ranzinger_2024_CVPR,
    author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
    title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12490-12500}
}

@misc{ranzinger2024phisdistributionbalancinglabelfree,
      title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation},
      author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
      year={2024},
      eprint={2410.01680},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01680}, 
}