🚀 Model Card: GroupViT
Inspired by CLIP, GroupViT is a vision-language model that can perform zero-shot semantic segmentation for any given vocabulary of categories.
🚀 Quick Start
This checkpoint was uploaded by Jiarui Xu.
✨ Features
The GroupViT model was proposed in GroupViT: Semantic Segmentation Emerges from Text Supervision by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Inspired by CLIP, it is a vision-language model capable of zero-shot semantic segmentation for any given vocabulary of categories.
Model Date
June 2022
Abstract
Grouping and recognition are crucial for visual scene understanding, e.g., in object detection and semantic segmentation. In end-to-end deep learning systems, grouping of image regions often occurs implicitly through top-down supervision from pixel-level recognition labels. This paper proposes to re-introduce the grouping mechanism into deep networks, enabling semantic segments to emerge automatically with only text supervision. A hierarchical Grouping Vision Transformer (GroupViT) is proposed, which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. GroupViT is trained jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and no pixel-level annotations, GroupViT learns to group together semantic regions and transfers to semantic segmentation in a zero-shot manner. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 dataset and 22.4% mIoU on the PASCAL Context dataset, performing competitively with state-of-the-art transfer-learning methods that require greater levels of supervision.
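To make the training objective concrete, the snippet below is a minimal sketch (not the authors' code) of a CLIP-style symmetric image-text contrastive loss of the kind used to train GroupViT jointly with its text encoder. The embedding tensors and the temperature value are illustrative placeholders.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarity between every image and every caption in the batch.
    logits = image_embeds @ text_embeds.t() / temperature
    # Matched image-caption pairs lie on the diagonal of the logits matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: align images to captions and captions to images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2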
💻 Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import AutoProcessor, GroupViTModel

# Load the pretrained GroupViT checkpoint and its matching processor.
model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")

# Download an example image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the text prompts and the image as model inputs.
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# CLIP-style image-text similarity scores.
logits_per_image = outputs.logits_per_image
# Probabilities of the image matching each text prompt.
probs = logits_per_image.softmax(dim=1)
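Here, probs holds the image-text matching probabilities over the two prompts, mirroring the zero-shot classification usage of CLIP.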
Advanced Usage
For more code examples, we refer to the documentation.
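The checkpoint can also produce the zero-shot segmentation described in the paper. The snippet below is a minimal sketch that builds on the model, processor, and inputs from the basic example, assuming the output_segmentation flag and segmentation_logits field exposed by the transformers implementation of GroupViT; upsampling to the original image resolution and visualization are left out.

import torch

# Reuse model, processor, and inputs from the basic example above.
with torch.no_grad():
    outputs = model(**inputs, output_segmentation=True)

# segmentation_logits has shape (batch_size, num_text_prompts, height, width).
segmentation_logits = outputs.segmentation_logits
# Assign each pixel to the best-matching text prompt.
segmentation_map = segmentation_logits.argmax(dim=1)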
🔧 Technical Details
Data
The model was trained on publicly available image-caption data, collected by crawling a number of websites and combining pre-existing image datasets such as YFCC100M. Because a large part of the data comes from internet crawling, it is more representative of people and societies most connected to the internet, which skews towards more developed nations and younger, male users.
BibTeX entry and citation info
@article{xu2022groupvit,
  author  = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},
  title   = {GroupViT: Semantic Segmentation Emerges from Text Supervision},
  journal = {arXiv preprint arXiv:2202.11094},
  year    = {2022},
}