🚀 Model Card: GroupViT
Inspired by CLIP, GroupViT is a vision-language model that can perform zero-shot semantic segmentation for any given vocabulary of categories.
🚀 Quick Start
This checkpoint was uploaded by Jiarui Xu.
✨ Features
The GroupViT model was proposed in GroupViT: Semantic Segmentation Emerges from Text Supervision by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Inspired by CLIP, it is a vision-language model capable of zero-shot semantic segmentation for any given vocabulary of categories.
Model Date
June 2022
Abstract
Grouping and recognition are crucial for visual scene understanding, e.g., in object detection and semantic segmentation. In end-to-end deep learning systems, grouping of image regions often occurs implicitly through top-down supervision from pixel-level recognition labels. This paper proposes to re-introduce the grouping mechanism into deep networks, enabling semantic segments to emerge automatically with only text supervision. A hierarchical Grouping Vision Transformer (GroupViT) is proposed, which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. GroupViT is trained jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and no pixel-level annotations, GroupViT learns to group together semantic regions and transfers to semantic segmentation in a zero-shot manner. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 dataset and 22.4% mIoU on the PASCAL Context dataset, performing competitively with state-of-the-art transfer-learning methods that require greater levels of supervision.
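To make the training objective concrete, the snippet below is a minimal sketch (not the authors' code) of a CLIP-style symmetric image-text contrastive loss of the kind used to train GroupViT jointly with its text encoder. The embedding tensors and the temperature value are illustrative placeholders.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarity between every image and every caption in the batch.
    logits = image_embeds @ text_embeds.t() / temperature
    # Matched image-caption pairs lie on the diagonal of the logits matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: align images to captions and captions to images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2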
💻 Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import AutoProcessor, GroupViTModel

# Load the pretrained GroupViT checkpoint and its matching processor.
model = GroupViTModel.from_pretrained("nvidia/groupvit-gcc-yfcc")
processor = AutoProcessor.from_pretrained("nvidia/groupvit-gcc-yfcc")

# Download an example image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the text prompts and the image as model inputs.
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# CLIP-style image-text similarity scores.
logits_per_image = outputs.logits_per_image
# Probabilities of the image matching each text prompt.
probs = logits_per_image.softmax(dim=1)
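Here, probs holds the image-text matching probabilities over the two prompts, mirroring the zero-shot classification usage of CLIP.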
Advanced Usage
For more code examples, we refer to the documentation.
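The checkpoint can also produce the zero-shot segmentation described in the paper. The snippet below is a minimal sketch that builds on the model, processor, and inputs from the basic example, assuming the output_segmentation flag and segmentation_logits field exposed by the transformers implementation of GroupViT; upsampling to the original image resolution and visualization are left out.

import torch

# Reuse model, processor, and inputs from the basic example above.
with torch.no_grad():
    outputs = model(**inputs, output_segmentation=True)

# segmentation_logits has shape (batch_size, num_text_prompts, height, width).
segmentation_logits = outputs.segmentation_logits
# Assign each pixel to the best-matching text prompt.
segmentation_map = segmentation_logits.argmax(dim=1)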
🔧 Technical Details
Data
The model was trained on publicly available image-caption data, collected by crawling a number of websites and combining pre-existing image datasets such as YFCC100M. Because a large part of the data comes from internet crawling, it is more representative of people and societies most connected to the internet, which skews towards more developed nations and younger, male users.
BibTeX entry and citation info
@article{xu2022groupvit,
  author  = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},
  title   = {GroupViT: Semantic Segmentation Emerges from Text Supervision},
  journal = {arXiv preprint arXiv:2202.11094},
  year    = {2022},
}