
GroupViT GCC-YFCC

Developed by NVIDIA
GroupViT is a vision-language model that performs zero-shot semantic segmentation for any given vocabulary of categories.
Downloads: 3,473
Released: June 21, 2022

Model Overview

Inspired by CLIP, GroupViT is a vision-language model that learns to group image regions into semantic segments from text supervision alone, enabling zero-shot transfer to semantic segmentation without any pixel-level annotations.
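At inference time, the zero-shot transfer reduces to a CLIP-style decision rule: each discovered image group is assigned to the vocabulary class whose text embedding it matches best. The sketch below illustrates that assignment step with toy embeddings; the function name `assign_labels` and the tiny vectors are hypothetical, not the model's actual API (the pretrained checkpoint itself is distributed as `nvidia/groupvit-gcc-yfcc`).

```python
import numpy as np

def assign_labels(group_embs: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Assign each image group to the vocabulary class whose text embedding
    is most similar under cosine similarity (CLIP-style zero-shot transfer).
    Illustrative sketch only, not the model's real API."""
    g = group_embs / np.linalg.norm(group_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = g @ t.T                    # (num_groups, num_classes) similarity
    return sim.argmax(axis=1)        # one class index per image group

# Toy example: 3 image groups, 2 vocabulary classes.
groups = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = assign_labels(groups, texts)  # → [0, 0, 1]
```

Because the class set enters only through the text embeddings, the same weights segment any vocabulary supplied at test time.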

Model Features

Zero-shot semantic segmentation
Learns semantic segmentation through text supervision without requiring pixel-level annotations.
Hierarchical grouping mechanism
Progressively merges image regions into larger, arbitrarily shaped segments via a hierarchy of grouping stages in the vision transformer.
Text-supervised learning
Jointly trains the visual and text encoders with a contrastive loss on large-scale image-text datasets (here, GCC and YFCC).
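The text-supervised objective above is the symmetric contrastive (InfoNCE) loss used by CLIP-style training: matched image-text pairs should score higher than all mismatched pairs in the batch. A minimal numpy sketch, with a hypothetical `contrastive_loss` helper and toy embeddings standing in for real encoder outputs:

```python
import numpy as np

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.
    Minimal sketch of the CLIP-style objective, not the training code."""
    i = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    t = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = i @ t.T / temperature      # (batch, batch) scaled similarities

    def xent(l):
        # Cross-entropy where the matched pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image→text and text→image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Aligned pairs yield a lower loss than shuffled (mismatched) pairs.
aligned = np.eye(4)
loss_good = contrastive_loss(aligned, aligned)
loss_bad = contrastive_loss(aligned, np.roll(aligned, 1, axis=0))
```

Minimizing this loss pulls each image's grouped representation toward its paired caption, which is the only supervision the segmentation ability is learned from.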

Model Capabilities

Image semantic segmentation
Zero-shot transfer learning
Vision-language understanding

Use Cases

Computer Vision
Semantic segmentation
Performs zero-shot semantic segmentation of objects in images.
Achieves 52.3% mIoU on PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context
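The reported numbers use mean intersection-over-union (mIoU), the standard semantic-segmentation metric: per-class overlap between predicted and ground-truth masks, averaged over classes. A small sketch of the computation (the `mean_iou` helper and toy label maps are illustrative, not the official PASCAL evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target.
    Illustrative sketch of the metric, not the official benchmark script."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with two classes.
pred = np.array([[0, 0, 1], [1, 1, 1]])
target = np.array([[0, 0, 0], [1, 1, 1]])
score = mean_iou(pred, target, num_classes=2)  # (2/3 + 3/4) / 2 ≈ 0.708
```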