
OWL-ViT Large Patch14

Developed by Google
OWL-ViT is a zero-shot, text-conditioned object detection model that can detect objects in images given free-text queries.
Downloads: 25.01k
Release Time: 7/5/2022

Model Overview

OWL-ViT uses CLIP as a multimodal backbone network, combining vision transformers and text encoders to achieve open-vocabulary object detection.
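As a rough sketch of how such zero-shot detection can be run in practice, assuming the Hugging Face `transformers` implementation (`OwlViTProcessor` and `OwlViTForObjectDetection`) and the `google/owlvit-large-patch14` checkpoint; the file name and labels below are illustrative only:

```python
# Hedged sketch: zero-shot detection with the Hugging Face checkpoint
# "google/owlvit-large-patch14". Assumes `transformers`, `torch`, and
# `Pillow` are installed; imports are deferred so the helper below
# works without the heavy dependencies.

def build_queries(labels):
    """OWL-ViT expects one list of text queries per image."""
    return [[f"a photo of a {label}" for label in labels]]

def detect(image_path, labels, threshold=0.1):
    import torch
    from PIL import Image
    from transformers import OwlViTProcessor, OwlViTForObjectDetection

    processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_queries(labels), images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Rescale predicted boxes to the original image size (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [
        (labels[label], score.item(), box.tolist())
        for score, label, box in zip(
            results["scores"], results["labels"], results["boxes"]
        )
    ]

if __name__ == "__main__":
    # Hypothetical input image and query labels.
    for name, score, box in detect("cats.jpg", ["cat", "remote control"]):
        print(f"{name}: {score:.2f} at {box}")
```

Note that no category-specific fine-tuning is involved: changing the `labels` list is all that is needed to search for different objects.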

Model Features

Zero-shot Detection Capability
Detects new objects without category-specific training, requiring only text descriptions to perform detection tasks.
Multimodal Architecture
Combines vision transformers and text encoders for joint understanding of images and text.
Open-vocabulary Classification
Supports recognition of arbitrary text-described categories by dynamically replacing classification layer weights.
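The "dynamically replacing classification layer weights" idea can be illustrated with a toy NumPy sketch (random stand-in embeddings, not the actual model weights): per-box class logits are cosine similarities between image embeddings and text embeddings, so the text embeddings act as the classifier weights, and a new category only requires embedding a new query string.

```python
import numpy as np

def normalize(x):
    # L2-normalize embeddings along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify_boxes(box_embeds, text_embeds):
    """Per-box class logits = cosine similarity with the text embeddings.
    Swapping `text_embeds` swaps the effective classification layer."""
    return normalize(box_embeds) @ normalize(text_embeds).T

# Toy stand-ins for what the real encoders would produce.
rng = np.random.default_rng(0)
box_embeds = rng.normal(size=(3, 8))   # 3 predicted boxes, 8-dim embeddings
text_embeds = rng.normal(size=(2, 8))  # embeddings for 2 text queries

logits = classify_boxes(box_embeds, text_embeds)
print(logits.shape)  # (3, 2): one score per box per text query
```

Adding a third category here is just appending one more row to `text_embeds`; no retraining of the detector is required.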

Model Capabilities

Text-conditioned Object Detection
Open-vocabulary Object Recognition
Multimodal Image Understanding

Use Cases

Computer Vision Research
Zero-shot Object Detection Research
Explore the model's detection capability on unseen categories.
Interdisciplinary Applications
Special Object Recognition
Identify rare objects, such as those found in healthcare or industrial settings, that are uncommon in standard detection training data.