Open-source OWL-ViT model - Supports zero-shot detection and enables easy recognition of image objects through text queries

OWL-ViT Base Patch16

Developed by Google

OWL-ViT is a zero-shot text-conditioned object detection model that can detect objects in images via text queries.

Zero-Shot Object Detection

Transformers

Open Source License: Apache-2.0 #Zero-shot Object Detection #Open-vocabulary Recognition #Multimodal Vision Model

Downloads 4,588

Release Time: 7/5/2022

Model Overview

OWL-ViT is a zero-shot text-conditioned object detection model based on a CLIP backbone, capable of detecting objects in images using one or more text queries without requiring training on specific categories.

Model Features

Zero-shot Detection Capability

Can detect new objects via text queries without training on specific categories

Multi-text Query Support

Supports detecting different objects in an image simultaneously using one or more text queries

Open-vocabulary Classification

Achieves open-vocabulary classification by replacing fixed classification layer weights with text embeddings

Model Capabilities

Zero-shot text-conditioned object detection

Image object localization

Multi-category simultaneous detection

Use Cases

Computer Vision Research

Zero-shot Object Detection Research

Used to study the model's detection capability on unseen categories

Interdisciplinary Applications

Special Object Recognition

Applied in domains that require recognizing objects whose labels were not available during training

🚀 Model Card: OWL-ViT

OWL-ViT is a zero-shot text-conditioned object detection model that can query an image with one or multiple text queries.

🚀 Quick Start

The following example loads OWL-ViT with the Transformers library and detects objects in an image from text queries:

import requests
from PIL import Image
import torch

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Target image sizes (height, width) used to rescale box predictions, shape [batch_size, 2]
target_sizes = torch.tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to per-image boxes, scores, and labels;
# boxes are returned as (x_min, y_min, x_max, y_max) in pixel coordinates
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
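
For intuition about the rescaling step, here is a minimal sketch in plain Python of converting one normalized box to pixel coordinates. It assumes the raw prediction is a normalized (center_x, center_y, width, height) box and that the target format is (x_min, y_min, x_max, y_max). The helper `to_pixel_corners` and the image size are hypothetical and are not part of the Transformers API.

```python
def to_pixel_corners(box, image_height, image_width):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return [round(v, 2) for v in (x_min, y_min, x_max, y_max)]


# PIL reports size as (width, height), which is why the example above
# reverses it with [::-1] to obtain (height, width).
pixel_box = to_pixel_corners([0.5, 0.5, 0.2, 0.4], image_height=480, image_width=640)
print(pixel_box)
```

Horizontal coordinates scale with the image width and vertical ones with the height, so passing the two sizes in the wrong order stretches every box.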

✨ Features

Zero-shot Detection: OWL-ViT can perform zero-shot text-conditioned object detection, allowing you to query an image with one or multiple text queries.
Multi-modal Backbone: It uses CLIP as its multi-modal backbone, with a ViT-like Transformer for visual features and a causal language model for text features.
Open-vocabulary Classification: Enables open-vocabulary classification by replacing fixed classification layer weights with class-name embeddings from the text model.
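
The open-vocabulary classification idea above can be illustrated with a toy sketch in plain Python. It assumes each image token and each text query has already been embedded into the same vector space. The vectors, the helper names (`l2_normalize`, `classify_tokens`), and the use of cosine similarity with an argmax are illustrative assumptions, not the model's actual implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_tokens(token_embeddings, query_embeddings):
    """Score every image token against every text query.

    The query embeddings play the role of classification-layer weights, so
    adding a category only requires embedding one more text query.
    Returns one (best_query_index, similarity) pair per token.
    """
    queries = [l2_normalize(q) for q in query_embeddings]
    results = []
    for token in token_embeddings:
        token = l2_normalize(token)
        sims = [sum(t * q for t, q in zip(token, query)) for query in queries]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results


# Hypothetical 3-dimensional embeddings, for illustration only.
queries = [[1.0, 0.0, 0.0],   # stands in for "a photo of a cat"
           [0.0, 1.0, 0.0]]   # stands in for "a photo of a dog"
tokens = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

predictions = classify_tokens(tokens, queries)
for index, (query_index, similarity) in enumerate(predictions):
    print(f"token {index} -> query {query_index} (similarity {similarity:.3f})")
```

Because the "weights" are just text embeddings, the set of detectable categories is chosen at query time rather than fixed at training time.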

📚 Documentation

OWL-ViT Paper: Simple Open-Vocabulary Object Detection with Vision Transformers (arXiv:2205.06230)

🔧 Technical Details

Model Details

OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in "Simple Open-Vocabulary Object Detection with Vision Transformers" by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.

OWL-ViT uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to extract text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch, then fine-tune it end-to-end together with the classification and box heads on standard detection datasets using a bipartite matching loss. At inference time, one or more text queries per image can be used to perform zero-shot text-conditioned object detection.
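
The bipartite matching step mentioned above pairs each ground-truth box with exactly one prediction so that the total matching cost is minimal. The sketch below is a brute-force illustration in plain Python, assuming a simple L1 box distance as the cost. Practical implementations typically use the Hungarian algorithm and a cost that also includes class scores; the function names and box values here are hypothetical.

```python
from itertools import permutations


def l1_cost(pred_box, gt_box):
    """L1 distance between two boxes given as 4-element coordinate lists."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_predictions(pred_boxes, gt_boxes):
    """Return (assignment, cost), where assignment[j] is the index of the
    prediction matched to ground-truth box j and the total cost is minimal.

    Brute force over permutations: fine for a toy example, far too slow for
    real training, where the Hungarian algorithm is the usual choice.
    """
    best_assignment, best_cost = None, float("inf")
    for candidate in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[i], gt_boxes[j]) for j, i in enumerate(candidate))
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


# Three predictions and two ground-truth boxes (made-up normalized coordinates).
preds = [[0.1, 0.1, 0.2, 0.2], [0.5, 0.5, 0.3, 0.3], [0.8, 0.8, 0.1, 0.1]]
targets = [[0.5, 0.5, 0.3, 0.3], [0.1, 0.1, 0.2, 0.2]]

assignment, total_cost = match_predictions(preds, targets)
print(assignment, total_cost)
```

Each ground-truth box claims a distinct prediction, so the loss is computed over one-to-one pairs and duplicate detections of the same object are discouraged.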

Model Date

May 2022

Model Type

Property	Details
Model Type	The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.
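
The contrastive objective described above can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This plain-Python version is an illustration under stated assumptions: the similarity values are made up, no temperature scaling is applied, and `contrastive_loss` is a hypothetical helper rather than code from CLIP or OWL-ViT.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(similarity):
    """Symmetric loss over a square matrix: similarity[i][j] scores image i vs text j.

    Matching pairs lie on the diagonal. The loss is low when each image is
    most similar to its own caption and each caption to its own image.
    """
    transposed = [list(col) for col in zip(*similarity)]
    image_to_text = cross_entropy_rows(similarity)
    text_to_image = cross_entropy_rows(transposed)
    return (image_to_text + text_to_image) / 2


aligned = [[5.0, 0.0], [0.0, 5.0]]      # matching pairs score highest
misaligned = [[0.0, 5.0], [5.0, 0.0]]   # matching pairs score lowest

loss_aligned = contrastive_loss(aligned)
loss_misaligned = contrastive_loss(misaligned)
print(f"aligned: {loss_aligned:.4f}, misaligned: {loss_misaligned:.4f}")
```

Minimizing this loss pushes matching (image, text) pairs together and non-matching pairs apart, which is what makes text embeddings usable as classifier weights later on.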

📦 Data

The CLIP backbone of the model was trained on publicly available image-caption data, gathered by crawling a handful of websites and by using common pre-existing image datasets such as YFCC100M. A large portion of the data comes from crawling the internet, so it is more representative of the people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

BibTeX entry and citation info

@article{minderer2022simple,
  title={Simple Open-Vocabulary Object Detection with Vision Transformers},
  author={Matthias Minderer and Alexey Gritsenko and Austin Stone and Maxim Neumann and Dirk Weissenborn and Alexey Dosovitskiy and Aravindh Mahendran and Anurag Arnab and Mostafa Dehghani and Zhuoran Shen and Xiao Wang and Xiaohua Zhai and Thomas Kipf and Neil Houlsby},
  journal={arXiv preprint arXiv:2205.06230},
  year={2022},
}

📄 License

This model is licensed under the Apache-2.0 license.
