OWL-ViT (owlvit-base-patch32) Open-Source Model: Zero-Shot Object Detection, Search for Objects in Images by Text

Owlvit Base Patch32

Developed by Google

OWL-ViT is a zero-shot text-conditioned object detection model that can search for objects in images via text queries without requiring category-specific training data.

Zero-Shot Object Detection

Transformers

Open Source License: Apache-2.0 #Zero-shot object detection #Open-vocabulary recognition #Multimodal vision model

Downloads 764.95k

Release Time : 7/5/2022

Model Overview

OWL-ViT employs CLIP as a multimodal backbone network, combining ViT-style Transformers with lightweight prediction heads to achieve open-vocabulary object detection. It can directly detect objects in images through text descriptions, supporting zero-shot transfer.

Model Features

Zero-shot detection capability

Detects objects of novel categories directly from text descriptions, without requiring category-specific training data

Open-vocabulary support

Can handle category names that were not seen during training, enabling open-world object detection

Multimodal architecture

Combines visual Transformers and text Transformers for joint understanding of images and text

Model Capabilities

Zero-shot object detection

Text-conditioned image search

Open-vocabulary recognition

Multimodal understanding

Use Cases

Computer vision research

Zero-shot object detection research

Investigates the model's generalization ability on unseen categories

Practical applications

Image content retrieval

Search for specific objects in images using natural language descriptions

Intelligent surveillance

Detect specific targets in surveillance footage using natural language queries
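As a rough illustration of the retrieval use case, the sketch below ranks a set of images by the best detection score for one text query. The function name `rank_images_by_query`, the image identifiers, and the scores are hypothetical; the per-image dictionaries only mimic the "scores" and "labels" keys seen in the usage example, with plain Python lists standing in for tensors.

```python
def rank_images_by_query(detections_per_image, query_index):
    """Rank images by their highest detection score for one text query.

    detections_per_image maps an image id to a dict with parallel lists
    "scores" and "labels", where each label is an index into the query list.
    Images with no detection for the query are left out of the ranking.
    """
    ranked = []
    for image_id, detections in detections_per_image.items():
        matching = [
            score
            for score, label in zip(detections["scores"], detections["labels"])
            if label == query_index
        ]
        if matching:
            ranked.append((image_id, max(matching)))
    # Highest score first
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical detections for the queries ["a photo of a cat", "a photo of a dog"]
detections_per_image = {
    "street.jpg": {"scores": [0.35, 0.60], "labels": [0, 1]},
    "beach.jpg": {"scores": [0.82], "labels": [0]},
    "office.jpg": {"scores": [0.40], "labels": [1]},
}

for image_id, score in rank_images_by_query(detections_per_image, query_index=0):
    print(image_id, score)
```

Taking the maximum score per image is one simple choice; a real system might instead count detections or combine scores across several queries.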

🚀 Model Card: OWL-ViT

OWL-ViT is a zero-shot text-conditioned object detection model. An image can be queried with one or more text queries, which makes the model useful for research on open-vocabulary object detection.

✨ Features

Zero-shot Detection: Capable of detecting objects in an image using text queries without prior training on specific object classes.
Multi-modal Backbone: Utilizes CLIP, combining a ViT-like Transformer for visual features and a causal language model for text features.
Open-vocabulary Classification: Enables open-vocabulary classification of objects by using class-name embeddings from the text model.
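The open-vocabulary classification idea can be sketched in a few lines: each predicted object has an embedding, each text query has an embedding, and the query whose embedding is most similar to the object's wins. The snippet below is a simplified illustration in plain Python, not the model's implementation. The helper names and toy 3-dimensional embeddings are hypothetical, and any learned adjustment of the logits in the real model is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def query_scores(object_embedding, query_embeddings):
    """Return one sigmoid score per text query for a single predicted object."""
    obj = l2_normalize(object_embedding)
    scores = []
    for query in query_embeddings:
        q = l2_normalize(query)
        similarity = sum(a * b for a, b in zip(obj, q))
        # Sigmoid maps the similarity to (0, 1) independently for each query
        scores.append(1.0 / (1.0 + math.exp(-similarity)))
    return scores


# Toy embeddings (assumed values for illustration only)
queries = ["a photo of a cat", "a photo of a dog"]
query_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
object_embedding = [0.9, 0.1, 0.0]

scores = query_scores(object_embedding, query_embeddings)
best_index = max(range(len(scores)), key=lambda k: scores[k])
print(queries[best_index])
```

Because the class names enter only through their embeddings, new categories can be added at inference time simply by embedding new text queries.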

💻 Usage Examples

Basic Usage

import requests
import torch
from PIL import Image

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# One list of text queries per image in the batch
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Target image sizes (height, width) used to rescale box predictions; shape [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert raw outputs (bounding boxes and class logits) into thresholded detections
# with boxes rescaled to the target image size
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image and its text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
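The rescaling step in the example can be pictured as two small operations on each box. The sketch below assumes the model predicts boxes in a normalized center format (center x, center y, width, height, all in the range 0 to 1) and that post-processing turns them into absolute corner coordinates; this is an assumption about the internals, and the function names are hypothetical rather than part of the library.

```python
def center_to_corners(box):
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def rescale_box(box, image_height, image_width):
    """Scale a normalized corner-format box to absolute pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return (
        x_min * image_width,
        y_min * image_height,
        x_max * image_width,
        y_max * image_height,
    )


# A box centered in a 640x480 image, covering 20% of the width and 40% of the height
normalized_box = (0.5, 0.5, 0.2, 0.4)
corner_box = center_to_corners(normalized_box)
pixel_box = rescale_box(corner_box, image_height=480, image_width=640)
print([round(value, 2) for value in pixel_box])
```

Note the (height, width) order of `target_sizes` in the example above, which is the reverse of the (width, height) order returned by `image.size`.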

📚 Documentation

OWL-ViT Paper

🔧 Technical Details

Model Details

OWL-ViT (Vision Transformer for Open-World Localization) was proposed in "Simple Open-Vocabulary Object Detection with Vision Transformers" by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.

Model Date

May 2022

Model Type

Property	Details
Model Type	The model uses a CLIP backbone with a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.
Training Data	The CLIP backbone of the model was trained on publicly available image-caption data, gathered through a combination of crawling websites and using pre-existing image datasets such as YFCC100M. A large portion of the data comes from internet crawling. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
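To make the contrastive objective concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of matched (image, text) pairs. It assumes the common CLIP-style formulation, in which entry [i][j] of a similarity matrix compares image i with text j and the matched pairs sit on the diagonal; the function names are hypothetical, and details such as a learned temperature are left out.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(similarity):
    """Average of image-to-text and text-to-image cross-entropy losses.

    similarity is a square list of lists; the correct text for image i is text i.
    """
    n = len(similarity)
    # Each image should pick out its own text among all texts (rows)
    image_to_text = sum(cross_entropy(similarity[i], i) for i in range(n)) / n
    # Each text should pick out its own image among all images (columns)
    columns = [[similarity[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


well_aligned = [[10.0, 0.0], [0.0, 10.0]]  # matched pairs score much higher
uninformative = [[0.0, 0.0], [0.0, 0.0]]   # every pair looks the same

print(symmetric_contrastive_loss(well_aligned))
print(symmetric_contrastive_loss(uninformative))
```

The loss is close to zero when matched pairs dominate their row and column, and equals the logarithm of the batch size when the similarities carry no information.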

Model Use

Intended Use

The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training.

Primary intended uses

The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

BibTeX entry and citation info

@article{minderer2022simple,
  title={Simple Open-Vocabulary Object Detection with Vision Transformers},
  author={Matthias Minderer and Alexey Gritsenko and Austin Stone and Maxim Neumann and Dirk Weissenborn and Alexey Dosovitskiy and Aravindh Mahendran and Anurag Arnab and Mostafa Dehghani and Zhuoran Shen and Xiao Wang and Xiaohua Zhai and Thomas Kipf and Neil Houlsby},
  journal={arXiv preprint arXiv:2205.06230},
  year={2022},
}

📄 License

The model is licensed under the Apache-2.0 license.
