OWLv2-base-patch16 Open-source Model - Searching for Image Objects via Text Queries under Zero-shot Conditions

Owlv2 Base Patch16

Developed by google

OWLv2 is a zero-shot text-conditioned object detection model that can retrieve objects in images through text queries.

Text-to-Image

Transformers

Open Source License: Apache-2.0 #Zero-shot object detection #Open-vocabulary localization #CLIP backbone network

Downloads 15.42k

Release Time : 10/13/2023

Model Overview

OWLv2 is an open-world localization model based on the CLIP backbone network, supporting zero-shot object detection via text queries.

Model Features

Zero-shot detection

Detects new objects through text queries without category-specific training

Open-vocabulary classification

Enables detection of arbitrary text categories by replacing classification layer weights

Multi-query support

Supports simultaneous search for objects matching multiple text descriptions in a single image

Model Capabilities

Image object detection

Text-conditioned search

Open-vocabulary recognition

Use Cases

Computer vision research

Zero-shot detection research

Exploring the model's recognition capability for unseen categories

Interdisciplinary applications

Specialized domain object recognition

Performing object detection in domains lacking annotated data (e.g., medical images)

🚀 Model Card: OWLv2

OWLv2 is a zero-shot, text-conditioned object detection model. It lets users query an image with one or more text queries, which makes it a useful tool for open-world object detection.

🚀 Quick Start

Use with Transformers

import requests
from PIL import Image
import numpy as np
import torch
from transformers import AutoProcessor, Owlv2ForObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Note: boxes need to be visualized on the padded, unnormalized image
# hence we'll set the target image sizes (height, width) based on that

def get_preprocessed_image(pixel_values):
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    unnormalized_image = Image.fromarray(unnormalized_image)
    return unnormalized_image

unnormalized_image = get_preprocessed_image(inputs.pixel_values)

target_sizes = torch.Tensor([unnormalized_image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to final bounding boxes and scores
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]  # avoid reusing the name `i` from above
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
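
The note in the code above says the boxes are relative to the padded image. If you would rather draw them on the original image, one possible approach is sketched below. It rests on an assumption this card does not state: that the processor pads the image at the bottom and right to a square before resizing, so the side of the square corresponds to the longer side of the original image. `rescale_box_to_original` and its parameters are hypothetical names, not part of the Transformers API.

```python
def rescale_box_to_original(box, padded_size, original_size):
    """Map an (x_min, y_min, x_max, y_max) box from the padded square image
    to pixel coordinates of the original image.

    Assumption: the original image was padded at the bottom/right to a square
    and then resized to padded_size x padded_size, so a single scale factor
    applies to both axes.
    """
    orig_width, orig_height = original_size
    longer_side = max(orig_width, orig_height)
    # Multiply before dividing to keep the arithmetic exact where possible.
    x_min, y_min, x_max, y_max = (coord * longer_side / padded_size for coord in box)
    # Clip to the original image, since a box may extend into the padded area.
    x_min = min(max(x_min, 0.0), orig_width)
    x_max = min(max(x_max, 0.0), orig_width)
    y_min = min(max(y_min, 0.0), orig_height)
    y_max = min(max(y_max, 0.0), orig_height)
    return (x_min, y_min, x_max, y_max)


# Toy example: a 640x480 (width x height) image padded and resized to 960x960.
example_box = rescale_box_to_original((480, 240, 960, 960), 960, (640, 480))
```

With the Quick Start variables, `padded_size` would be `unnormalized_image.size[0]` and `original_size` would be `image.size` (PIL reports size as width, height).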

✨ Features

The OWLv2 model (short for Open-World Localization) was proposed in "Scaling Open-Vocabulary Object Detection" by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Like OWL-ViT, it is a zero-shot, text-conditioned object detection model.

The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to extract visual features and a causal language model to obtain text features. To adapt CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification head and box head to each transformer output token. Open-vocabulary classification is achieved by replacing the fixed classification layer weights with the class-name embeddings from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. The resulting model can perform zero-shot, text-conditioned object detection with one or multiple text queries per image.
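
The open-vocabulary classification step described above can be illustrated with a minimal sketch. It is a simplification, not the model's actual head: it scores each image token against each text query by cosine similarity and picks the best query, and it omits whatever learned projections and logit scaling the real classification head applies. The function names and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def classify_tokens(token_embeddings, query_embeddings):
    """For each image token, return (best_query_index, cosine_similarity).

    The text query embeddings play the role of the classification layer
    weights, so changing the queries changes the set of detectable classes.
    """
    queries = [l2_normalize(q) for q in query_embeddings]
    results = []
    for token in token_embeddings:
        token = l2_normalize(token)
        sims = [sum(t * q for t, q in zip(token, query)) for query in queries]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results


# Toy data: three image tokens and two text queries (say "cat" and "dog").
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
queries = [[3.0, 0.0], [0.0, 1.0]]
predictions = classify_tokens(tokens, queries)
```

Because the queries are only inputs to the similarity computation, adding a new class name needs no retraining in this sketch, which is the property the paragraph above describes.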

📚 Documentation

Model Date

June 2023

Model Type

Property	Details
Model Type	The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.
Training Data	The CLIP backbone of the model was trained on publicly available image-caption data, gathered by crawling a handful of websites and by using pre-existing image datasets such as YFCC100M. A large portion of the data comes from internet crawling. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
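
The contrastive objective mentioned under Model Type can be sketched as follows. This is a generic symmetric cross-entropy over an image-text similarity matrix, assumed here to resemble the CLIP-style loss; the card does not give the exact formulation, and the `temperature` value is purely illustrative.

```python
import math


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over a square image-text similarity matrix.

    similarity[i][j] is the similarity between image i and text j.
    Matching (image, text) pairs are assumed to lie on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal entry in each row.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_sum_exp - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs are most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matching pairs are least similar
uniform = [[0.5, 0.5], [0.5, 0.5]]      # no preference either way
```

In this sketch the loss is low when each image is most similar to its own caption and high when it is most similar to another one, which is what "maximize the similarity of (image, text) pairs" amounts to.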

Documents

OWLv2 Paper

Model Use

Intended Use

The model is intended as a research output for research communities. We hope it will enable researchers to better understand and explore zero-shot, text-conditioned object detection. It can also be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose labels are unavailable during training.

Primary intended uses

The primary intended users of these models are AI researchers. We primarily imagine researchers using the model to better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

BibTeX entry and citation info

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection}, 
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

This model is licensed under the Apache-2.0 license.
