Open-source owlv2-base-patch16 model - Zero-shot and free implementation of image object detection and localization

Owlv2 Base Patch16

Developed by vvmnnnkv

OWLv2 is a zero-shot text-conditioned object detection model that can detect and locate objects in images through text queries.

Zero-Shot Object Detection

Transformers

Open Source License: Apache-2.0 #Zero-shot object detection #Open-vocabulary recognition #Text-conditioned detection

Downloads 26

Release Time: 10/27/2023

Model Overview

OWLv2 is an open-vocabulary object detection model based on the CLIP backbone network, capable of achieving zero-shot object detection via text queries without the need for training on specific categories.

Model Features

Zero-shot detection capability

No training on specific categories is required; objects from new categories can be detected directly through text queries.

Open-vocabulary recognition

Recognizes category names that were not seen during training, removing the fixed-category limitation of traditional detection models.

Multi-query support

Accepts multiple text queries for the same image at once, so several categories can be searched for in one call.
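
As a rough illustration of multi-query detection, the sketch below groups detections by the text query that produced them. The detections are made-up sample data and group_by_query is a hypothetical helper, not part of the transformers library; it only assumes that each detection carries a label index into the list of queries.

```python
from collections import defaultdict

def group_by_query(queries, detections):
    """Group (label_index, score, box) detections under their query text."""
    grouped = defaultdict(list)
    for label, score, box in detections:
        grouped[queries[label]].append((score, box))
    # Sort each query's detections by descending confidence
    return {q: sorted(hits, key=lambda h: h[0], reverse=True) for q, hits in grouped.items()}

queries = ["a photo of a cat", "a photo of a dog"]
# Made-up detections: (label index, confidence, [xmin, ymin, xmax, ymax])
detections = [
    (0, 0.61, [10.0, 20.0, 110.0, 220.0]),
    (1, 0.12, [5.0, 5.0, 50.0, 60.0]),
    (0, 0.73, [300.0, 40.0, 420.0, 200.0]),
]
grouped = group_by_query(queries, detections)
for query, hits in grouped.items():
    print(query, [score for score, _ in hits])
```

Because labels are plain indices into the query list, adding another query only lengthens that list; the grouping logic stays the same.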

Model Capabilities

Image object detection

Text-conditioned localization

Open-vocabulary recognition

Use Cases

Computer vision research

Zero-shot object detection research

Used to study the robustness, generalization capabilities, and other characteristics of computer vision models.

Practical applications

Scene object recognition

Quickly identify specific objects in unfamiliar environments such as airports or grasslands.

🚀 Model Card: OWLv2

The OWLv2 model is a zero-shot text-conditioned object detection model that can query an image with one or more text queries.

🚀 Quick Start

The OWLv2 model can be used with the transformers library. Here is a basic code example:

import requests
from PIL import Image
import torch

from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to absolute (xmin, ymin, xmax, ymax) boxes with scores and labels
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
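
To make the post-processing step less opaque, here is a minimal pure-Python sketch of what it is understood to do: pick the best-matching query per predicted box, turn its logit into a score with a sigmoid, drop low-scoring boxes, and convert normalized (center x, center y, width, height) boxes to absolute corner coordinates. The function names and toy values are hypothetical; the library's own implementation is tensor-based and may differ in details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def center_to_corners(box, height, width):
    """Convert a normalized (cx, cy, w, h) box to absolute (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    ]

def post_process(logits, boxes, target_size, threshold=0.1):
    """logits: per-box list of per-query logits; boxes: per-box normalized boxes."""
    height, width = target_size
    results = []
    for box_logits, box in zip(logits, boxes):
        # The best-matching query decides the label and the score
        label = max(range(len(box_logits)), key=lambda k: box_logits[k])
        score = sigmoid(box_logits[label])
        if score > threshold:
            results.append({"score": score, "label": label,
                            "box": center_to_corners(box, height, width)})
    return results

# Toy values (made up): two candidate boxes, two text queries
logits = [[2.0, -3.0], [-4.0, -5.0]]
boxes = [[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.1, 0.1]]
detections = post_process(logits, boxes, target_size=(480, 640))
print(detections)
```

In this toy run only the first box survives the threshold; the second box's best logit maps to a score well below 0.1.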

✨ Features

The OWLv2 model is a zero-shot text-conditioned object detection model, similar to OWL-ViT.
It uses CLIP as its multi-modal backbone, with a ViT-like Transformer for visual features and a causal language model for text features.
Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model.
It can perform zero-shot text-conditioned object detection with one or multiple text queries per image.
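
The idea of using class-name embeddings as classifier weights can be sketched with toy vectors: each predicted box gets an image embedding, each text query gets a text embedding, and the class logits are the similarities between them. The 3-dimensional embeddings below are made up and classify_boxes is a hypothetical helper; the actual model uses learned projections and scaling that are omitted here.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify_boxes(box_embeddings, query_embeddings):
    """Return, per box, the cosine similarity to every text query embedding."""
    queries = [normalize(q) for q in query_embeddings]
    logits = []
    for emb in box_embeddings:
        e = normalize(emb)
        logits.append([sum(a * b for a, b in zip(e, q)) for q in queries])
    return logits

# Toy embeddings (made up): two predicted boxes, two text queries
query_names = ["cat", "dog"]
query_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
box_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

logits = classify_boxes(box_embeddings, query_embeddings)
predicted = [query_names[row.index(max(row))] for row in logits]
print(predicted)  # ['cat', 'dog']
```

Supporting a new category in this scheme only requires one more query embedding; no classification layer has to be retrained.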

📦 Installation

The code example above uses the transformers library together with torch, Pillow, and requests. You can install them using the following command:

pip install transformers torch pillow requests
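
Besides transformers, the Quick Start example imports torch, PIL, and requests. The short check below reports which of these modules cannot be found in the current environment. missing_packages is a hypothetical helper written for this illustration, using only the standard library.

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Modules imported by the Quick Start example (PIL is provided by the Pillow package)
required = ["transformers", "torch", "PIL", "requests"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```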

📚 Documentation

OWLv2 paper: Scaling Open-Vocabulary Object Detection (arXiv:2306.09683)

🔧 Technical Details

Model Date

June 2023

Model Type

Property	Details
Model Type	The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.
Training Data	The CLIP backbone of the model was trained on publicly available image-caption data through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as YFCC100M. A large portion of the data comes from internet crawling. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages. (to be updated for v2)
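
The contrastive objective mentioned in the table can be illustrated with a small pure-Python sketch in the style of CLIP: a symmetric cross-entropy over a matrix of image-text similarities, where matching pairs lie on the diagonal. The temperature value of 0.07 and the toy similarity matrices are assumptions made for this example, not values taken from the model card.

```python
import math

def log_softmax(row):
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] is the similarity of image i and text j; matching
    pairs sit on the diagonal. The temperature is an assumed value.
    """
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    # Image-to-text direction: each row should pick its own column
    loss_i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should pick its own row
    columns = [list(col) for col in zip(*scaled)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = [[1.0, 0.0], [0.0, 1.0]]     # matching pairs are most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # matching pairs are least similar
print(contrastive_loss(aligned), contrastive_loss(misaligned))
```

The loss is close to zero when matching pairs are the most similar entries in their row and column, and grows when mismatched pairs score higher.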

BibTeX entry and citation info

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection}, 
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

This model is licensed under the Apache 2.0 license.

💻 Usage Examples

Basic Usage

The code example in the "Quick Start" section demonstrates the basic usage of the OWLv2 model with the transformers library.

Model Use

Intended Use

The model is intended as a research output for research communities.

Primary intended users: AI researchers.
Primary uses: Researchers can use the model to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. It can also be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training.
