Open-source OWLv2-base-patch16-ensemble Model - Zero-shot Image Object Localization, Effortlessly Locate Targets with Text Queries


Owlv2 Base Patch16 Ensemble

Developed by upfeatmediainc

OWLv2 is a zero-shot text-conditioned object detection model that can locate objects in images through text queries.

Object Detection

Transformers

Open Source License: Apache-2.0 #Zero-shot Object Detection #Open-vocabulary Recognition #Multimodal Vision Model

Downloads 15

Release Time: 11/10/2023

Model Overview

OWLv2 is an open-world localization model built on a CLIP backbone. It supports zero-shot object detection driven by text queries.

Model Features

Zero-shot Detection

Detects objects from categories it was never explicitly trained on, using only text queries and no class-specific training.

Open Vocabulary

Supports object detection with arbitrary text descriptions, unrestricted by predefined categories.

Multi-query Detection

A single image can be queried with multiple text prompts at once, and each detection is attributed to the prompt it matches.

Model Capabilities

Text-conditioned Object Detection

Open-vocabulary Recognition

Multi-object Localization

Use Cases

Computer Vision Research

Zero-shot Object Detection Research

Used to explore the model's detection capability on unseen categories.

Interdisciplinary Applications

Specialized Domain Object Recognition

Identifies specific objects in domains lacking annotated data (e.g., medical, agricultural).

🚀 Model Card: OWLv2

OWLv2 is a zero-shot, text-conditioned object detection model. An image can be queried with one or multiple text queries, which makes the model useful for research in open-vocabulary object detection.

🚀 Quick Start

The OWLv2 model can be easily used with the Transformers library. Here is a code example demonstrating how to perform object detection using the model:

import requests
from PIL import Image
import torch

from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.tensor([image.size[::-1]])
# Convert raw outputs (boxes and class logits) to thresholded detections
# with boxes given as (x_min, y_min, x_max, y_max)
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

image_index = 0  # Retrieve predictions for the first image and its text queries
text = texts[image_index]
boxes, scores, labels = results[image_index]["boxes"], results[image_index]["scores"], results[image_index]["labels"]

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
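
The thresholding and rescaling performed in the post-processing step can be pictured with a small standalone sketch. This is not the library's implementation: it assumes the raw predictions are normalized (center_x, center_y, width, height) boxes in the range [0, 1], and the helper names `to_absolute_corners` and `filter_and_rescale` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def to_absolute_corners(box_cxcywh: Box, image_height: int, image_width: int) -> Box:
    """Convert one normalized (cx, cy, w, h) box to absolute (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box_cxcywh
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return (x_min, y_min, x_max, y_max)


def filter_and_rescale(
    boxes: Sequence[Box],
    scores: Sequence[float],
    threshold: float,
    image_height: int,
    image_width: int,
) -> List[Tuple[Box, float]]:
    """Keep boxes whose score exceeds the threshold and rescale them to pixels."""
    kept = []
    for box, score in zip(boxes, scores):
        if score > threshold:
            kept.append((to_absolute_corners(box, image_height, image_width), score))
    return kept


# Illustrative values only: two candidate boxes on a 480x640 (height x width) image.
detections = filter_and_rescale(
    boxes=[(0.5, 0.5, 0.5, 0.5), (0.1, 0.1, 0.1, 0.1)],
    scores=[0.9, 0.05],
    threshold=0.1,
    image_height=480,
    image_width=640,
)
```

Lowering the threshold keeps more low-confidence boxes, so it is usually worth tuning per query set.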

✨ Features

Zero-shot Text-conditioned Detection: The OWLv2 model can perform object detection using one or multiple text queries without additional training on specific object classes.
Open-vocabulary Detection: The fixed classification layer weights are replaced with class-name embeddings from the text model, so objects can be detected from an open vocabulary.
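
The idea of using class-name embeddings in place of fixed classifier weights can be illustrated with a toy sketch. It assumes each image token and each text query has already been embedded into the same vector space, and it scores them by cosine similarity. The function names and the tiny two-dimensional vectors are hypothetical, and any learned scaling the real class head may apply is omitted.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def open_vocab_logits(
    token_embeddings: Sequence[Sequence[float]],
    query_embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return a [num_tokens][num_queries] matrix of cosine similarities."""
    tokens = [l2_normalize(t) for t in token_embeddings]
    queries = [l2_normalize(q) for q in query_embeddings]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


def best_query_per_token(logits: Sequence[Sequence[float]]) -> List[int]:
    """For each image token, return the index of the best-matching text query."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy embeddings: two image tokens and two text queries in a 2-D space.
logits = open_vocab_logits(
    token_embeddings=[[1.0, 0.0], [0.0, 2.0]],
    query_embeddings=[[0.0, 1.0], [3.0, 0.0]],
)
labels = best_query_per_token(logits)
```

Because the "classifier weights" are just the query embeddings, changing the vocabulary only requires embedding new text, not retraining a classification layer.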

📚 Documentation

Model Details

The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2, like OWL-ViT, is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries.

The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
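
The bipartite matching step behind that loss pairs each ground-truth object with exactly one prediction so that the total matching cost is minimal. The sketch below illustrates the idea with a brute-force search over a tiny, made-up cost matrix. The function name `match_predictions` is hypothetical, and this is not the authors' training code. Practical implementations typically use the Hungarian algorithm instead, for example `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def match_predictions(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Find the minimum-cost one-to-one matching by exhaustive search.

    cost[p][t] is the cost of assigning prediction p to ground-truth target t,
    and there must be at least as many predictions as targets.
    Returns (assignment, total) where assignment[t] is the prediction index
    matched to target t.
    """
    num_preds = len(cost)
    num_targets = len(cost[0])
    best_assignment: Tuple[int, ...] = ()
    best_total = float("inf")
    # Try every ordered choice of distinct predictions, one per target.
    for pred_indices in permutations(range(num_preds), num_targets):
        total = sum(cost[p][t] for t, p in enumerate(pred_indices))
        if total < best_total:
            best_assignment, best_total = pred_indices, total
    return list(best_assignment), best_total


# Made-up costs for 3 predictions (rows) and 2 ground-truth objects (columns).
assignment, total_cost = match_predictions(
    [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5],
    ]
)
```

Predictions left unmatched (here, the third row) would be treated as "no object" when the loss is computed.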

Model Date

June 2023

Model Type

Property	Details
Model Type	The model uses a CLIP backbone with a ViT-B/16 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.


Documents

OWLv2 Paper

🔧 Technical Details

Model Architecture

The OWLv2 model leverages the CLIP backbone. The image encoder is based on a ViT-B/16 Transformer architecture, while the text encoder is a masked self-attention Transformer. The model is trained to maximize the similarity between image-text pairs using a contrastive loss. The CLIP backbone is trained from scratch and then fine-tuned along with the box and class prediction heads for object detection.
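
A contrastive objective of this kind can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where entry (i, j) scores image i against caption j and the matching pairs lie on the diagonal. The code below is a simplified illustration in plain Python, not the training code. The function names are hypothetical and the temperature of 0.07 is only an illustrative assumption.

```python
import math
from typing import List, Sequence


def cross_entropy_diagonal(logits: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(similarity: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss."""
    logits = [[s / temperature for s in row] for row in similarity]
    transposed: List[List[float]] = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


# Matching pairs on the diagonal give a lower loss than mismatched pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss([[0.0, 1.0], [1.0, 0.0]])
uninformative = contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
```

Minimizing this loss pushes matching image-text pairs toward high similarity and non-matching pairs toward low similarity.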

Training Data

The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

(to be updated for v2)

BibTeX entry and citation info

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection},
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

The model is licensed under the Apache-2.0 license.
