Open-source owlv2-large-patch14 model - Free zero-shot object detection in images from text queries

Owlv2 Large Patch14

Developed by Google

OWLv2 is a zero-shot text-conditioned object detection model that can detect objects in images through text queries without requiring category-specific training data.

Zero-Shot Object Detection

Transformers

Open Source License: Apache-2.0 #Zero-shot Object Detection #Open-vocabulary Localization #Multimodal Vision Model

Downloads 3,679

Release Time: 10/13/2023

Model Overview

OWLv2 is a CLIP-based open-vocabulary object detection model using ViT-L/14 as the visual encoder, capable of detecting objects in images through natural language descriptions.

Model Features

Zero-shot Detection Capability

Detects objects from novel categories through text descriptions, without requiring category-specific training data.

Open-vocabulary Understanding

Capable of understanding and detecting object categories not present in training data.

Multi-query Detection

Supports simultaneous object detection using multiple text queries.

Model Capabilities

Object detection in images

Text-conditioned object localization

Open-vocabulary recognition

Simultaneous multi-category detection

Use Cases

Computer Vision Research

Zero-shot Object Detection Research

Investigating the model's detection capability on unseen categories

Industrial Applications

Inventory Management

Detecting items in warehouses through natural language descriptions

🚀 Model Card: OWLv2

The OWLv2 model offers zero-shot text-conditioned object detection. It uses CLIP as a backbone and can be queried with text to detect objects in an image, which is valuable for research in computer vision.

🚀 Quick Start

To use the OWLv2 model with the Transformers library, refer to the following code example:

import requests
from PIL import Image
import torch

from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
for box, score, label in zip(boxes, scores, labels):
    # Round coordinates for readable output
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
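The post-processing call above turns raw predictions into thresholded boxes in pixel coordinates. The following is a minimal pure-Python sketch of that kind of step, not the library's implementation. It assumes raw boxes arrive as normalized (cx, cy, w, h) and that each score is a sigmoid over the best per-query logit; the helper names `center_to_corners` and `post_process` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def center_to_corners(box, height, width):
    """Convert a normalized (cx, cy, w, h) box to pixel (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    ]


def post_process(logits, boxes, height, width, threshold=0.1):
    """Keep, for each predicted box, its best-scoring text query if above threshold.

    logits: one list of per-query logits per predicted box (assumed layout).
    boxes:  one normalized (cx, cy, w, h) box per prediction (assumed layout).
    """
    detections = []
    for box_logits, box in zip(logits, boxes):
        label = max(range(len(box_logits)), key=box_logits.__getitem__)
        score = sigmoid(box_logits[label])
        if score > threshold:
            detections.append(
                {"label": label, "score": score, "box": center_to_corners(box, height, width)}
            )
    return detections


# Toy input: two predicted boxes scored against two text queries.
toy_logits = [[2.0, -3.0], [-5.0, -4.0]]
toy_boxes = [[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.1, 0.1]]
for det in post_process(toy_logits, toy_boxes, height=100, width=200):
    print(det)
```

Only the first toy box survives the threshold, because the best logit of the second box maps to a score well below 0.1.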

✨ Features

Zero-shot Detection: The OWLv2 model can perform zero-shot text-conditioned object detection, allowing users to query an image with one or multiple text queries.
Multi-modal Backbone: It uses CLIP as its multi-modal backbone, combining a ViT-like Transformer for visual features and a causal language model for text features.
Open-vocabulary Classification: By replacing the fixed classification layer weights with class-name embeddings from the text model, it enables open-vocabulary classification.
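The open-vocabulary classification idea in the list above can be illustrated with a small sketch: instead of a fixed weight matrix, the class logits for each image token are its similarities to the text-query embeddings. This is a simplified illustration in pure Python, assuming cosine similarity and toy 2-D embeddings; any learned projection, shift, or scale the real model applies is omitted, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def open_vocab_logits(token_embeddings, query_embeddings):
    """Return one row of per-query cosine similarities for each image token."""
    tokens = [l2_normalize(t) for t in token_embeddings]
    queries = [l2_normalize(q) for q in query_embeddings]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


# Toy embeddings: two image tokens, two text queries (e.g. "cat", "dog").
image_tokens = [[1.0, 0.0], [0.0, 2.0]]
text_queries = [[3.0, 0.0], [1.0, 1.0]]

for row in open_vocab_logits(image_tokens, text_queries):
    best = max(range(len(row)), key=row.__getitem__)
    print(f"similarities={row}, best query index={best}")
```

Because the "classifier weights" are just text embeddings, adding a new category only requires embedding a new query string; no classification layer is retrained.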

📚 Documentation

Model Details

The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. Similar to OWL-ViT, it is a zero-shot text-conditioned object detection model.

The model uses CLIP as its multi-modal backbone. It has a ViT-like Transformer to obtain visual features and a causal language model to get text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is achieved by replacing the fixed classification layer weights with class-name embeddings from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used for zero-shot text-conditioned object detection.
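The bipartite matching loss mentioned above pairs each ground-truth box with exactly one prediction before the loss is computed. The sketch below shows only the matching step under simplifying assumptions: the cost is a plain L1 distance between boxes (real detection losses typically also include classification and overlap terms), and the optimal assignment is found by brute force over permutations rather than the Hungarian algorithm, which is only practical for tiny inputs. All names are hypothetical.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes given as equal-length coordinate lists."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match(preds, targets):
    """Find the one-to-one assignment of targets to predictions with minimal total cost.

    Assumes len(preds) >= len(targets). Returns (pairs, cost), where pairs is a
    list of (pred_index, target_index) ordered by target index.
    """
    best_cost, best_pairs = float("inf"), []
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(pred_indices))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_indices)]
    return best_pairs, best_cost


# Toy data: three predicted boxes, two ground-truth boxes, all (cx, cy, w, h).
predictions = [[0.9, 0.9, 0.1, 0.1], [0.2, 0.2, 0.1, 0.1], [0.5, 0.5, 0.3, 0.3]]
ground_truth = [[0.2, 0.25, 0.1, 0.1], [0.88, 0.9, 0.1, 0.1]]

pairs, total = match(predictions, ground_truth)
print(pairs, round(total, 4))
```

Predictions left unmatched (index 2 in the toy data) would be treated as background in a full training loss.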

Model Date

June 2023

Model Type

Property	Details
Model Type	The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.
Training Data	The CLIP backbone of the model was trained on publicly available image-caption data through a combination of crawling websites and using pre-existing image datasets such as YFCC100M. A large portion of the data comes from internet crawling. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.
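The contrastive objective named in the table can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where entry (i, j) compares image i with caption j and the matching pairs sit on the diagonal. This is an illustrative pure-Python version, not the training code; the temperature value 0.07 and the function names are assumptions made for the example.

```python
import math


def cross_entropy_diagonal(rows):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(rows)


def contrastive_loss(similarity, temperature=0.07):
    """Average the image-to-text and text-to-image cross-entropy losses."""
    scaled = [[v / temperature for v in row] for row in similarity]
    transposed = [list(col) for col in zip(*scaled)]
    return 0.5 * (cross_entropy_diagonal(scaled) + cross_entropy_diagonal(transposed))


# Matching pairs score high on the diagonal in `aligned`, low in `misaligned`.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(aligned) < contrastive_loss(misaligned))
```

Minimizing this loss pushes matching (image, text) pairs toward high similarity and mismatched pairs toward low similarity, which is what makes text embeddings usable as detection queries later.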

Documents

OWLv2 Paper

Model Use

Intended Use

The model is intended as a research output for research communities. It aims to help researchers better understand and explore zero-shot, text-conditioned object detection. It can also be used for interdisciplinary studies of the potential impact of such models, especially in areas that require identifying objects whose labels are unavailable during training.

Primary intended uses

The primary intended users of these models are AI researchers. We mainly envision that researchers will use the model to better understand the robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

Data

The CLIP backbone of the model was trained on publicly available image-caption data. This was accomplished by crawling several websites and using commonly used pre-existing image datasets like YFCC100M. A significant part of the data comes from internet crawling, which means the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

(to be updated for v2)

BibTeX entry and citation info

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection}, 
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

📄 License

The model is released under the Apache-2.0 license.
