Open-source OWLv2-large-patch14-finetuned model - Zero-shot detection of objects in images without specific training data

Owlv2 Large Patch14 Finetuned

Developed by Google

OWLv2 is a zero-shot text-conditioned object detection model that can detect objects in images through text queries without requiring category-specific training data.

Zero-Shot Object Detection

Transformers

Open Source License: Apache-2.0 #Zero-shot object detection #Open-vocabulary recognition #Multimodal vision model

Downloads 1,434

Release Time: 10/14/2023

Model Overview

OWLv2 is a zero-shot text-conditioned object detection model built on a CLIP backbone, capable of detecting objects in images using one or more text queries. It uses ViT-L/14 as the visual encoder, is trained with a contrastive loss, and is then fine-tuned on standard detection datasets.

Model Features

Zero-shot detection capability

Detects objects in images through text queries without requiring category-specific training data.

Open-vocabulary classification

Supports detection of arbitrary class names by replacing fixed classification layer weights with text embeddings.

Multi-query detection

Supports simultaneous detection of different objects in images using one or more text queries.
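
The open-vocabulary feature described above can be illustrated with a minimal sketch: instead of a fixed classification layer, each image-token embedding is compared against the embeddings of the text queries, and the best-matching query becomes the predicted class. The function names and the toy 3-dimensional embeddings below are assumptions made for illustration, not part of the Transformers API, and the learned projections and logit scaling used by the real model are omitted.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def classify_open_vocab(token_embedding, query_embeddings):
    """Score one image-token embedding against every text-query embedding.

    In a closed-vocabulary detector, the rows of `query_embeddings` would be
    fixed, learned classifier weights. Here they stand in for text embeddings,
    so the set of classes can change at inference time.
    """
    token = l2_normalize(token_embedding)
    scores = []
    for query in query_embeddings:
        unit_query = l2_normalize(query)
        scores.append(sum(t * q for t, q in zip(token, unit_query)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy embeddings (made-up numbers, not real model outputs).
queries = ["a photo of a cat", "a photo of a dog"]
query_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
token_embedding = [0.8, 0.2, 0.1]

best_index, similarities = classify_open_vocab(token_embedding, query_embeddings)
print(queries[best_index], [round(s, 3) for s in similarities])
```

Adding a new category in this scheme only requires appending another text embedding to `query_embeddings`; no classifier weights need to be retrained.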

Model Capabilities

Text-conditioned object detection

Open-vocabulary object recognition

Multi-category simultaneous detection

Use Cases

Computer vision research

Zero-shot object detection research

Used to study the model's detection capability on unseen categories.

Interdisciplinary applications

Special scenario object recognition

Performs object detection in specialized fields (e.g., medical, industrial) where training data is difficult to obtain.

🚀 Model Card: OWLv2

The OWLv2 model is a zero-shot text-conditioned object detection model. It can query an image using one or more text queries, offering a new approach to open-world object detection.

🚀 Quick Start

To use the OWLv2 model with the Transformers library, you can follow this code example:

import requests
from PIL import Image
import torch

from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14-finetuned")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14-finetuned")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Target image sizes (height, width) used to rescale box predictions, shape [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert raw outputs (bounding boxes and class logits) to final boxes, scores, and labels
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image and its text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
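
To make the post-processing step less opaque, here is a minimal pure-Python sketch of what such a step typically involves: turning per-query logits into scores, dropping low-confidence predictions, and rescaling normalized boxes to pixel coordinates. It assumes the raw boxes are normalized (center x, center y, width, height) values and that scores come from a sigmoid over the best query logit. The helper names are hypothetical and this is not the library's implementation.

```python
import math

def sigmoid(x):
    """Map a raw logit to a confidence score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def center_to_corners(box, image_height, image_width):
    """Convert a normalized (cx, cy, w, h) box to absolute (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    ]

def post_process(logits, boxes, image_size, threshold=0.1):
    """Keep predictions whose best per-query score exceeds `threshold`."""
    height, width = image_size
    detections = []
    for query_logits, box in zip(logits, boxes):
        # The label is the index of the best-scoring text query for this prediction.
        label = max(range(len(query_logits)), key=query_logits.__getitem__)
        score = sigmoid(query_logits[label])
        if score > threshold:
            detections.append(
                {"score": score, "label": label, "box": center_to_corners(box, height, width)}
            )
    return detections

# Hypothetical raw predictions: two candidate boxes scored against two text queries.
logits = [[2.0, -3.0], [-4.0, -5.0]]
boxes = [[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.1, 0.1]]
detections = post_process(logits, boxes, image_size=(480, 640))
for det in detections:
    print(det["label"], round(det["score"], 3), [round(v, 1) for v in det["box"]])
```

Raising `threshold` trades recall for precision: fewer boxes survive, but the ones that do are more confident.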

✨ Features

Zero-shot Detection: OWLv2 performs zero-shot text-conditioned object detection, allowing users to query images with text without prior training on specific object classes.
Multi-modal Backbone: It uses CLIP as its multi-modal backbone, combining visual and text features.
Open-vocabulary Classification: Enables open-vocabulary classification by replacing fixed classification layer weights with class-name embeddings from the text model.



📚 Documentation

OWLv2 paper: Scaling Open-Vocabulary Object Detection (arXiv:2306.09683)

🔧 Technical Details

Model Details

The OWLv2 model (short for Open-World Localization) was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. It uses CLIP as its multi-modal backbone, with a ViT-like Transformer for visual features and a causal language model for text features.

To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss.
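
The bipartite matching mentioned above pairs each ground-truth box with exactly one prediction so that the total matching cost is minimal, and the loss is then computed on those pairs. The sketch below is a simplified illustration under stated assumptions: it uses only an L1 box distance as the cost (the actual cost terms are not given in this card) and brute-forces all assignments, whereas practical implementations typically use the Hungarian algorithm. All names are hypothetical.

```python
from itertools import permutations

def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

def match_predictions(pred_boxes, target_boxes):
    """Find the one-to-one assignment of targets to predictions with minimal total cost.

    Returns a tuple where entry t is the index of the prediction matched to
    target t, together with the total cost. Brute force over permutations is
    only viable for a handful of boxes; it is used here for clarity.
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(l1_distance(pred_boxes[p], target_boxes[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_assignment, best_cost

# Three predictions and two ground-truth boxes in normalized (cx, cy, w, h) form (made-up values).
pred_boxes = [[0.1, 0.1, 0.2, 0.2], [0.5, 0.5, 0.3, 0.3], [0.8, 0.8, 0.1, 0.1]]
target_boxes = [[0.52, 0.48, 0.3, 0.3], [0.12, 0.1, 0.2, 0.2]]

assignment, total_cost = match_predictions(pred_boxes, target_boxes)
print(assignment, round(total_cost, 4))
```

Predictions left unmatched (index 2 in this toy case) are usually treated as background when the loss is computed.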

Model Date

June 2023

Model Type

| Property | Details |
| --- | --- |
| Model Type | The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective. |
| Training Data | The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from crawling the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages. |
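
The contrastive objective described under Model Type can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where matching pairs sit on the diagonal. This is a minimal illustration of a CLIP-style loss, not the training code for this model. The temperature value of 0.07 and the toy similarity matrices are assumptions.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an n x n similarity matrix.

    similarity[i][j] is the similarity of image i and text j; the correct
    text for image i is assumed to be text i (the diagonal).
    """
    n = len(similarity)
    scaled = [[s / temperature for s in row] for row in similarity]
    # Image-to-text direction: each row should peak at its diagonal entry.
    image_to_text = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak at its diagonal entry.
    columns = [list(col) for col in zip(*scaled)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Made-up cosine similarities for a batch of two (image, text) pairs.
aligned = [[0.9, 0.1], [0.2, 0.8]]    # matching pairs are most similar
shuffled = [[0.1, 0.9], [0.8, 0.2]]   # mismatched pairs are most similar

print(contrastive_loss(aligned) < contrastive_loss(shuffled))
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched ones apart, which is what later allows text embeddings to act as classifier weights.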

📄 License

The model is licensed under the Apache 2.0 license.

BibTeX entry and citation info

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection}, 
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
