OWLv2-base-patch16 Open-Source Model - Zero-Shot Retrieval of Image Objects by Text Query

Owlv2 Base Patch16

Developed by Google

OWLv2 is a zero-shot, text-conditioned object detection model that retrieves objects in an image from text queries.

Zero-shot object detection

Transformers

License: Apache-2.0 #zero-shot object detection #open-vocabulary localization #CLIP backbone

Downloads: 15.42k

Released: October 13, 2023

Model Overview

OWLv2 is an open-world localization model built on a CLIP backbone that supports zero-shot object detection through text queries.

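Model Features

Zero-shot detection

Detects new objects from text queries without any class-specific training

Open-vocabulary classification

Detects arbitrary text-defined categories by replacing the weights of the classification layer

Multi-query support

Searches a single image for objects matching several text descriptions at once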
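
When several queries are run against one image, each detection carries a label index that points back into the query list. The sketch below shows one way to regroup detections by query text. The helper name `group_by_query`, the `min_score` parameter, and the toy values are hypothetical illustrations, not model output or part of the Transformers API.

```python
from collections import defaultdict


def group_by_query(queries, labels, scores, boxes, min_score=0.2):
    """Group detections by the text query that each label index refers to.

    queries: list of query strings for one image.
    labels:  one index into `queries` per detection.
    scores:  one confidence value per detection.
    boxes:   one (xmin, ymin, xmax, ymax) tuple per detection.
    """
    grouped = defaultdict(list)
    for label, score, box in zip(labels, scores, boxes):
        # Drop low-confidence detections before grouping.
        if score >= min_score:
            grouped[queries[label]].append((score, box))
    # Within each query, list the most confident detection first.
    return {
        query: sorted(hits, key=lambda hit: hit[0], reverse=True)
        for query, hits in grouped.items()
    }


# Toy values for illustration only (not real predictions).
queries = ["a photo of a cat", "a photo of a dog"]
labels = [0, 0, 1]
scores = [0.61, 0.15, 0.35]
boxes = [(10, 20, 110, 220), (0, 0, 5, 5), (200, 40, 320, 260)]

detections = group_by_query(queries, labels, scores, boxes)
for query, hits in detections.items():
    print(query, hits)
```

With real model output, the tensors would first be converted to plain Python values, for example with `.tolist()`.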

Model Capabilities

Image object detection

Text-conditioned search

Open-vocabulary recognition

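Use Cases

Computer vision research

Zero-shot detection research

Exploring the model's ability to recognize categories it has not seen

Cross-disciplinary applications

Object recognition in specialized domains

Detecting objects in domains that lack annotated data (such as medical imaging)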

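🚀 Model Card: OWLv2

The OWLv2 model (short for Open-World Localization) is a zero-shot, text-conditioned object detection model that can query an image with one or more text queries. It uses CLIP as its multimodal backbone, combining visual and text features to enable open-vocabulary object detection.

🚀 Quick Start

Calling the model with the Transformers library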

import requests
from PIL import Image
import numpy as np
import torch
from transformers import AutoProcessor, Owlv2ForObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Note: boxes need to be visualized on the padded, unnormalized image
# hence we'll set the target image sizes (height, width) based on that

def get_preprocessed_image(pixel_values):
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    unnormalized_image = Image.fromarray(unnormalized_image)
    return unnormalized_image

unnormalized_image = get_preprocessed_image(inputs.pixel_values)

target_sizes = torch.Tensor([unnormalized_image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to final bounding boxes and scores
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.2, target_sizes=target_sizes
)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]  # avoid reusing the name `i` from above
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
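
The comment in the snippet above notes that the predicted boxes refer to the padded image. The following is a minimal sketch of how normalized boxes could be mapped back to pixel coordinates in the original image. It assumes the preprocessing pads the image to a square on the bottom and right only, so that the original image sits in the top-left corner; check this against the image processor in your installed version. The helper name `boxes_to_original_pixels` is hypothetical.

```python
def boxes_to_original_pixels(norm_boxes, orig_width, orig_height):
    """Map normalized corner boxes on the padded square to original-image pixels.

    norm_boxes: iterable of (xmin, ymin, xmax, ymax) in [0, 1], relative to
    the padded square input.
    ASSUMPTION: padding was added on the bottom and right only.
    """
    # Side length of the padded square, measured in original-image pixels.
    side = max(orig_width, orig_height)

    def clip(value, upper):
        return min(max(value, 0.0), upper)

    result = []
    for xmin, ymin, xmax, ymax in norm_boxes:
        # Scale by the square's side, then clip to the real image area so
        # boxes that extend into the padding are cut at the image border.
        result.append((
            clip(xmin * side, orig_width),
            clip(ymin * side, orig_height),
            clip(xmax * side, orig_width),
            clip(ymax * side, orig_height),
        ))
    return result


# A 640x480 image is padded to 640x640, so the lower part of this box
# falls into the padding and is clipped at y = 480.
example = boxes_to_original_pixels([(0.5, 0.5, 1.0, 1.0)], orig_width=640, orig_height=480)
print(example)
```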

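✨ Key Features

Zero-shot detection: performs zero-shot object detection from text queries, with no class-specific training required.
Open-vocabulary detection: achieves open-vocabulary classification by using class-name embeddings obtained from the text model.
Multimodal backbone: uses CLIP as the multimodal backbone, combining visual and text features.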
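
The open-vocabulary idea above can be illustrated with a small NumPy sketch in which text-query embeddings take the place of fixed classification-layer weights. This is a conceptual illustration under simplifying assumptions (plain cosine similarity, toy 3-dimensional embeddings), not the model's actual prediction head; the function name `classify_open_vocabulary` is hypothetical.

```python
import numpy as np


def classify_open_vocabulary(region_embeds, query_embeds):
    """Score each image region against each text query by cosine similarity.

    region_embeds: (num_regions, dim) array, one embedding per candidate box.
    query_embeds:  (num_queries, dim) array, one embedding per text query.
    Both inputs are hypothetical stand-ins for real model outputs.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    regions = region_embeds / np.linalg.norm(region_embeds, axis=-1, keepdims=True)
    queries = query_embeds / np.linalg.norm(query_embeds, axis=-1, keepdims=True)
    # The query embeddings act as the classification weights, so changing
    # the queries changes the set of classes without any retraining.
    logits = regions @ queries.T          # shape: (num_regions, num_queries)
    labels = logits.argmax(axis=-1)       # best-matching query per region
    scores = logits.max(axis=-1)          # similarity of that best match
    return labels, scores, logits


# Toy embeddings for illustration only.
region_embeds = np.array([[1.0, 0.1, 0.0], [0.0, 0.9, 0.2]])
query_embeds = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])

labels, scores, logits = classify_open_vocabulary(region_embeds, query_embeds)
print(labels.tolist(), scores.round(3).tolist())
```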

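📦 Model Details

Model Background

The OWLv2 model (short for Open-World Localization) was proposed by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby in the paper Scaling Open-Vocabulary Object Detection. Like OWL-ViT, OWLv2 is a zero-shot, text-conditioned object detection model that can query an image with one or more text queries.

Model Date

June 2023

Model Type

The model uses CLIP as its backbone, with a ViT-B/16 Transformer architecture as the image encoder and a masked self-attention Transformer as the text encoder. These encoders are trained with a contrastive loss to maximize the similarity of matching (image, text) pairs. The CLIP backbone is trained from scratch and then fine-tuned together with the box and class prediction heads on the object detection task.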
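
To make the contrastive objective concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. It is an illustration under stated assumptions rather than the training code used for this model: the function name is hypothetical, the temperature of 0.07 is an assumed value, and random vectors stand in for encoder outputs.

```python
import numpy as np


def clip_style_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities.

    Row i of `image_embeds` is assumed to be paired with row i of
    `text_embeds`; every other row in the batch serves as a negative.
    """
    # Normalize so that dot products are cosine similarities.
    img = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature    # shape: (batch, batch)

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row; the correct
        # "class" for row i is column i (the matching pair).
        lg = lg - lg.max(axis=-1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(4, 8))

# Perfectly aligned pairs should give a lower loss than mismatched pairs.
matched = clip_style_contrastive_loss(image_embeds, image_embeds)
shuffled = clip_style_contrastive_loss(image_embeds, image_embeds[::-1])
print(matched < shuffled)
```

Minimizing this loss pulls each matching pair together while pushing apart the non-matching pairs in the same batch.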

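📚 Model Use

Intended Use

This model is intended as a research output for the research community. We hope it will help researchers better understand and explore zero-shot, text-conditioned object detection. We also expect it to be useful for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose labels are unavailable during training.

Primary Intended Users

The primary intended users of these models are AI researchers. We mainly envision researchers using the model to better understand the robustness, generalization, and other capabilities, biases, and limitations of computer vision models.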

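🔧 Data

The model's CLIP backbone was trained on publicly available image-caption data, gathered by crawling a handful of websites and combining the results with commonly used existing image datasets such as YFCC100M. Most of the data comes from crawling the internet, which means it is more representative of the people and societies most connected to the internet. The OWL-ViT prediction heads, together with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as COCO and OpenImages.

BibTeX Citation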

@misc{minderer2023scaling,
      title={Scaling Open-Vocabulary Object Detection}, 
      author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
      year={2023},
      eprint={2306.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}