deformable-detr-box-supervised: open-source object detection model


Deformable Detr Box Supervised

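Developed by facebook

Deformable DETR is a Transformer-based object detection model trained on the LVIS dataset. It detects objects from 1,203 categories.

Object detection

Transformers

License: Apache-2.0 #multi-class object detection #Transformer architecture #LVIS dataset

Downloads: 193

Release date: 2/27/2023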

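Model overview

The model uses the Deformable DETR architecture. A convolutional backbone feeds a Transformer encoder-decoder, and a set of learned object queries lets the model detect objects efficiently.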

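Model features

Large-scale category detection

Detects all 1,203 object categories in LVIS, including rare ones.

Efficient Transformer architecture

Uses the Deformable DETR architecture. Its deformable attention attends only to a small set of sampling points around each reference point, not to every location in the feature map, which lowers computational cost.

End-to-end training

Outputs the final set of detections directly, without hand-designed post-processing such as non-maximum suppression (NMS).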
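Deformable attention is described above as sampling only a few points. The sketch below illustrates that idea in a simplified form: one head, one feature scale, and a one-channel feature map, written in plain Python. In the real model, linear layers predict the sampling offsets and attention weights from each query, and a value projection is applied. Here the offsets and weights are passed in directly and the projection is omitted. The function names (bilinear_sample, deformable_attention_point) are hypothetical, and this is not the Transformers library's implementation.

```python
import math
from typing import List, Sequence, Tuple

FeatureMap = List[List[float]]  # H x W, single channel (simplification)


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """Bilinearly interpolate fmap at continuous (x, y); positions outside count as 0."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                total += weight * fmap[yi][xi]
    return total


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the attention logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def deformable_attention_point(
    fmap: FeatureMap,
    reference: Tuple[float, float],
    offsets: Sequence[Tuple[float, float]],
    attn_logits: Sequence[float],
) -> float:
    """Single-head, single-scale deformable attention for one query.

    Only len(offsets) locations are read, instead of all H * W positions.
    """
    weights = softmax(attn_logits)
    rx, ry = reference
    return sum(
        a * bilinear_sample(fmap, rx + dx, ry + dy)
        for a, (dx, dy) in zip(weights, offsets)
    )


if __name__ == "__main__":
    fmap = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
    out = deformable_attention_point(fmap, (1.0, 1.0), [(0.5, 0.0), (0.0, 0.5)], [0.0, 0.0])
    print(out)  # 5.0
```

In the demo, the two sampled values are 4.5 and 5.5, and they are averaged with equal weights. Each query's cost grows with the number of sampling points K, not with the feature-map size, and that is where the efficiency comes from.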

Model capabilities

Multi-class object detection

Bounding box prediction

Large-scale visual recognition

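Use cases

General object detection

Scene understanding

Detects many kinds of objects in complex scenes

Achieves 31.7 box mAP on LVIS

Rare object detection

Recognizes uncommon object categories

Achieves 21.4 mAP on rare categories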

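🚀 Deformable DETR model (trained on LVIS)

This is a Deformable DEtection TRansformer (Deformable DETR) model trained on LVIS, which has 1,203 classes. It was introduced in the paper Detecting Twenty-thousand Classes using Image-level Supervision by Zhou et al. and first released in this repository. This model corresponds to the "Box-Supervised_DeformDETR_R50_4x" checkpoint released in the original repository.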

Disclaimer

The team releasing Detic did not write a model card for this model, so this model card was written by the Hugging Face team.

🚀 Quick start

You can use the raw model for object detection. See the model hub to find all available Deformable DETR models.

✨ Key features

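Model architecture: The DETR model is an encoder-decoder Transformer with a convolutional backbone. Two heads are added on top of the decoder outputs to perform object detection: a linear layer for the class labels and a multilayer perceptron (MLP) for the bounding boxes. The model uses so-called object queries to detect objects in an image, and each object query looks for a particular object. For COCO, the number of object queries is set to 100.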
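Training loss: The model is trained with a "bipartite matching loss". The predicted classes and bounding boxes of each of the N = 100 object queries are compared with the ground-truth annotations, which are padded to the same length N. For example, if an image contains only 4 objects, 96 annotations have "no object" as their class and "no bounding box" as their box. The Hungarian matching algorithm then creates an optimal one-to-one mapping between the N queries and the N annotations. Finally, the model's parameters are optimized with standard cross-entropy for the classes and a linear combination of the L1 and generalized IoU losses for the bounding boxes.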
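The matching step in the training loss can be illustrated with a toy sketch. It makes several assumptions. Boxes use corner format. The pairwise cost combines the negative class probability, a weighted L1 term and a weighted negative generalized IoU term, and the weights are illustrative. A brute-force search over assignments replaces the Hungarian algorithm; practical implementations typically use a Hungarian solver such as scipy.optimize.linear_sum_assignment. Queries left unmatched receive the "no object" class. All function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# Illustrative cost weights (assumption, not read from this checkpoint).
W_CLASS, W_L1, W_GIOU = 1.0, 5.0, 2.0


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def generalized_iou(a: Box, b: Box) -> float:
    """GIoU = IoU - (|C| - |A U B|) / |C|, C = smallest box enclosing A and B.

    Assumes non-degenerate boxes (positive area), so there is no division by zero.
    """
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    enclose = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (enclose - union) / enclose


def pair_cost(probs: Sequence[float], pred_box: Box, label: int, gt_box: Box) -> float:
    """Matching cost between one query's prediction and one ground-truth object."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    return -W_CLASS * probs[label] + W_L1 * l1 - W_GIOU * generalized_iou(pred_box, gt_box)


def match(
    pred_probs: Sequence[Sequence[float]],
    pred_boxes: Sequence[Box],
    gt_labels: Sequence[int],
    gt_boxes: Sequence[Box],
) -> Dict[int, int]:
    """Return {query index: ground-truth index}; unmatched queries mean "no object".

    Brute force over every one-to-one assignment of targets to queries. It finds a
    minimum-cost assignment, the quantity the Hungarian algorithm minimizes, but
    only scales to toy sizes.
    """
    best_cost, best = float("inf"), ()
    for queries in permutations(range(len(pred_boxes)), len(gt_boxes)):
        # queries[j] is the query assigned to ground-truth object j
        cost = sum(
            pair_cost(pred_probs[q], pred_boxes[q], gt_labels[j], gt_boxes[j])
            for j, q in enumerate(queries)
        )
        if cost < best_cost:
            best_cost, best = cost, queries
    return {q: j for j, q in enumerate(best)}


if __name__ == "__main__":
    probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]  # 3 queries, 2 classes
    boxes = [(0.0, 0.0, 0.4, 0.4), (0.5, 0.5, 1.0, 1.0), (0.2, 0.2, 0.3, 0.3)]
    print(match(probs, boxes, [0, 1], [(0.5, 0.5, 1.0, 1.0), (0.0, 0.0, 0.4, 0.4)]))
    # {1: 0, 0: 1}; query 2 is "no object"
```

The brute-force search enumerates N!/(N-M)! assignments for N queries and M objects. It is only meant to make clear what the Hungarian algorithm optimizes.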

[Figure: model architecture diagram]

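📦 Installation

The original model card gives no installation steps. The usage example needs the transformers, torch, Pillow, and requests Python packages.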

💻 Usage examples

Basic usage

from transformers import AutoImageProcessor, DeformableDetrForObjectDetection
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/deformable-detr-box-supervised")
model = DeformableDetrForObjectDetection.from_pretrained("facebook/deformable-detr-box-supervised")

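inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# convert outputs (bounding boxes and class logits) to scores, labels and
# absolute (x_min, y_min, x_max, y_max) boxes in pixels
# let's only keep detections with score > 0.7
target_sizes = torch.tensor([image.size[::-1]])  # PIL gives (width, height); we need (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]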

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
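The post-processing call in the example converts raw outputs into scored boxes in pixel coordinates. The sketch below shows the two core steps in plain Python. It assumes that each query predicts a normalized (center_x, center_y, width, height) box and class scores already in [0, 1]. For each query, it keeps the best class if that score clears the threshold, then scales the box to pixel corners. This is a simplified, hypothetical stand-in (center_to_corners, simple_post_process). The library's own post-processing may rank and select detections differently.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def center_to_corners(box: Box, img_w: int, img_h: int) -> Box:
    """Normalized (cx, cy, w, h) -> absolute (x_min, y_min, x_max, y_max) in pixels."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def simple_post_process(
    scores: Sequence[Sequence[float]],  # per-query class scores, already in [0, 1]
    boxes: Sequence[Box],               # per-query normalized (cx, cy, w, h)
    img_w: int,
    img_h: int,
    threshold: float = 0.7,
) -> List[Tuple[float, int, Box]]:
    """Keep each query's best class if its score exceeds the threshold (simplified)."""
    results: List[Tuple[float, int, Box]] = []
    for query_scores, box in zip(scores, boxes):
        label = max(range(len(query_scores)), key=query_scores.__getitem__)
        score = query_scores[label]
        if score > threshold:
            results.append((score, label, center_to_corners(box, img_w, img_h)))
    return results


if __name__ == "__main__":
    # Hypothetical outputs for two queries on a 640x480 image.
    scores = [[0.1, 0.9], [0.6, 0.3]]
    boxes = [(0.5, 0.5, 0.5, 0.5), (0.1, 0.1, 0.2, 0.2)]
    print(simple_post_process(scores, boxes, 640, 480))
    # [(0.9, 1, (160.0, 120.0, 480.0, 360.0))]
```

The second query's best score is 0.6, which is below the 0.7 threshold, so the query is dropped. The same threshold appears in the usage example.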

📚 Detailed documentation

Evaluation results

This model achieves 31.7 box mAP on LVIS and 21.4 mAP on the rare classes.
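The rare-class figure is an average over a subset of categories. LVIS groups its categories by how many training images contain them: rare (1 to 10 images), common (11 to 100), and frequent (more than 100). The grouping step can be sketched as below, assuming per-category AP values have already been computed (for example with the LVIS evaluation tools). The function names and toy numbers are hypothetical and are not real LVIS results.

```python
from statistics import mean
from typing import Dict, List


def frequency_group(num_train_images: int) -> str:
    """Bucket a category by the number of training images that contain it."""
    if num_train_images <= 10:
        return "rare"
    if num_train_images <= 100:
        return "common"
    return "frequent"


def ap_by_group(per_class_ap: Dict[str, float], image_counts: Dict[str, int]) -> Dict[str, float]:
    """Average per-category AP within each frequency group ('rare' gives the rare-class AP)."""
    groups: Dict[str, List[float]] = {}
    for name, ap in per_class_ap.items():
        groups.setdefault(frequency_group(image_counts[name]), []).append(ap)
    return {group: mean(values) for group, values in groups.items()}


if __name__ == "__main__":
    # Toy numbers, not real LVIS results.
    aps = {"cat_a": 0.25, "cat_b": 0.75, "cat_c": 0.5, "cat_d": 0.625}
    counts = {"cat_a": 3, "cat_b": 8, "cat_c": 40, "cat_d": 900}
    print(ap_by_group(aps, counts))  # {'rare': 0.5, 'common': 0.5, 'frequent': 0.625}
```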

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2010.04159,
  doi = {10.48550/ARXIV.2010.04159},
  url = {https://arxiv.org/abs/2010.04159}, 
  author = {Zhu, Xizhou and Su, Weijie and Lu, Lewei and Li, Bin and Wang, Xiaogang and Dai, Jifeng},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Deformable DETR: Deformable Transformers for End-to-End Object Detection},
  publisher = {arXiv},
  year = {2020},
  copyright = {arXiv.org perpetual, non-exclusive license}
}