# Model Card for DAB-DETR
This model card provides a comprehensive overview of a 🤗 Transformers model for object detection. It details the model's architecture, training process, evaluation results, and more.
## Table of Contents
- Model Details
- Model Sources
- Quick Start
- Training Details
- Evaluation
- Model Architecture and Objective
- Citation
## Model Details

In this paper, we introduce a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new approach directly uses box coordinates as queries in Transformer decoders and updates them dynamically layer by layer. Using box coordinates not only leverages explicit positional priors to enhance the query-to-feature similarity and resolve the slow training convergence issue in DETR but also enables us to modulate the positional attention map using box width and height information. This design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer by layer in a cascade manner. As a result, it achieves the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting, e.g., AP 45.7% using ResNet50-DC5 as the backbone trained in 50 epochs. We also conducted extensive experiments to validate our analysis and confirm the effectiveness of our methods.
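To make the core idea concrete, here is a minimal, illustrative sketch of anchor boxes acting as positional queries and being refined layer by layer. It is not the actual implementation: `sine_embed`, `inverse_sigmoid`, and `bbox_head` are hypothetical stand-ins, and the decoder features are random placeholders.

```python
import torch
import torch.nn as nn

def sine_embed(coord, num_feats=64, temperature=20):
    # Hypothetical sinusoidal embedding of one normalized coordinate.
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)
    pos = (coord[..., None] * 2 * torch.pi) / dim_t
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-2)

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

num_queries, hidden = 300, 256
anchors = torch.rand(num_queries, 4)                # (x, y, w, h), normalized to [0, 1]
bbox_head = nn.Linear(hidden, 4)                    # stand-in for the per-layer box head
decoder_output = torch.randn(num_queries, hidden)   # placeholder decoder features

# The positional query is built directly from the box coordinates.
pos_query = torch.cat([sine_embed(anchors[:, i]) for i in range(4)], dim=-1)  # (300, 256)

# Layer-by-layer refinement: predict a delta and update the anchors.
anchors = (inverse_sigmoid(anchors) + bbox_head(decoder_output)).sigmoid()
```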
### Model Description
This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
- Funded by: IDEA-Research
- Shared by: David Hajdu
- Model type: DAB-DETR
- License: Apache-2.0
## Model Sources
- Repository: https://github.com/IDEA-Research/DAB-DETR
- Paper: https://arxiv.org/abs/2201.12329
## Quick Start
Use the following code to start using the model:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Load a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to (score, label, box) triples, keeping detections above 0.3
results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
This should output:

```
cat: 0.89 [344.17, 20.93, 640.53, 371.3]
cat: 0.88 [16.19, 53.44, 315.77, 469.12]
remote: 0.87 [40.35, 73.28, 175.18, 117.59]
couch: 0.60 [0.1, 0.88, 640.08, 476.5]
remote: 0.55 [333.52, 77.34, 369.16, 191.01]
```
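If a GPU is available, the same pipeline can run on it with a small variation of the snippet above (reusing `image` from that snippet; no APIs beyond standard torch/transformers usage):

```python
import torch
from transformers import AutoModelForObjectDetection, AutoImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5").to(device)

# Move the pixel values to the same device as the model before the forward pass
inputs = image_processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
```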
## Training Details
### Training Data
The DAB-DETR model was trained on COCO 2017 object detection, a dataset consisting of 118k annotated images for training and 5k for validation.
### Training Procedure
Following Deformable DETR and Conditional DETR, we use 300 anchors as queries. We also select 300 predicted boxes and labels with the largest classification logits for evaluation. We use focal loss (Lin et al., 2020) with α = 0.25, γ = 2 for classification. The same loss terms are used in bipartite matching and the final loss calculation, but with different coefficients: the classification loss has a coefficient of 2.0 in bipartite matching and 1.0 in the final loss, while the L1 loss (coefficient 5.0) and the GIoU loss (Rezatofighi et al., 2019; coefficient 2.0) use the same coefficients in both the matching and the final loss calculation. All models are trained on 16 GPUs with 1 image per GPU, and AdamW (Loshchilov & Hutter, 2018) is used for training with weight decay 10⁻⁴. The learning rates for the backbone and the other modules are set to 10⁻⁵ and 10⁻⁴, respectively. We train our models for 50 epochs and reduce the learning rate by a factor of 0.1 after 40 epochs. All models are trained on NVIDIA A100 GPUs. We search hyperparameters with a batch size of 64, and all results in our paper are reported with a batch size of 16.
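The following sketch illustrates the loss setup described above: sigmoid focal loss with the stated α/γ, and the coefficient split between bipartite matching and the final loss. The pairwise cost matrices are random placeholders, not real model outputs.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss (Lin et al., 2020) with the paper's alpha/gamma settings."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Hypothetical pairwise cost matrices between 300 queries and 5 ground-truth boxes
cls_cost = torch.rand(300, 5)
l1_cost = torch.rand(300, 5)
giou_cost = torch.rand(300, 5)

# Matching uses a classification coefficient of 2.0 (the final loss reweights it to 1.0),
# while L1 (5.0) and GIoU (2.0) keep the same coefficients in both.
match_cost = 2.0 * cls_cost + 5.0 * l1_cost + 2.0 * giou_cost
rows, cols = linear_sum_assignment(match_cost.numpy())  # Hungarian (bipartite) matching
```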
### Preprocessing
Images are resized/rescaled such that the shortest side is at least 480 and at most 800 pixels, and the long side is at most 1333 pixels. They are then normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225).
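For reference, the evaluation-time resizing and normalization can be approximated with torchvision; the `AutoImageProcessor` in the Quick Start applies equivalent steps automatically, so this is only a sketch:

```python
import torchvision.transforms as T

# Shortest side to 800 px, long side capped at 1333 px, then ImageNet normalization
transform = T.Compose([
    T.Resize(800, max_size=1333),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```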
### Training Hyperparameters
| Property | Details |
|---|---|
| activation_dropout | 0.0 |
| activation_function | prelu |
| attention_dropout | 0.0 |
| auxiliary_loss | false |
| backbone | resnet50 |
| bbox_cost | 5 |
| bbox_loss_coefficient | 5 |
| class_cost | 2 |
| cls_loss_coefficient | 2 |
| decoder_attention_heads | 8 |
| decoder_ffn_dim | 2048 |
| decoder_layers | 6 |
| dropout | 0.1 |
| encoder_attention_heads | 8 |
| encoder_ffn_dim | 2048 |
| encoder_layers | 6 |
| focal_alpha | 0.25 |
| giou_cost | 2 |
| giou_loss_coefficient | 2 |
| hidden_size | 256 |
| init_std | 0.02 |
| init_xavier_std | 1.0 |
| initializer_bias_prior_prob | null |
| keep_query_pos | false |
| normalize_before | false |
| num_hidden_layers | 6 |
| num_patterns | 0 |
| num_queries | 300 |
| query_dim | 4 |
| random_refpoints_xy | false |
| sine_position_embedding_scale | null |
| temperature_height | 20 |
| temperature_width | 20 |
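These values can be checked against the configuration shipped with the checkpoint:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")
print(config.num_queries)  # 300
print(config.hidden_size)  # 256
print(config.query_dim)    # 4, i.e. (x, y, w, h) anchor boxes
```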
## Evaluation

## Model Architecture and Objective

Overview of DAB-DETR. We extract image spatial features using a CNN backbone followed by Transformer encoders to refine the CNN features. Then dual queries, including positional queries (anchor boxes) and content queries (decoder embeddings), are fed into the decoder to detect the objects corresponding to the anchors and having similar patterns with the content queries. The dual queries are updated layer by layer to gradually approach the target ground-truth objects. The outputs of the final decoder layer are used to predict the objects with labels and boxes by prediction heads, and then a bipartite graph matching is performed to calculate the loss as in DETR.
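As a rough illustration of the dual-query idea described above (shapes and modules here are placeholder assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

hidden, num_queries, num_tokens = 256, 300, 950
content_query = torch.randn(num_queries, 1, hidden)  # learned decoder embeddings
pos_query = torch.randn(num_queries, 1, hidden)      # derived from the anchor boxes
memory = torch.randn(num_tokens, 1, hidden)          # encoder-refined CNN features

cross_attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=8)
# Dual queries: the content part attends by feature similarity,
# the positional part steers attention toward the anchor's location.
out, _ = cross_attn(query=content_query + pos_query, key=memory, value=memory)
```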
## Citation
BibTeX:
```bibtex
@inproceedings{
liu2022dabdetr,
title={{DAB}-{DETR}: Dynamic Anchor Boxes are Better Queries for {DETR}},
author={Shilong Liu and Feng Li and Hao Zhang and Xiao Yang and Xianbiao Qi and Hang Su and Jun Zhu and Lei Zhang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=oMI9PjOb9Jl}
}
```