# Model Card for DAB-DETR
This model card provides a comprehensive overview of a 🤗 Transformers model for object detection. It details the model's architecture, training process, evaluation results, and more.
## Table of Contents
- Model Details
- Model Sources
- Quick Start
- Training Details
- Evaluation
- Model Architecture and Objective
- Citation
## Model Details

In this paper, we introduce a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new approach directly uses box coordinates as queries in Transformer decoders and updates them dynamically layer by layer. Using box coordinates not only leverages explicit positional priors to enhance the query-to-feature similarity and resolve the slow training convergence issue in DETR but also enables us to modulate the positional attention map using box width and height information. This design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer by layer in a cascade manner. As a result, it achieves the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting, e.g., AP 45.7% using ResNet50-DC5 as the backbone trained in 50 epochs. We also conducted extensive experiments to validate our analysis and confirm the effectiveness of our methods.
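To make the core idea concrete, here is a minimal, illustrative sketch of anchor boxes acting as positional queries and being refined layer by layer. It is not the actual implementation: `sine_embed`, `inverse_sigmoid`, and `bbox_head` are hypothetical stand-ins, and the decoder features are random placeholders.

```python
import torch
import torch.nn as nn

def sine_embed(coord, num_feats=64, temperature=20):
    # Hypothetical sinusoidal embedding of one normalized coordinate.
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)
    pos = (coord[..., None] * 2 * torch.pi) / dim_t
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-2)

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

num_queries, hidden = 300, 256
anchors = torch.rand(num_queries, 4)                # (x, y, w, h), normalized to [0, 1]
bbox_head = nn.Linear(hidden, 4)                    # stand-in for the per-layer box head
decoder_output = torch.randn(num_queries, hidden)   # placeholder decoder features

# The positional query is built directly from the box coordinates.
pos_query = torch.cat([sine_embed(anchors[:, i]) for i in range(4)], dim=-1)  # (300, 256)

# Layer-by-layer refinement: predict a delta and update the anchors.
anchors = (inverse_sigmoid(anchors) + bbox_head(decoder_output)).sigmoid()
```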
### Model Description
This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card has been automatically generated.
- Developed by: Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
- Funded by: IDEA-Research
- Shared by: David Hajdu
- Model type: DAB-DETR
- License: Apache-2.0
## Model Sources
- Repository: https://github.com/IDEA-Research/DAB-DETR
- Paper: https://arxiv.org/abs/2201.12329
## Quick Start
Use the following code to start using the model:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Load a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to (score, label, box) triples, keeping detections above 0.3
results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
This should output:

```
cat: 0.89 [344.17, 20.93, 640.53, 371.3]
cat: 0.88 [16.19, 53.44, 315.77, 469.12]
remote: 0.87 [40.35, 73.28, 175.18, 117.59]
couch: 0.60 [0.1, 0.88, 640.08, 476.5]
remote: 0.55 [333.52, 77.34, 369.16, 191.01]
```
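If a GPU is available, the same pipeline can run on it with a small variation of the snippet above (reusing `image` from that snippet; no APIs beyond standard torch/transformers usage):

```python
import torch
from transformers import AutoModelForObjectDetection, AutoImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5").to(device)

# Move the pixel values to the same device as the model before the forward pass
inputs = image_processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
```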
## Training Details
### Training Data
The DAB-DETR model was trained on COCO 2017 object detection, a dataset consisting of 118k annotated images for training and 5k for validation.
### Training Procedure
Following Deformable DETR and Conditional DETR, we use 300 anchors as queries. We also select 300 predicted boxes and labels with the largest classification logits for evaluation. We use focal loss (Lin et al., 2020) with α = 0.25, γ = 2 for classification. The same loss terms are used in bipartite matching and the final loss calculation, but with different coefficients: the classification loss has a coefficient of 2.0 in bipartite matching and 1.0 in the final loss, while the L1 loss (coefficient 5.0) and the GIoU loss (Rezatofighi et al., 2019; coefficient 2.0) use the same coefficients in both the matching and the final loss calculation. All models are trained on 16 GPUs with 1 image per GPU, and AdamW (Loshchilov & Hutter, 2018) is used for training with weight decay 10⁻⁴. The learning rates for the backbone and the other modules are set to 10⁻⁵ and 10⁻⁴, respectively. We train our models for 50 epochs and reduce the learning rate by a factor of 0.1 after 40 epochs. All models are trained on NVIDIA A100 GPUs. We search hyperparameters with a batch size of 64, and all results in our paper are reported with a batch size of 16.
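The following sketch illustrates the loss setup described above: sigmoid focal loss with the stated α/γ, and the coefficient split between bipartite matching and the final loss. The pairwise cost matrices are random placeholders, not real model outputs.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss (Lin et al., 2020) with the paper's alpha/gamma settings."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Hypothetical pairwise cost matrices between 300 queries and 5 ground-truth boxes
cls_cost = torch.rand(300, 5)
l1_cost = torch.rand(300, 5)
giou_cost = torch.rand(300, 5)

# Matching uses a classification coefficient of 2.0 (the final loss reweights it to 1.0),
# while L1 (5.0) and GIoU (2.0) keep the same coefficients in both.
match_cost = 2.0 * cls_cost + 5.0 * l1_cost + 2.0 * giou_cost
rows, cols = linear_sum_assignment(match_cost.numpy())  # Hungarian (bipartite) matching
```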
### Preprocessing
Images are resized/rescaled such that the shortest side is at least 480 and at most 800 pixels, and the long side is at most 1333 pixels. They are then normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225).
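For reference, the evaluation-time resizing and normalization can be approximated with torchvision; the `AutoImageProcessor` in the Quick Start applies equivalent steps automatically, so this is only a sketch:

```python
import torchvision.transforms as T

# Shortest side to 800 px, long side capped at 1333 px, then ImageNet normalization
transform = T.Compose([
    T.Resize(800, max_size=1333),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```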
### Training Hyperparameters
| Property | Details |
|---|---|
| activation_dropout | 0.0 |
| activation_function | prelu |
| attention_dropout | 0.0 |
| auxiliary_loss | false |
| backbone | resnet50 |
| bbox_cost | 5 |
| bbox_loss_coefficient | 5 |
| class_cost | 2 |
| cls_loss_coefficient | 2 |
| decoder_attention_heads | 8 |
| decoder_ffn_dim | 2048 |
| decoder_layers | 6 |
| dropout | 0.1 |
| encoder_attention_heads | 8 |
| encoder_ffn_dim | 2048 |
| encoder_layers | 6 |
| focal_alpha | 0.25 |
| giou_cost | 2 |
| giou_loss_coefficient | 2 |
| hidden_size | 256 |
| init_std | 0.02 |
| init_xavier_std | 1.0 |
| initializer_bias_prior_prob | null |
| keep_query_pos | false |
| normalize_before | false |
| num_hidden_layers | 6 |
| num_patterns | 0 |
| num_queries | 300 |
| query_dim | 4 |
| random_refpoints_xy | false |
| sine_position_embedding_scale | null |
| temperature_height | 20 |
| temperature_width | 20 |
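These values can be checked against the configuration shipped with the checkpoint:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("IDEA-Research/dab-detr-resnet-50-dc5")
print(config.num_queries)  # 300
print(config.hidden_size)  # 256
print(config.query_dim)    # 4, i.e. (x, y, w, h) anchor boxes
```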
## Evaluation

## Model Architecture and Objective

Overview of DAB-DETR. We extract image spatial features using a CNN backbone followed by Transformer encoders to refine the CNN features. Then dual queries, including positional queries (anchor boxes) and content queries (decoder embeddings), are fed into the decoder to detect the objects corresponding to the anchors and having similar patterns with the content queries. The dual queries are updated layer by layer to gradually approach the target ground-truth objects. The outputs of the final decoder layer are used to predict the objects with labels and boxes by prediction heads, and then a bipartite graph matching is performed to calculate the loss as in DETR.
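As a rough illustration of the dual-query idea described above (shapes and modules here are placeholder assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

hidden, num_queries, num_tokens = 256, 300, 950
content_query = torch.randn(num_queries, 1, hidden)  # learned decoder embeddings
pos_query = torch.randn(num_queries, 1, hidden)      # derived from the anchor boxes
memory = torch.randn(num_tokens, 1, hidden)          # encoder-refined CNN features

cross_attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=8)
# Dual queries: the content part attends by feature similarity,
# the positional part steers attention toward the anchor's location.
out, _ = cross_attn(query=content_query + pos_query, key=memory, value=memory)
```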
## Citation
BibTeX:
```bibtex
@inproceedings{
liu2022dabdetr,
title={{DAB}-{DETR}: Dynamic Anchor Boxes are Better Queries for {DETR}},
author={Shilong Liu and Feng Li and Hao Zhang and Xiao Yang and Xianbiao Qi and Hang Su and Jun Zhu and Lei Zhang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=oMI9PjOb9Jl}
}
```