dab-detr-resnet-50: Open-Source Object Detection Model - Dynamic Anchor Boxes for Faster Convergence and Higher Detection Accuracy

Dab Detr Resnet 50

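Developed by IDEA-Research

DAB-DETR is an improved DETR object detection model whose dynamic anchor box queries speed up training convergence and improve detection accuracy.

Object detection

Transformers

Language: English | License: Apache-2.0 | #dynamic-anchor-box-queries #object-detection-optimization #transformer-architecture

Downloads: 1,590

Released: 5/29/2024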

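Model overview

A Transformer-based object detection model that uses dynamic anchor boxes as queries to address the slow training convergence of the original DETR.

Model features

Dynamic anchor box queries

Box coordinates are used directly as queries in the Transformer decoder and updated layer by layer, which speeds up training convergence.

Explicit positional priors

Box coordinates supply explicit positional priors that improve the similarity matching between queries and features.

High detection performance

The paper reports 45.7% AP on the MS-COCO benchmark with a ResNet50-DC5 backbone trained for 50 epochs.

Model capabilities

Multi-object detection

Complex scene recognition

Real-time object localization

Use cases

Intelligent surveillance

Video surveillance analysis

Detects multiple classes of objects in surveillance footage in real time

Accurately identifies targets such as people and vehicles

Autonomous driving

Road scene understanding

Detects vehicles, pedestrians, traffic signs, and other objects on the road

Provides environment perception for autonomous driving systems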

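🚀 Model card for DAB-DETR (ResNet-50)

This model card describes a 🤗 Transformers model for object detection. It is built on a novel query formulation based on dynamic anchor boxes, which addresses DETR's slow training convergence and performs well on the MS-COCO benchmark.

🚀 Quick start

Use the following code to get started with the model: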

import torch
import requests

from PIL import Image
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Download a sample image from the COCO 2017 validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the pretrained model from the Hub
image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")

# Preprocess the image and return PyTorch tensors
inputs = image_processor(images=image, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# image.size is (width, height); target_sizes expects (height, width)
results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

# Print the label, confidence score, and box of every kept detection
for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")

This code produces the following output:

cat: 0.87 [14.7, 49.39, 320.52, 469.28]
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
cat: 0.86 [344.45, 19.43, 639.85, 367.86]
remote: 0.61 [334.27, 75.93, 367.92, 188.81]
couch: 0.59 [-0.04, 1.34, 639.9, 477.09]
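
The boxes printed above appear to be absolute pixel coordinates in (x_min, y_min, x_max, y_max) order, and a prediction can fall slightly outside the image (the couch box starts at x = -0.04). As a minimal sketch, assuming that corner format and a 640 x 480 sample image, the helper below clamps a box to the image and converts it to COCO-style (x, y, width, height). The function name `to_coco_xywh` is hypothetical and is not part of 🤗 Transformers.

```python
def to_coco_xywh(box, image_width, image_height):
    """Clamp an (x_min, y_min, x_max, y_max) box to the image and return (x, y, w, h)."""
    x_min, y_min, x_max, y_max = box
    # Clamp each corner to the valid pixel range of the image.
    x_min = min(max(x_min, 0.0), image_width)
    y_min = min(max(y_min, 0.0), image_height)
    x_max = min(max(x_max, 0.0), image_width)
    y_max = min(max(y_max, 0.0), image_height)
    # COCO annotations store the top-left corner plus width and height.
    return [round(x_min, 2), round(y_min, 2),
            round(x_max - x_min, 2), round(y_max - y_min, 2)]


# The "couch" detection from the output above; the image size is an assumption.
print(to_coco_xywh([-0.04, 1.34, 639.9, 477.09], image_width=640, image_height=480))
```

Clamping before conversion keeps the width and height consistent with the visible part of the box.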

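✨ Key features

Proposes a novel query formulation for DETR (DEtection TRansformer) using dynamic anchor boxes, and offers a deeper understanding of the role of queries in DETR.
Uses box coordinates directly as queries in the Transformer decoder and updates them dynamically layer by layer.
Box coordinates provide explicit positional priors that improve query-to-feature similarity and eliminate DETR's slow training convergence.
Allows the positional attention map to be modulated with the box width and height.
Achieves the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting.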

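📚 Detailed documentation

Model details

We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer by layer. Using box coordinates not only helps to use explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer by layer in a cascade manner. As a result, it leads to the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting, e.g., an AP of 45.7% using ResNet50-DC5 as the backbone trained for 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our methods.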
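
To make the idea of using box coordinates as queries more concrete, here is a minimal sketch of turning one anchor box (x, y, w, h) into a positional vector by concatenating a sinusoidal embedding of each coordinate. The embedding size (128 per coordinate), the temperature (10000), and the function names are assumptions chosen for illustration; the released model may use different values and passes the result through further learned layers that are omitted here.

```python
import math


def sine_embed(value, dim=128, temperature=10000.0):
    """Sinusoidal embedding of one normalized coordinate in [0, 1]."""
    embedding = []
    for i in range(dim):
        # Each sin/cos pair shares one divisor; divisors grow geometrically.
        divisor = temperature ** (2 * (i // 2) / dim)
        angle = 2 * math.pi * value / divisor
        embedding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return embedding


def anchor_embed(anchor, dim=128, temperature=10000.0):
    """Concatenate the embeddings of x, y, w, and h into one positional vector."""
    return [v for coord in anchor for v in sine_embed(coord, dim, temperature)]


anchor = (0.5, 0.5, 0.2, 0.3)  # hypothetical normalized (x, y, w, h)
positional_query = anchor_embed(anchor)
print(len(positional_query))  # 4 coordinates x 128 dimensions = 512
```

Because the width and height are embedded alongside the center, the query carries scale information as well as location.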

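Model description

This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card was generated automatically.

Property	Details
Developed by	Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
Funded by	IDEA-Research
Shared by	David Hajdu
Model type	DAB-DETR
License	Apache-2.0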

Model sources

Repository: https://github.com/IDEA-Research/DAB-DETR
Paper: https://arxiv.org/abs/2201.12329

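Training details

Training data

The DAB-DETR model was trained on the COCO 2017 object detection dataset, which contains 118k training images and 5k validation images, respectively.

Training procedure

Following Deformable DETR and Conditional DETR, we use 300 anchor boxes as queries. For evaluation, we also select the 300 predicted boxes and labels with the largest classification logits. Classification uses focal loss (Lin et al., 2020) with α = 0.25 and γ = 2. The same loss terms are used in bipartite matching and in the final loss calculation, but with different coefficients: the classification loss has a coefficient of 2.0 in bipartite matching and 1.0 in the final loss, while the L1 loss (coefficient 5.0) and the GIoU loss (coefficient 2.0; Rezatofighi et al., 2019) keep the same coefficients in both. All models are trained on 16 GPUs with 1 image per GPU, using AdamW (Loshchilov & Hutter, 2018) with a weight decay of 10^-4. The learning rates for the backbone and for the other modules are set to 10^-5 and 10^-4, respectively. We train the models for 50 epochs and multiply the learning rate by 0.1 after 40 epochs. All models are trained on Nvidia A100 GPUs. Hyperparameters were searched with a batch size of 64, and all results in the paper are reported with a batch size of 16.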
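
Two of the settings above are easy to illustrate in isolation: the focal loss with α = 0.25 and γ = 2, and the step learning-rate schedule. The sketch below is a scalar, pure-Python illustration rather than the training code; the function names are hypothetical, and it assumes that epochs are counted from 0 so that the drop takes effect at epoch index 40.

```python
import math


def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability and a 0/1 target."""
    # p_t is the probability the model assigned to the true class.
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def learning_rate(epoch, base_lr, drop_epoch=40, factor=0.1):
    """Step schedule: use base_lr, then base_lr * factor from drop_epoch onward."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


# A confident correct prediction contributes far less loss than a poor one.
print(focal_loss(0.9, 1), focal_loss(0.1, 1))

# Learning rate of the non-backbone modules just before and after the drop.
print(learning_rate(39, 1e-4), learning_rate(40, 1e-4))
```

With γ = 2, a positive predicted at 0.9 is scaled by (0.1)^2, which keeps the many easy examples from dominating the classification loss.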

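Preprocessing

Images are resized/rescaled so that the shortest side is at least 480 and at most 800 pixels and the longest side is at most 1333 pixels. They are then normalized across the RGB channels with the ImageNet mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225).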
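
As a rough illustration of these preprocessing rules, the sketch below computes a target size and normalizes one pixel. It assumes a fixed target of 800 pixels for the shortest side (the upper end of the range above) and simple rounding; the function names are hypothetical, and the actual image processor may round sizes differently.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_size(height, width, shortest_edge=800, longest_edge=1333):
    """Scale so the shortest side hits shortest_edge unless the longest side would exceed longest_edge."""
    scale = shortest_edge / min(height, width)
    if max(height, width) * scale > longest_edge:
        # The longest side would be too large, so it becomes the limiting side.
        scale = longest_edge / max(height, width)
    return round(height * scale), round(width * scale)


def normalize_pixel(rgb):
    """Map 0-255 RGB values to [0, 1], then standardize each channel."""
    return [(c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


print(resized_size(480, 640))    # shortest side is the limiting side
print(resized_size(1000, 3000))  # very wide image: longest side is the limiting side
print(normalize_pixel((255, 255, 255)))
```

Both sides use the same scale factor, so the aspect ratio of the image is preserved.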

Training hyperparameters

Parameter	Value
activation_dropout	`0.0`
activation_function	`prelu`
attention_dropout	`0.0`
auxiliary_loss	`false`
backbone	`resnet50`
bbox_cost	`5`
bbox_loss_coefficient	`5`
class_cost	`2`
cls_loss_coefficient	`2`
decoder_attention_heads	`8`
decoder_ffn_dim	`2048`
decoder_layers	`6`
dropout	`0.1`
encoder_attention_heads	`8`
encoder_ffn_dim	`2048`
encoder_layers	`6`
focal_alpha	`0.25`
giou_cost	`2`
giou_loss_coefficient	`2`
hidden_size	`256`
init_std	`0.02`
init_xavier_std	`1.0`
initializer_bias_prior_prob	`null`
keep_query_pos	`false`
normalize_before	`false`
num_hidden_layers	`6`
num_patterns	`0`
num_queries	`300`
query_dim	`4`
random_refpoints_xy	`false`
sine_position_embedding_scale	`null`
temperature_height	`20`
temperature_width	`20`

Evaluation

[Figure: evaluation results; the image is not reproduced here]

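Model architecture and objective

Overview of DAB-DETR. We extract image spatial features with a CNN backbone and then refine the CNN features with Transformer encoders. Dual queries, consisting of positional queries (anchor boxes) and content queries (decoder embeddings), are then fed into the decoder to probe the objects that correspond to the anchors and have patterns similar to the content queries. The dual queries are updated layer by layer so that they move gradually closer to the target ground-truth objects. The outputs of the final decoder layer are used by prediction heads to predict objects with labels and boxes, and bipartite graph matching is then conducted to compute the loss, as in DETR.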
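
The layer-by-layer update of the anchor boxes can be pictured with a small numeric sketch. It assumes the update rule common to DETR-style box refinement, in which each layer predicts an offset that is added to the anchor in logit (inverse-sigmoid) space and mapped back with a sigmoid so the box stays normalized. That rule, the function names, and the offset values are illustrative assumptions; in the model the offsets come from a learned prediction head.

```python
import math


def inverse_sigmoid(x, eps=1e-5):
    """Logit of x, clamped away from 0 and 1 for numerical stability."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def refine_anchor(anchor, delta):
    """Shift a normalized (x, y, w, h) anchor by per-coordinate offsets in logit space."""
    return [sigmoid(inverse_sigmoid(a) + d) for a, d in zip(anchor, delta)]


anchor = [0.5, 0.5, 0.2, 0.2]  # hypothetical initial anchor
# Hypothetical offsets for three decoder layers, shrinking as the box settles.
layer_deltas = [[0.4, -0.2, 0.3, 0.1], [0.1, -0.1, 0.1, 0.0], [0.0, 0.0, 0.05, 0.0]]

for layer, delta in enumerate(layer_deltas, start=1):
    anchor = refine_anchor(anchor, delta)
    print(layer, [round(v, 3) for v in anchor])
```

Working in logit space means any offset, however large, still yields coordinates strictly between 0 and 1.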

📄 License

This model is released under the Apache-2.0 license.

📚 Citation

BibTeX:

@inproceedings{
  liu2022dabdetr,
  title={{DAB}-{DETR}: Dynamic Anchor Boxes are Better Queries for {DETR}},
  author={Shilong Liu and Feng Li and Hao Zhang and Xiao Yang and Xianbiao Qi and Hang Su and Jun Zhu and Lei Zhang},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=oMI9PjOb9Jl}
}