
End-to-End Object Detection with Transformers

DETR (DEtection TRansformer) is an end-to-end object detection model that combines a set-based global loss, which forces unique predictions via bipartite matching, with a transformer encoder-decoder architecture.

Background

The goal of object detection is to:

  • predict a set of bounding boxes
  • predict category labels for each object of interest

The methods used are:

  • Proposals: generate sets of candidate regions, then use techniques like non-maximum suppression (NMS) to keep only the most probable region (see the sketch after this list).
  • Anchors: used as reference points; the network predicts positional adjustments for each anchor relative to the true object location.
  • Window centers: anchor-free methods that predict objects directly from feature-map locations.
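
For context, here is a minimal sketch of the kind of NMS post-processing that proposal- and anchor-based detectors rely on, using torchvision's built-in `nms` on illustrative tensors (this is not DETR code):

```python
import torch
from torchvision.ops import nms

# Three candidate boxes (x1, y1, x2, y2); the first two overlap heavily.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],
                      [200., 200., 260., 260.]])
scores = torch.tensor([0.9, 0.8, 0.7])

# Suppress any box that overlaps a higher-scoring box with IoU > 0.5.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the duplicate candidate is removed
```

DETR removes this hand-designed step entirely by predicting a set directly.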

These earlier methods are strongly constrained by their post-processing steps. The paper proposes an end-to-end direct set prediction method instead.

Structure

DETR Structure

The matching step is realized through a bipartite matching algorithm. The new model requires an extra-long training schedule and benefits from auxiliary decoding losses in the transformer.

Object detection set prediction loss

To find a bipartite matching between these two sets we search for a permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest cost:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} \mathcal{L}_{\text{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)$$

where $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost between the ground truth $y_i$ and the prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the Hungarian algorithm. The paper defines the matching cost as $-\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$.
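
A minimal sketch of this matching step, assuming class probabilities and boxes as inputs and using SciPy's `linear_sum_assignment` as the Hungarian solver (the function and argument names here are illustrative, not DETR's actual API):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, tgt_labels, tgt_boxes, box_cost_fn):
    """pred_probs: (N, num_classes+1), pred_boxes: (N, 4),
    tgt_labels: (M,), tgt_boxes: (M, 4), with M <= N."""
    # Class term: -p_hat_{sigma(i)}(c_i); a confident correct class lowers the cost.
    cost_class = -pred_probs[:, tgt_labels]               # (N, M)
    # Box term: L_box(b_i, b_hat_{sigma(i)}), e.g. an L1 + generalized IoU combination.
    cost_box = box_cost_fn(pred_boxes, tgt_boxes)         # (N, M)
    cost = (cost_class + cost_box).detach().cpu().numpy()
    pred_idx, tgt_idx = linear_sum_assignment(cost)       # optimal one-to-one assignment
    return pred_idx, tgt_idx
```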

Then we compute the Hungarian loss, defined as:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]$$

While such an approach simplifies the implementation, using the $\ell_1$ loss alone poses an issue with the relative scaling of the loss: it penalizes small and large boxes differently even when their relative errors are similar. To mitigate this, $\mathcal{L}_{\text{box}}$ is a linear combination of the $\ell_1$ loss and the generalized IoU loss.
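
A sketch of how these pieces could be assembled, assuming boxes in normalized (cx, cy, w, h) format, an extra "no object" class as the last logit, and illustrative loss weights; `torchvision.ops` provides the generalized IoU, and the matched indices come from the matcher sketched above:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_boxes, tgt_boxes, w_l1=5.0, w_giou=2.0):
    """L_box as a weighted sum of L1 and generalized IoU (weights are illustrative)."""
    l1 = F.l1_loss(pred_boxes, tgt_boxes, reduction="none").sum(-1)
    giou = generalized_box_iou(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(tgt_boxes, "cxcywh", "xyxy"),
    ).diagonal()
    return w_l1 * l1 + w_giou * (1.0 - giou)

def hungarian_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes, pred_idx, tgt_idx):
    """Hungarian loss for one image, given the matching indices from the sketch above."""
    pred_idx, tgt_idx = torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)
    no_object = pred_logits.shape[-1] - 1                 # last class = "no object"
    target_classes = torch.full((pred_logits.shape[0],), no_object, dtype=torch.long)
    target_classes[pred_idx] = tgt_labels[tgt_idx]
    # -log p_hat for every slot; unmatched slots are supervised towards "no object".
    # (DETR additionally down-weights the no-object class; omitted here.)
    loss_class = F.cross_entropy(pred_logits, target_classes)
    loss_boxes = box_loss(pred_boxes[pred_idx], tgt_boxes[tgt_idx]).mean()
    return loss_class + loss_boxes
```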

DETR Architecture

Backbone

A conventional CNN backbone produces a lower-resolution activation map with $C = 2048$, $H = \frac{H_0}{32}$, and $W = \frac{W_0}{32}$, where $H_0 \times W_0$ is the size of the input image.
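
A minimal sketch, assuming a ResNet-50 backbone as in the paper (the average pool and classification head of torchvision's model are dropped to keep only the convolutional trunk):

```python
import torch
import torchvision

# ResNet-50 trunk without the average pool and classification head.
backbone = torch.nn.Sequential(*list(torchvision.models.resnet50().children())[:-2])

x = torch.randn(1, 3, 800, 1216)   # an H0 x W0 input image
features = backbone(x)
print(features.shape)              # torch.Size([1, 2048, 25, 38]) = (1, C, H0/32, W0/32)
```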

Transformer Encoder

  1. A 1×1 convolution reduces the channel dimension from $C$ to $d$.
  2. Collapse the spatial dimensions of $z_0$ into one dimension, resulting in a $d \times HW$ feature map.
  3. Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN).
  4. Supplement the input of each attention layer with fixed positional encodings (see the sketch below).
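
A sketch of these steps with PyTorch's generic transformer modules, assuming $d = 256$ and the backbone output shape from the previous snippet (DETR adds the positional encodings at every attention layer; here they are simply added to the input once for brevity):

```python
import torch

d, C, H, W = 256, 2048, 25, 38
features = torch.randn(1, C, H, W)                  # backbone output

proj = torch.nn.Conv2d(C, d, kernel_size=1)         # step 1: 1x1 conv, C -> d
z0 = proj(features)                                 # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)                # step 2: d x HW map as an (HW, 1, d) sequence
pos = torch.zeros_like(src)                         # stand-in for the fixed sine positional encodings

encoder_layer = torch.nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=6)  # steps 3-4
memory = encoder(src + pos)                         # (HW, 1, d) encoded image features
```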

Transformer Decoder

The decoder follows the standard architecture of the transformer, transforming $N$ embeddings of size $d$ using multi-headed self- and encoder-decoder attention mechanisms.

The difference is that DETR decodes the $N$ objects in parallel at each decoder layer, rather than autoregressively one element at a time.
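
A sketch of this parallel decoding with $N$ learned object queries, again using PyTorch's generic modules (DETR feeds zero inputs and adds the query embeddings at every attention layer; here they are used directly as the decoder input, and the box head is reduced to a single linear layer):

```python
import torch

d, N, HW, num_classes = 256, 100, 950, 91
memory = torch.randn(HW, 1, d)                       # encoder output

queries = torch.nn.Embedding(N, d)                   # N learned object queries
tgt = queries.weight.unsqueeze(1)                    # (N, 1, d)

decoder_layer = torch.nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=6)
hs = decoder(tgt, memory)                            # no causal mask: all N objects decoded in parallel

class_head = torch.nn.Linear(d, num_classes + 1)     # +1 for the "no object" class
box_head = torch.nn.Linear(d, 4)
logits, boxes = class_head(hs), box_head(hs).sigmoid()   # (N, 1, num_classes+1), (N, 1, 4)
```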

Extensions

DETR is straightforward to implement and has a flexible architecture that is easily extensible to panoptic segmentation, with competitive results. In addition, it achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the processing of global information performed by the self-attention.

Reference

  1. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers (arXiv:2005.12872). arXiv. http://arxiv.org/abs/2005.12872
  2. DETR 论文精读 [DETR paper close reading, from the 论文精读 paper-reading video series]