Recent Advances in Deep Learning for Object Detection - Part 1


When it comes to training a custom object detection model, the TensorFlow Object Detection API and MMDetection (PyTorch) are two readily available options. I have previously shown how to use both, even on Google Colab's free GPU resources.

Those two frameworks are easy to use: each offers a simple configuration interface and lets the framework source code do the heavy lifting. But have you ever wondered how deep learning object detection algorithms have evolved over the years, and what their pros and cons are?

I find the paper Recent Advances in Deep Learning for Object Detection a really good answer to this question. Let me summarize what I have learned and, hopefully, explain it in a more intuitive way.


Detection Paradigms

Two-stage Detectors


R-CNN: The whole detection framework could not be optimized in an end-to-end manner, making it difficult to obtain a globally optimal solution.

SPP-net: Introduced the Spatial Pyramid Pooling (SPP) layer, which achieved better results and significantly faster inference. However, the training of SPP-net was still multi-stage, so it could not be optimized end-to-end. The SPP layer also did not back-propagate gradients to the convolutional kernels, so all the parameters before the SPP layer were frozen, which limited the learning capability of deep backbone architectures.

Fast R-CNN: Introduced the ROI pooling layer. The feature extraction, region classification, and bounding box regression steps could all be optimized end-to-end, without extra cache space to store features. However, the proposal generation step still relied on traditional methods such as Selective Search or Edge Boxes.

Faster R-CNN: Introduced the Region Proposal Network (RPN), so the whole framework could be optimized in an end-to-end manner on training data. However, the computation was not shared in the region classification step, where each feature vector still needed to go through a sequence of FC layers separately. It also used a single deep-layer feature map to make the final prediction, which made it difficult to detect objects at different scales.

Region-based Fully Convolutional Networks (R-FCN): Used position-sensitive score maps. The detector achieved competitive results compared to Faster R-CNN without region-wise fully connected layer operations.

Feature Pyramid Networks (FPN): Enabled object detection on feature maps at different scales.
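The ROI pooling layer mentioned for Fast R-CNN turns region proposals of different sizes into features of one fixed size, which is what lets the later classification and regression layers be shared across proposals. Below is a minimal pure-Python sketch of that idea, not the Fast R-CNN implementation. The function name roi_max_pool, the inclusive (x0, y0, x1, y1) box format in feature-map coordinates, and the single-channel list-of-lists feature map are all assumptions made for illustration.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map into an out_h x out_w grid.

    feature_map: list of rows (single channel, hypothetical toy input).
    roi: (x0, y0, x1, y1), inclusive, in feature-map coordinates (assumed format).
    """
    x0, y0, x1, y1 = roi
    roi_h = y1 - y0 + 1
    roi_w = x1 - x0 + 1
    pooled = []
    for i in range(out_h):
        # Row range covered by output bin i (floor/ceil keeps every bin non-empty).
        r_start = y0 + math.floor(i * roi_h / out_h)
        r_end = y0 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            # Column range covered by output bin j.
            c_start = x0 + math.floor(j * roi_w / out_w)
            c_end = x0 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map holding the values 1..24, row by row.
fm = [[6 * r + c + 1 for c in range(6)] for r in range(4)]

# Two proposals of different sizes both come out as 2x2 grids.
print(roi_max_pool(fm, (1, 0, 4, 3), 2, 2))  # [[9, 11], [21, 23]]
print(roi_max_pool(fm, (0, 0, 5, 2), 2, 2))  # [[9, 12], [15, 18]]
```

Real detectors apply this per channel on tensors and map proposal boxes from image coordinates to feature-map coordinates first; both steps are left out here to keep the sketch short.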

One-stage Detectors


OverFeat: In order to detect multi-scale objects, the input image was resized into multiple scales, which were fed into the network. Finally, the predictions across all the scales were merged together. However, the classifiers and regressors were trained separately without being jointly optimized.

YOLO: Divided the whole image into a fixed number of grid cells. YOLO faced some challenges: (i) it could detect at most two objects at a given location, which made it difficult to detect small objects and crowded objects; (ii) only the last feature map was used for prediction, which was not suitable for predicting objects at multiple scales and aspect ratios.

Single-Shot MultiBox Detector (SSD): A set of anchors with multiple scales and aspect ratios was generated to discretize the output space of bounding boxes. Objects were predicted on multiple feature maps (multiple scales), and hard negative mining was used. However, the class imbalance between foreground and background is a severe problem in one-stage detectors.

RetinaNet: Introduced focal loss, which suppressed the gradients of easy negative samples instead of simply discarding them. It used feature pyramid networks to detect multi-scale objects at different levels of feature maps.

YOLOv2: Adopted a more powerful deep convolutional backbone. YOLOv2 defined better anchor priors by k-means clustering from the training data (instead of setting them manually) and used multi-scale training techniques.

CornerNet: An anchor-free detector. The goal was to predict key points of the bounding box, instead of trying to fit an object to an anchor.
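To make the focal loss idea from RetinaNet more concrete, here is a small pure-Python sketch for a single binary prediction. It follows the commonly cited form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function names, the scalar interface, and the default values gamma=2.0 and alpha=0.25 are assumptions for illustration, not code from the paper or from any framework.

```python
import math

EPS = 1e-12  # guard against log(0); an assumption of this sketch


def binary_cross_entropy(p, y):
    """Plain cross-entropy for one prediction p (foreground probability) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, EPS))


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by (1 - p_t)^gamma so that easy samples contribute little."""
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, EPS))


# An easy background sample: the model already says "almost surely background".
easy_ce, easy_fl = binary_cross_entropy(0.01, 0), focal_loss(0.01, 0)
# A hard background sample: the model wrongly says "probably foreground".
hard_ce, hard_fl = binary_cross_entropy(0.9, 0), focal_loss(0.9, 0)

print(f"easy negative: CE={easy_ce:.6f}  FL={easy_fl:.8f}")
print(f"hard negative: CE={hard_ce:.6f}  FL={hard_fl:.6f}")
```

The easy negative keeps only a tiny fraction of its cross-entropy loss, while the hard negative keeps most of it. That is how the many easy background anchors are down-weighted rather than thrown away, as hard negative mining would do.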

Backbone Architecture

VGG16: Increasing depth led to better performance, but also led to optimization challenges.

ResNet: Reduced optimization difficulties by introducing shortcut connections. However, with shortcut connections it did not fully utilize features from previous layers.

ResNet-v2: Showed that an appropriate ordering of batch normalization could perform better than the original ResNet, making it possible to successfully train a network with more than 1000 layers.

DenseNet: Retained the shallow-layer features and improved information flow by concatenating the input with the residual output instead of using element-wise addition.

Dual Path Network (DPN): Integrated the advantages of both ResNet and DenseNet. It divides the channels into two parts, one used for dense connection computation and one for element-wise summation; the result is the concatenated output of the two branches.

ResNeXt: Considerably reduced computation and memory cost while maintaining comparable classification accuracy. It adopted group convolution layers, which sparsely connect feature map channels to reduce computation cost.

MobileNet: Significantly reduced computation cost as well as the number of parameters without significant loss in classification accuracy.

GoogleNet: Increased model width to improve the learning capacity, applying convolution kernels of different scales (1 × 1, 3 × 3 and 5 × 5) to the same feature map in a given layer.

DetNet: Kept high-resolution feature maps for prediction, with dilated convolutions to increase receptive fields. Detected objects on multi-scale feature maps.

Hourglass Network: First appeared in the human pose recognition task. It first downsampled the input image via a sequence of convolutional or pooling layers, then upsampled the feature map via deconvolutional operations. To avoid information loss in the downsampling stage, skip connections were used between downsampling and upsampling features.

Conclusion and further reading

This quick post summarized recent advances in deep learning object detection in three areas: two-stage detectors, one-stage detectors, and backbone architectures. Next time you train a custom object detection model with a third-party open-source framework, you will feel more confident selecting the best option for your application by examining the pros and cons of each.

In the next post, I will cover what remains of the paper, namely proposal generation, feature representation learning, and learning strategy. If you are interested, I strongly suggest giving the paper a read; it will be well worth your time.
