9 months ago
When it comes to training a custom object detection model, the TensorFlow Object Detection API and MMDetection (PyTorch) are two readily available options. I have shown you how to use both, even on Google Colab's free GPU resources.
Both frameworks are easy to use: they offer a simple configuration interface and let the framework source code do the heavy lifting. But have you ever wondered how deep learning object detection algorithms have evolved over the years, and what their pros and cons are?
I find the paper "Recent Advances in Deep Learning for Object Detection" a really good answer to this question. Let me summarize what I have learned and, hopefully, explain it in a more intuitive way.
Text colors: pro/ cons
Detection Paradigms
Two-stage Detectors
SPP-net: Spatial Pyramid Pooling
Fast R-CNN
Introduced the ROI pooling layer. Feature extraction, region classification and bounding box regression could all be optimized end-to-end, with no extra cache space needed to store features. However, the proposal generation step still relied on traditional methods such as Selective Search or Edge Boxes.
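To make ROI pooling a bit more concrete, here is a minimal pure-Python sketch. It assumes a single-channel feature map stored as a list of rows and an ROI already expressed in feature-map coordinates. The function name `roi_max_pool` and the bin-edge rounding are my own illustrative choices, not the Fast R-CNN implementation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map into a fixed out_h x out_w grid.

    feature_map: list of rows (H x W), single channel for simplicity.
    roi: (x0, y0, x1, y1) in feature-map coordinates, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin edges: floor for the start, ceil for the end, so bins cover the ROI.
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A 4 x 4 toy feature map with values 0..15.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
full = roi_max_pool(fmap, (0, 0, 4, 4))    # 4 x 4 region -> 2 x 2 output
small = roi_max_pool(fmap, (1, 1, 4, 4))   # 3 x 3 region -> 2 x 2 output
```

Regions of different sizes end up with the same output shape, which is what lets every proposal share the same fully connected layers afterwards.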
The whole detection framework could not be optimized in an end-to-end manner, making it difficult to obtain a globally optimal solution.
The Spatial Pyramid Pooling (SPP) layer achieved better results and had a significantly faster inference speed. However, the training of SPP-net was still multi-stage, so it could not be optimized end-to-end. The SPP layer did not back-propagate gradients to the convolutional kernels, so all the parameters before the SPP layer were frozen, which limited the learning capability of deep backbone architectures.
Faster R-CNN
Introduced the Region Proposal Network (RPN). The whole framework could be optimized in an end-to-end manner on training data. However, the computation was not shared in the region classification step, where each feature vector still needed to go through a sequence of FC layers separately. It also used a single deep-layer feature map to make the final prediction, which made it difficult to detect objects at different scales.
Region-based Fully Convolutional Networks (R-FCN)
Position-sensitive score map. The detector achieved competitive results compared to Faster R-CNN without region-wise fully connected layer operations.
Feature Pyramid Networks (FPN)
Enabled object detection in feature maps at different scales.
One-stage Detectors
In order to detect multiscale objects, the input image was resized into multiple scales, which were fed into the network. Finally, the predictions across all the scales were merged together. However, the classifiers and regressors were trained separately, without being jointly optimized.
Divided the whole image into a fixed number of grid cells. YOLO faced some challenges: i) it could detect at most two objects at a given location, which made it difficult to detect small objects and crowded objects; ii) only the last feature map was used for prediction, which was not suitable for predicting objects at multiple scales and aspect ratios.
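Here is a rough sketch of how an object gets tied to a grid cell. It assumes the original YOLO convention that the cell containing the box center is responsible for the object, and a 7 × 7 grid on a 448 × 448 input. The function name `responsible_cell` is hypothetical.

```python
def responsible_cell(box, image_w, image_h, grid=7):
    """Return the (row, col) of the grid cell that contains the box center.

    box: (x0, y0, x1, y1) in pixels. grid=7 is an assumed default.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Clamp so a center exactly on the right/bottom edge stays inside the grid.
    col = min(int(cx / image_w * grid), grid - 1)
    row = min(int(cy / image_h * grid), grid - 1)
    return row, col


# Two small, close-together objects in a 448 x 448 image.
cell_a = responsible_cell((200, 100, 248, 148), 448, 448)
cell_b = responsible_cell((210, 110, 250, 150), 448, 448)
```

Both boxes land in the same cell, which illustrates why crowded scenes with small objects are hard for a detector that predicts only a couple of boxes per cell.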
Single-Shot MultiBox Detector (SSD)
A set of anchors with multiple scales and aspect ratios was generated to discretize the output space of bounding boxes, and objects were predicted on multiple feature maps (multiple scales). Used hard negative mining. The class imbalance between foreground and background is a severe problem in one-stage detectors.
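The anchor idea can be sketched in a few lines. This is an illustration under my own assumptions, not SSD's exact recipe: each anchor keeps an area of `scale ** 2`, and `ratio` is the width divided by the height. The helper name `make_anchors` is hypothetical.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Generate (x0, y0, x1, y1) anchors centered at (cx, cy).

    For each scale s and aspect ratio r (width / height), the anchor has
    width s * sqrt(r) and height s / sqrt(r), so its area stays s * s.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(50, 50, scales=(32, 64), ratios=(0.5, 1.0, 2.0))
```

Repeating this at every location of several feature maps gives the dense set of default boxes that the detector then classifies and refines.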
Focal loss, which suppressed the gradients of easy negative samples instead of simply discarding them. Used feature pyramid networks to detect multi-scale objects at different levels of feature maps.
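For a single binary prediction, the focal loss can be written as FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). Below is a minimal sketch of that formula. The defaults gamma = 2 and alpha = 0.25 are the values commonly quoted from the RetinaNet paper and are not taken from this post; the function name `focal_loss` is my own.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the foreground class, strictly in (0, 1).
    y: ground-truth label, 1 for foreground and 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, 0)   # background, confidently rejected
hard_negative = focal_loss(0.60, 0)   # background, wrongly scored as object
```

The easy negative contributes several orders of magnitude less loss than the hard one, so the flood of background anchors no longer dominates training.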
Adopted a more powerful deep convolutional backbone. YOLOv2 defined better anchor priors by k-means clustering on the training data (instead of setting them manually). Used multi-scale training techniques.
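Here is a small sketch of the anchor clustering idea. It assumes boxes are reduced to (width, height) pairs and uses IoU as the similarity, in the spirit of the 1 - IoU distance from the YOLOv2 paper. The deterministic initialization and the names `wh_iou` and `kmeans_anchors` are my own simplifications.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (width, height), aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs into k anchor priors."""
    # Deterministic init for the sketch: pick evenly spaced boxes.
    centroids = [boxes[i * len(boxes) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda c: wh_iou(b, centroids[c]))
            clusters[best].append(b)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster as-is
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (90, 45)]
priors = kmeans_anchors(boxes, k=2)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering just because their raw pixel differences are bigger.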
Anchor-free object detection: the goal was to predict keypoints of the bounding box, instead of trying to fit an object to an anchor.
CornerNet
Backbone Architecture
Conclusion and further reading
Appropriate ordering of Batch Normalization could perform even better than the original ResNet, and made it possible to successfully train a network with more than 1000 layers.
Reduced optimization difficulties by introducing shortcut connections. However, with the shortcut connection it did not fully utilize features from previous layers.
Retained the shallow-layer features and improved information flow by concatenating the input with the residual output instead of using element-wise addition.
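The difference between the two connection styles is easy to see in a toy sketch. Here a feature vector is just a list of channel values, and `f` stands in for the block's convolutional layers. Both function names and the stand-in `f` are hypothetical.

```python
def residual_block(x, f):
    """ResNet-style: element-wise addition, channel count stays the same."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def dense_block(x, f):
    """DenseNet-style: concatenation, earlier features are kept untouched."""
    return x + f(x)  # list concatenation, not arithmetic addition


def double(x):
    # Stand-in for learned layers: doubles every channel value.
    return [2 * v for v in x]


added = residual_block([1, 2, 3], double)
stacked = dense_block([1, 2, 3], double)
```

After the addition the original values are mixed into the output, while after the concatenation they are still available unchanged to later layers, at the cost of a growing channel count.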
Integrated the advantages of both ResNet and DenseNet. It divides the channels of x into two parts, one used for dense connection computation and the other for element-wise summation; the result is the concatenated output of the two branches.
Dual Path Network (DPN)
Considerably reduced computation and memory cost while maintaining comparable classification accuracy. Adopted group convolution layers, which sparsely connect feature map channels to reduce computation cost.
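A quick back-of-the-envelope sketch shows where the saving from group convolution comes from. It counts only the weights and ignores biases; the helper name `conv_params` and the example channel counts are my own.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution with the given group count.

    Each group connects c_in / groups input channels to c_out / groups
    output channels, so the total shrinks by a factor of `groups`.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k


dense_weights = conv_params(256, 256, 3)               # ordinary convolution
grouped_weights = conv_params(256, 256, 3, groups=32)  # 32 groups
```

With 32 groups the layer needs 32 times fewer weights, because each output channel only looks at the input channels of its own group.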
Significantly reduced computation cost as well as the number of parameters without significant loss in classification accuracy.
Increased model width to improve the learning capacity by applying convolution kernels of different scales (1 × 1, 3 × 3 and 5 × 5) to the same feature map in a given layer.
Kept high-resolution feature maps for prediction with dilated convolutions to increase receptive fields. Detected objects on multi-scale feature maps.
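To see why dilation enlarges the receptive field without downsampling, here is a small sketch. It assumes a plain stack of stride-1 convolutions, where each layer adds (kernel - 1) * dilation pixels to the receptive field. The function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    layers: sequence of (kernel_size, dilation) pairs, applied in order.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf


plain = receptive_field([(3, 1), (3, 1), (3, 1)])     # three ordinary 3 x 3 convs
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])   # dilations 1, 2, 4
```

The dilated stack sees a 15-pixel window instead of 7 with the same number of weights, while the feature map keeps its resolution.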
Hourglass Network
First appeared in the human pose recognition task. It first downsampled the input image via a sequence of convolutional or pooling layers, then upsampled the feature map via deconvolutional operations. To avoid information loss in the downsampling stage, skip connections were used between the downsampling and upsampling features.
Increasing depth led to better performance, but also led to optimization challenges.
This quick post summarized recent advances in deep learning object detection in three aspects: two-stage detectors, one-stage detectors and backbone architectures. Next time you are training a custom object detection model with a third-party open-source framework, you will feel more confident selecting the best option for your application by examining the pros and cons of each.
In the next post, I will finish up what we have left in the paper, namely proposal generation, feature representation learning, and learning strategy. If you are interested, I strongly suggest giving the paper a read; it will be well worth your time.