4 years, 3 months ago
In the second part of the Recent Advances in Deep Learning for Object Detection series, we will summarize three aspects of object detection: proposal generation, feature representation learning, and learning strategy.
Proposal Generation
A proposal generator produces a set of rectangular bounding boxes that potentially contain objects.
Traditional Computer Vision Methods
These methods generate proposals by i) computing the 'objectness score' of a candidate box; ii) merging super-pixels from original images; or iii) generating multiple foreground and background segments. The primary advantage of these traditional computer vision methods is that they are very simple and can generate proposals with high recall. However, they are mainly based on low-level visual cues such as color or edges, and they cannot be jointly optimized with the whole detection pipeline. Thus they are unable to exploit the power of large-scale datasets to improve representation learning.
Anchor-based Methods
Region Proposal Network (RPN)
A 256-dimensional feature vector was extracted from each anchor and fed into two sibling branches: a classification layer and a regression layer. RPN first evaluated whether the anchor proposal was foreground or background, and performed the categorical classification in the next stage.
SSD - multi-scale anchors
Assigned categorical probabilities to each anchor proposal.
Single Shot Scale-invariant Face Detector (S3FD)
Based on SSD with carefully designed anchors to match the objects. Different anchor priors were designed according to the effective receptive field of different feature maps.
Dimension-Decomposition Region Proposal Network (DeRPN)
Used an anchor string mechanism to independently match object width and height. This helped match objects with large scale variance and reduced the search space.
Single-Shot Refinement Neural Network (RefineDet)
Refined the manually defined anchors in two steps. Significantly improved the anchor quality and final prediction accuracy in a data-driven manner.
Cascade R-CNN
Refined proposals in a cascaded way.
This was an improvement over other manually defined methods, but the customized anchors were still designed manually.
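To make the idea of manually designed anchors concrete, here is a minimal sketch of how a dense anchor grid can be generated from a list of scales and aspect ratios. The function name `make_anchors`, the stride, and the particular scale and ratio values are illustrative assumptions, not the settings of any specific detector mentioned above.

```python
import math

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (x1, y1, x2, y2) tuples in image coordinates.

    One anchor is created per (cell, scale, ratio); ratio means height / width.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell projected back onto the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the area at scale**2 while changing the shape.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Hypothetical settings: a 2x3 feature map with stride 16.
anchors = make_anchors(feat_h=2, feat_w=3, stride=16, scales=(32, 64), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 6 cells * 2 scales * 3 ratios = 36
```

Every choice here (which scales, which ratios, which stride) is a hand-tuned hyperparameter, which is the limitation that the refinement and keypoint-based approaches try to reduce.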
Keypoints-based Methods
Modeled the distribution of being one of the 4 corner types of objects. This corner-based algorithm eliminated the design of anchors and became a more effective method to produce high-quality proposals.
Modeled information of top-left and bottom-right corners. Novel feature embedding methods and a corner pooling layer were used to correctly match keypoints belonging to the same objects, obtaining state-of-the-art results on public benchmarks.
Combined the ideas of center-based and corner-based methods. First predicted bounding boxes by pairs of corners, and then predicted center probabilities of the initial prediction to reject easy negatives.
Feature-Selection-Anchor-Free (FSAF)
An online feature selection block is applied to train multilevel center-based branches attached to each level of the feature pyramid.
Predicted two values: a zoom indicator and adjacency scores. The zoom indicator determined whether to further divide a region that may contain smaller objects, and the adjacency scores denoted its objectness. Better at matching sparse and small objects compared to RPN's anchor-object matching approach.
The anchor priors are manually designed with multiple scales and aspect ratios in a heuristic manner.
Other Methods
Feature Representation Learning
Three categories: multi-scale feature learning, contextual reasoning, and deformable feature learning.
Multi-scale feature learning
Contextual reasoning
Deformable feature learning
Image Pyramid
Resize input images into a number of different scales (an image pyramid) and train multiple detectors, each of which is responsible for a certain range of scales. Example: Scale Normalization for Image Pyramids (SNIP).
Integrated Features
Construct a single feature map by combining features from multiple layers, and make final predictions based on the newly constructed map. Examples: Inside-Outside Network (ION), HyperNet, Multi-scale Location-aware Kernel Representation (MLKP).
Prediction Pyramid
Predictions were made from multiple layers, where each layer was responsible for a certain scale of objects. Examples: SSD, Receptive Field Block Net (RFBNet).
Feature Pyramid
Combines the advantages of Integrated Features and Prediction Pyramid. Example: Feature Pyramid Network (FPN).
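The top-down merging used by a feature pyramid can be sketched in a few lines. This is a simplified illustration only: it works on single-channel maps stored as nested lists, assumes each level is exactly half the size of the previous one, and leaves out the lateral 1x1 convolutions and the smoothing convolutions that FPN applies. The helper names are made up for this example.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2-D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def top_down_merge(bottom_up):
    """Merge maps ordered fine to coarse; returns merged maps in the same order."""
    merged = [bottom_up[-1]]  # start from the coarsest, most semantic level
    for lateral in reversed(bottom_up[:-1]):
        up = upsample_nearest_2x(merged[0])
        # Element-wise sum of the upsampled coarse map and the finer map.
        fused = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]
        merged.insert(0, fused)
    return merged

# Toy bottom-up features: 4x4, 2x2 and 1x1 maps with constant values.
c2 = [[1.0] * 4 for _ in range(4)]
c3 = [[10.0] * 2 for _ in range(2)]
c4 = [[100.0]]
pyramid = top_down_merge([c2, c3, c4])
print(pyramid[0][0][0])  # 111.0: the finest level now carries coarse-level information
```

Predictions are then made on every merged level, which is how the approach keeps the per-scale predictions of a prediction pyramid while sharing information across layers as integrated features do.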
Region Feature Encoding
ROI Pooling
Extracted features from the down-sampled feature map and as a result struggled to handle small objects.
ROI Warping
Encoded region features via bilinear interpolation. Due to the downsampling operation in DCNN, there can be a misalignment between the object position in the original image and in the downsampled feature maps.
ROI Align
Addressed the quantization issue by bilinear interpolation at fractionally sampled positions within each grid.
Precise ROI Pooling (PrROI Pooling)
Avoided any quantization of coordinates and had a continuous gradient on bounding box coordinates.
Position Sensitive ROI Pooling (PSROI Pooling)
Enhanced spatial information of the downsampled region features.
Outputs from both the ROI Pooling layer and the PSROI Pooling layer can also be combined: the ROI Pooling layer extracted global region information but struggled with highly occluded objects, while the PSROI Pooling layer focused more on local information.
Deformable ROI Pooling
Can automatically model the image content without being constrained by fixed receptive fields.
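The quantization issue that separates ROI Pooling from ROI Align can be seen at a single sampling point. The sketch below compares snapping a fractional coordinate to an integer cell with bilinear interpolation at the exact position. It is not a full ROI Align layer, which would average several such samples per output bin; the function names and the toy feature map are assumptions for illustration.

```python
import math

def bilinear_sample(fmap, y, x):
    """Interpolate a 2-D feature map at a fractional (y, x) position."""
    h, w = len(fmap), len(fmap[0])
    # Clamp so that the four neighbours always lie inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def quantized_sample(fmap, y, x):
    """Quantized lookup: snap the coordinate to an integer cell first."""
    return fmap[int(math.floor(y))][int(math.floor(x))]

fmap = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]

print(quantized_sample(fmap, 0.5, 0.5))  # 0.0: the fractional offset is lost
print(bilinear_sample(fmap, 0.5, 0.5))   # 2.0: average of the four neighbours
```

Because the interpolated value changes smoothly with the sampling position, small shifts of a region are reflected in its features, which matters most for small objects.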
Learning the relationship between objects and their surrounding context can improve the detector's ability to understand the scenario. Two aspects: global context and region context. Examples: Spatial Memory Network (SMN), Structure Inference Net (SIN), Gated Bi-Directional CNN (GBDNet).
Deformable feature learning makes detectors robust to nonrigid deformation of objects. Example: DeepIDNet developed a deformable-aware pooling layer to encode the deformation information across different object categories.
Learning Strategy
Learning strategies are used to tackle problems such as imbalanced sampling, localization, and acceleration.
Training Stage
Data Augmentation
Horizontal flips of training images are used in training the Faster R-CNN detector. A more intensive data augmentation strategy is used in one-stage detectors, including rotation, random crops, expanding, and color jittering.
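For detection, augmenting an image also means transforming its box annotations. A minimal sketch of a horizontal flip, assuming the image is a list of pixel rows and boxes are `(x1, y1, x2, y2)` tuples in continuous pixel coordinates running from 0 to the image width (the `hflip` name and the data layout are assumptions for this example):

```python
def hflip(image, boxes):
    """Flip an image (list of rows) and its (x1, y1, x2, y2) boxes left to right."""
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    # The left edge of a flipped box comes from the old right edge, and vice versa.
    flipped_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped_image, flipped_boxes

image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
boxes = [(0, 0, 1, 2)]  # a box covering the leftmost column

flipped_image, flipped_boxes = hflip(image, boxes)
print(flipped_image)  # [[4, 3, 2, 1], [8, 7, 6, 5]]
print(flipped_boxes)  # [(3, 0, 4, 2)]
```

Applying the flip twice returns the original image and boxes, which is a quick sanity check for this kind of transform.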
Imbalance Sampling
Hard negative sampling: negative proposals with higher classification loss were selected for training.
Focal loss: the gradient signals of easy samples were suppressed, which led the training process to focus more on hard proposals.
Gradient harmonizing mechanism (GHM): not only suppressed easy proposals but also avoided the negative impact of outliers.
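How focal loss suppresses easy samples can be sketched for a single binary prediction. The form used here is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), with gamma = 2 and alpha = 0.25 as commonly used defaults; the function names are chosen for this example.

```python
import math

def cross_entropy(p, target):
    """Binary cross-entropy for one prediction; p is the foreground probability."""
    p_t = p if target == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, target, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled down for well-classified samples."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy negative (predicted foreground probability 0.05) and a hard one (0.9).
easy = focal_loss(0.05, target=0)
hard = focal_loss(0.9, target=0)
print(easy / cross_entropy(0.05, 0))  # tiny weight on the easy sample
print(hard / cross_entropy(0.9, 0))   # much larger weight on the hard sample
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, so gamma controls how strongly easy samples are down-weighted.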
Localization Refinement
Examples: LocNet, MultiPath Network, Fitness NMS. Grid R-CNN replaced the linear bounding box regressor with a corner-based mechanism that locates corner keypoints.
Cascade Learning
A coarse-to-fine learning strategy which collects information from the output of the given classifiers to build stronger classifiers in a cascaded manner. RefineDet and Cascade R-CNN utilized cascade learning methods in refining object locations.
Adversarial learning: Perceptual GAN for small object detection learned high-resolution feature representations of small objects via an adversarial scheme.
Training from Scratch: motivated by two reasons. The bias of loss functions and data distribution between classification and detection can have an adverse impact on the performance, and transferring a classification model for detection in a new domain can lead to more challenges. Examples: DSOD (Deeply Supervised Object Detectors), gated recurrent feature pyramid.
Knowledge Distillation: distills the knowledge in an ensemble of models into a single model via a teacher-student training scheme.
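The teacher-student scheme behind knowledge distillation can be illustrated with a soft-target loss on classification logits. This is a generic sketch, not the training recipe of any detector above: it assumes the teacher logits are already available (for an ensemble they could be averaged first), and the temperature value and function names are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [4.0, 1.0, 0.5]
matching = distillation_loss([4.0, 1.0, 0.5], teacher_logits)
uninformed = distillation_loss([0.0, 0.0, 0.0], teacher_logits)
print(matching < uninformed)  # True: the loss is lowest when the student mimics the teacher
```

In practice this term is usually combined with the ordinary supervised loss on ground-truth labels.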
Testing Stage
Duplicate Removal
Non-maximum suppression (NMS): a predefined threshold can cause a correct prediction to be missed when it overlaps heavily with a higher-scoring box, and this scenario is very common in clustered object detection.
Soft-NMS: decayed the confidence score of an overlapping box B with a continuous function F instead of discarding it. This avoided eliminating predictions of clustered objects.
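The difference between the two duplicate-removal strategies is easiest to see side by side. The sketch below implements greedy NMS and a Soft-NMS variant with a Gaussian decay, exp(-IoU^2 / sigma), as the continuous function. The threshold, sigma, and helper names are assumptions chosen for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Soft-NMS with Gaussian decay: return (index, final score) pairs."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        for i in remaining:
            # Lower the score smoothly instead of removing the box outright.
            scores[i] *= math.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep

# Two heavily overlapping boxes (IoU = 2/3) and one separate box.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]

hard_keep = nms(boxes, scores)
soft_keep = soft_nms(boxes, scores)
print(hard_keep)                   # [0, 2]: box 1 is dropped entirely
print([i for i, _ in soft_keep])   # [0, 2, 1]: box 1 survives with a lower score
```

If boxes 0 and 1 were two distinct objects standing close together, greedy NMS would lose one of them, while Soft-NMS keeps it as a lower-ranked detection.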
Model Acceleration
Examples: R-FCN, Light Head R-CNN, MobileNet with depth-wise convolution layers. Models can also be optimized off-line, for example with model compression and quantization, or with an acceleration toolkit such as TensorRT.
Conclusion and further thoughts
This series gives you an overview of several critical parts you might find in deep learning for object detection as well as how they build upon each other. Finally, let's conclude the series with the network structure of Faster RCNN with FPN.