[Paper Review] Feature Pyramid Networks for Object Detection (FPN, 2017)

Outlines

Reference
Motivation : Detecting Objects at Multiple Scales
Previous Works for Representing Multi-Scale Features
Feature Pyramid Networks : Multi-Scale Features with Consistently Strong Semantics
Application of FPN in Faster R-CNN
Object Detection Performance Using FPN as the Backbone of RPN

Reference

Motivation : Detecting Objects at Multiple Scales

The main goal behind Feature Pyramid Networks (FPN) is to address the challenge of detecting objects at multiple scales.
FPN introduces a pyramid-like architecture that combines features at different resolutions so that objects of different sizes can be detected more reliably.
As the model gets deeper with convolutional strides greater than 1, the resolution of the feature maps decreases.
- The reduced resolution hurts the model's performance in detecting small objects.
- Small objects usually contain fine-grained details that low-resolution feature maps cannot precisely distinguish because of their limited spatial resolution.
However, feature maps obtained from deeper layers capture higher-level semantics, representing more comprehensive and complex patterns that play a critical role in object classification.

Balancing this trade-off between resolution and the level of features poses a significant challenge in object detection.
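
To make the resolution side of this trade-off concrete, the short sketch below computes the feature-map size at several strides and how many feature-map cells a small object spans. The 800-pixel input and the 24-pixel object are illustrative assumptions, not values from the paper; the strides are those of the backbone stages discussed later in this review.

```python
import math

def feature_map_size(image_size, stride):
    """Side length of the feature map for a square input (assumes padding rounds up)."""
    return math.ceil(image_size / stride)

def cells_spanned(object_size, stride):
    """Approximate number of feature-map cells an object covers along one side."""
    return object_size / stride

image_size = 800   # assumed input side length in pixels
object_size = 24   # assumed side length of a small object in pixels

for stride in (4, 8, 16, 32):
    size = feature_map_size(image_size, stride)
    span = cells_spanned(object_size, stride)
    print(f"stride {stride:2d}: map {size}x{size}, object spans {span:.2f} cells per side")
```

At stride 32 the assumed object covers less than one cell per side, which is why its fine details are hard to recover from the deepest map alone.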

Previous Works for Representing Multi-Scale Features

Figure 1

Featurized Image Pyramid : Fig. 1.(a)
- Uses an image pyramid with multiple resolutions, where the features for each level are computed independently from the corresponding image scale.
- Able to capture multiple levels of semantics, and the feature representations at all levels are semantically strong.
- However, it is impractical: the amount of required computation is multiplied by the number of pyramid levels.
- In practice it is often used only at test time, which creates an inconsistency between training and test-time inference.
Single Feature Scale : Fig. 1.(b)
- Use a feature map with a single scale from an image (also single resolution).
- Computationally efficient, but performs poorly when object scale varies, since a single feature scale cannot properly handle changes in object size.
Pyramidal Feature Hierarchy : Fig. 1.(c)
- Instead of an image pyramid, builds the pyramid from feature maps of multiple scales computed from a single-scale image.
- Takes an output from each level of the pyramid, which makes it possible to represent multi-scale features.
- However, the feature maps extracted at each level come from different depths of the network, which leads to a large semantic gap between them.
- Low-level features from the higher-resolution maps are too general and simple, lacking sufficient representational capacity to differentiate between various objects.
- To avoid using features that are too low-level, the pyramid can be adjusted to start from an intermediate depth.
- But this approach still has a problem: the higher-resolution maps cannot be incorporated into the feature hierarchy.
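
A rough way to see why the featurized image pyramid is expensive is to count the pixels the backbone must process. The sketch below assumes that compute scales linearly with the number of input pixels and uses hypothetical scale factors; it is an illustration, not a measurement from the paper.

```python
def relative_cost(scales):
    """Total cost of running the backbone once per scale, relative to one pass at scale 1.0.

    Assumes cost is proportional to the number of input pixels (scale squared).
    """
    return sum(s * s for s in scales)

# Hypothetical image pyramid: the image is resized by each factor and the
# full backbone is run independently on every resized copy.
pyramid_scales = [0.5, 1.0, 1.5, 2.0]

print(relative_cost([1.0]))           # single-scale baseline
print(relative_cost(pyramid_scales))  # image pyramid
```

By contrast, the single-scale and pyramidal-feature-hierarchy approaches run the backbone only once per image.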

Feature Pyramid Networks : Multi-Scale Features with Consistently Strong Semantics

FPN consists of
Bottom-Up Pathway
- Responsible for the feed-forward computation of feature maps at multiple levels, one per stage of the backbone.
- Each feature map (the output of the last layer of each stage) is denoted as $\large C_{level}$.
- $\{C_{2}, C_{3}, C_{4}, C_{5}\}$ are the outputs of Conv2, Conv3, Conv4, and Conv5, with strides of 4, 8, 16, and 32 pixels with respect to the input image, respectively.
- Conv1 is not included in the pyramid because of its large memory footprint.
Top-Down Pathway and Lateral Connections
- The top-down pathway upsamples the higher-level features by a factor of 2 and combines them, by element-wise addition, with the lower-level maps passed through the lateral connections.
- The merged output at each level of the top-down pathway is passed down to the next level of the pyramid to generate the next feature map.
- This lets the network hallucinate higher-resolution features that are also semantically strong.
- The final feature map at each level, computed from the merged output (the paper applies a 3x3 convolution to each merged map), is denoted as $\large P_{level}$. Thus, $\{P_{2}, P_{3}, P_{4}, P_{5}\}$ have the same spatial sizes as $\{C_{2}, C_{3}, C_{4}, C_{5}\}$, respectively.
By combining these pathways, FPN simultaneously captures spatially precise lower-level features and complex higher-level features that originate from maps with lower spatial resolution.
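
The upsample-and-add step of the top-down pathway can be sketched in plain Python as follows. This is a minimal illustration under strong simplifying assumptions: each feature map is a single-channel 2D list, and the 1x1 lateral convolutions and 3x3 output convolutions are treated as identity, so only the merging logic is shown. The helper names (`upsample2x`, `add_maps`, `top_down_merge`, `constant_map`) are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor upsampling of a 2D map by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def add_maps(a, b):
    """Element-wise sum of two 2D maps with identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down_merge(c_maps):
    """Merge bottom-up maps [C2, C3, C4, C5] into [P2, P3, P4, P5].

    Simplification: lateral 1x1 and output 3x3 convolutions are identity here.
    """
    merged = [c_maps[-1]]            # the coarsest level starts the pathway
    for c in reversed(c_maps[:-1]):  # walk from C4 down to C2
        merged.insert(0, add_maps(c, upsample2x(merged[0])))
    return merged

def constant_map(size, value):
    """Square toy feature map filled with a single value."""
    return [[value] * size for _ in range(size)]

# Toy bottom-up outputs: each level halves the spatial size of the previous one.
c2, c3, c4, c5 = (constant_map(s, v) for s, v in [(8, 1), (4, 10), (2, 100), (1, 1000)])
p2, p3, p4, p5 = top_down_merge([c2, c3, c4, c5])
print(len(p2), len(p2[0]), p2[0][0])  # prints: 8 8 1111
```

In the toy output, every cell of the finest map carries a contribution from all four levels, which is the sense in which high-resolution maps become semantically strong.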

Application of FPN in Faster R-CNN

FPN is employed as the feature extractor for the region proposal network (RPN), replacing the originally used VGG-16 backbone, which outputs a single-scale feature map.

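In the figure, the feature pyramid used for the RPN consists of 5 levels, $P_{2}$ to $P_{6}$ (in the paper, $P_{6}$ is obtained by subsampling $P_{5}$ with stride 2).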
Anchors of a single scale are assigned to the output of each level.
- The anchors at levels $\{P_{2}, P_{3}, P_{4}, P_{5}, P_{6}\}$ have areas of $32^{2}, 64^{2}, 128^{2}, 256^{2}, 512^{2}$ pixels, respectively.
- The aspect ratios applied to each anchor are $\{1:2, 1:1, 2:1\}$.
- In total there are 15 anchors over the pyramid, 3 per level.
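
The anchor configuration above can be enumerated with a few lines of Python. This is a sketch: the names `LEVEL_SIDES`, `ASPECT_RATIOS`, and `anchor_shape` are hypothetical, and interpreting each ratio as height divided by width is an assumption (the opposite convention gives the same set of shapes with the order reversed).

```python
import math

# Anchor side length per pyramid level (area = side ** 2), as listed above.
LEVEL_SIDES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # assumed to mean height / width

def anchor_shape(side, ratio):
    """Return (width, height) with area side**2 and height / width == ratio."""
    width = side / math.sqrt(ratio)
    height = side * math.sqrt(ratio)
    return width, height

anchors = {
    level: [anchor_shape(side, r) for r in ASPECT_RATIOS]
    for level, side in LEVEL_SIDES.items()
}

for level, shapes in anchors.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in shapes])
print(sum(len(shapes) for shapes in anchors.values()))  # total anchor shapes: 15
```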
After extracting multi-scale features from FPN, a network head for object/non-object classification and bounding box regression is attached to each level.
- The head is realized by a 3x3 convolutional layer followed by two sibling 1x1 convolutions, one for classification and one for regression, both defined with respect to a set of reference anchors.
To label each anchor as positive (object) or negative (non-object), an anchor is positive if its IoU with a ground-truth box is over 0.7 and negative if its IoU is below 0.3 for all ground-truth boxes.
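
The thresholding rule can be sketched as below. This is a simplified illustration: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and the additional rule used by RPN (the anchor with the highest IoU for a given ground-truth box is also positive) is omitted for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (object), 0 (non-object), or -1 (ignored during training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1  # between the two thresholds: contributes nothing to the loss

gt_boxes = [(0, 0, 100, 100)]
print(label_anchor((0, 0, 100, 100), gt_boxes))      # perfect overlap
print(label_anchor((0, 0, 100, 50), gt_boxes))       # IoU of 0.5
print(label_anchor((200, 200, 300, 300), gt_boxes))  # no overlap
```

Anchors that fall between the two thresholds are neither positive nor negative and are left out of the training objective.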

Mapping RoIs of Varying Sizes to the Feature Map of Each Pyramid Level

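An RoI of width $w$ and height $h$ (on the input image) is assigned to the level $P_{k}$ given by
- $\large k = \left\lfloor k_{0} + \log_2 \left(\frac{\sqrt{wh}}{224} \right) \right\rfloor$
Here 224 is the canonical ImageNet pre-training size, and the paper sets $k_{0} = 4$. This means that a smaller RoI is mapped to a finer-resolution feature map: for example, an RoI whose scale $\sqrt{wh}$ is 1/2 of 224 is assigned to $\large k = k_{0} - 1$.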
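
A small sketch of this assignment rule follows. The function name `roi_to_level` is hypothetical, $k_0 = 4$ follows the paper, and clamping the result to the range of available levels ($P_2$ to $P_5$) is an assumption added here so that very small or very large RoIs still map to a valid level.

```python
import math

def roi_to_level(w, h, k0=4, k_min=2, k_max=5, canonical_size=224):
    """Pyramid level index k for an RoI of width w and height h (in input pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return max(k_min, min(k_max, k))  # clamping is an assumption, see the note above

print(roi_to_level(224, 224))  # canonical size -> level 4
print(roi_to_level(112, 112))  # half the canonical scale -> level 3
print(roi_to_level(448, 448))  # twice the canonical scale -> level 5
print(roi_to_level(10, 10))    # very small RoI, clamped -> level 2
```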

Visualization of outputs from each layer $\large P_{level}$

From the figure above, it seems quite evident that as the depth of pyramidal level increases, feature maps tend to lose precise spatial information due to limited spatial resolution.
As the level increases, large objects become increasingly clear and identifiable.
However, fine details of smaller objects gradually disappear and become undetectable in the higher-level feature maps.
This demonstrates the importance of integrating both higher-resolution lower-level features and lower-resolution higher-level features into the final feature maps in order to capture multi-scale objects effectively in object detection.

Object Detection Performance Using FPN as the Backbone of RPN

Table 1. Ablation Studies : Bounding box proposal results using RPN

Table 4. Comparisons of single-model results on the COCO detection benchmark.

Baseline RPN backbone: single-scale map of $C_{4}$ or $C_{5}$
Evaluated with COCO-style Average Recall (AR), including $AR$ on small, medium, and large objects ($AR_{s}$, $AR_{m}$, and $AR_{l}$). Results are also reported for 100 and 1000 proposals per image ($AR^{100}$ and $AR^{1k}$).
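
As a rough illustration of the metric, the sketch below averages recall over the IoU thresholds 0.50 to 0.95 in steps of 0.05. It is a simplified stand-in, not the official COCO evaluation code: the input is assumed to be, for each ground-truth box, the best IoU achieved by any of the top-N proposals in its image, and the function name `average_recall` is hypothetical.

```python
def average_recall(best_ious, thresholds=None):
    """Mean recall over IoU thresholds.

    best_ious: for each ground-truth box, the highest IoU reached by any of the
    top-N proposals in its image (0.0 if no proposal overlaps it).
    """
    if thresholds is None:
        thresholds = [(50 + 5 * i) / 100 for i in range(10)]  # 0.50, 0.55, ..., 0.95
    if not best_ious:
        return 0.0
    recalls = [
        sum(v >= t for v in best_ious) / len(best_ious)  # recall at threshold t
        for t in thresholds
    ]
    return sum(recalls) / len(recalls)

# Toy example with four ground-truth boxes.
print(average_recall([0.97, 0.72, 0.52, 0.10]))
```

Under this simplification, $AR_{s}$, $AR_{m}$, and $AR_{l}$ would be obtained by restricting `best_ious` to ground-truth boxes of the corresponding size, and $AR^{100}$ or $AR^{1k}$ by changing N.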
Table 1(d).
- The bottom-up pathway alone is not much better than the baselines.
- This is because FPN with only the bottom-up pathway is identical to the pyramidal feature hierarchy of previous works, which suffers from a large semantic gap between the feature maps of different levels.
- Higher-resolution features that are not merged with lower-resolution, higher-level feature maps lack representational capacity, which impairs the overall performance of the model.
Table 1(e).
- The top-down pathway without lateral connections shows even poorer AR scores than the baseline in some cases. This suggests that spatial information is not well preserved through the repeated downsampling and upsampling, whereas the lateral connections in the full FPN would restore it.
Table 4.
- Overall, the FPN-based Faster R-CNN outperforms the other single-model entries compared in the table, with especially strong results on small objects.

To summarize, FPN builds a pyramid of multi-scale feature maps by combining a bottom-up pathway with a top-down pathway and lateral connections, overcoming the limitations of prior approaches in detecting small objects.
Using FPN, a model can capture the fine details of small objects with higher-resolution features while consistently maintaining strong semantic representations through the lower-resolution but higher-level features passed down from the upper levels of the pyramid.