[Paper Review] Path Aggregation Network for Instance Segmentation (PANet, 2018)


2023, Jul 05    

Architecture of Path Aggregation Network (PANet)


   Figure 1. Illustration of our framework.



  • Path Aggregation Network (PANet) is an improvement over the Feature Pyramid Network (FPN) used in Mask R-CNN for instance segmentation.

  • With novel structures added to the FPN backbone (Figure 1.(a)), PANet boosts the information flow for instance segmentation.


Figure 1.(b) Bottom-Up Path Augmentation


   Figure 2. Building block of Bottom-Up Augmentation Path



  • While FPN introduced the concept of a top-down pathway that combines high-level semantic information with low-level spatial details, PANet further improves this by incorporating a bottom-up pathway that augments the information flow from low-level to higher levels.

  • While low-level patterns lack semantic capacity, they possess relatively accurate instance localization, with high responses to edges, which is crucial in instance segmentation.

  • Hence, propagating low-level features to higher-level maps significantly enhances the localization capability of the entire feature hierarchy.

  • Despite the presence of a path connecting low-level structures to the topmost features in FPN, this path is excessively long, extending over 100 layers (red dashed line in Figure 1.).

  • The bottom-up path introduced in PANet effectively shortens this path to fewer than 10 layers (green dashed line in Figure 1.), with extra lateral connections projecting from the feature map at each level of the top-down pathway.

  • By creating a shortcut from low levels to higher levels of the pyramid, PANet can transmit the much stronger and better-preserved localization information stored in lower-level features across the entire pyramid, compared to FPN; a minimal sketch of the building block follows.
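
Below is a minimal PyTorch sketch of the building block in Figure 2, assuming 256-channel FPN outputs P2–P5; the class and variable names are mine, not the paper's. Each new map N_{i+1} is built by downsampling N_i with a stride-2 3 x 3 convolution, adding the lateral feature P_{i+1}, and smoothing the sum with another 3 x 3 convolution (each convolution followed by ReLU, as in the paper).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomUpPathAugmentation(nn.Module):
    """Sketch of PANet's bottom-up augmentation path (names are illustrative)."""

    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # One stride-2 conv (downsample) and one 3x3 conv (smooth) per new level.
        self.down_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1)
        )
        self.smooth_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1)
            for _ in range(num_levels - 1)
        )

    def forward(self, fpn_feats):
        # fpn_feats: [P2, P3, P4, P5], finest resolution first.
        outs = [fpn_feats[0]]  # N2 is simply P2, with no extra processing.
        for down, smooth, lateral in zip(self.down_convs, self.smooth_convs, fpn_feats[1:]):
            n = F.relu(down(outs[-1]))       # downsample N_i to the next level's size
            n = F.relu(smooth(n + lateral))  # add lateral P_{i+1}, then smooth
            outs.append(n)
        return outs  # [N2, N3, N4, N5]
```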


Figure 1.(c) Adaptive Feature Pooling


   Figure 6. Illustration of Adaptive Feature Pooling



  • In FPN, each proposal is assigned to a single feature level according to its size: small proposals are assigned to low-level features with high resolution, and large proposals to higher-level features with lower resolution (a $w \times h$ proposal goes to level $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$, with $k_0 = 4$).

  • This strategy is based on the insight that smaller objects are more sensitive to spatial resolution, which is needed to maintain fine-grained details, whereas larger objects are largely robust to small details and depend more on the richer semantic context captured by large receptive fields.

  • Although simple and effective, this scale-based separation of levels can lead to non-optimal results: two proposals differing by an insignificant margin (say, 10 pixels) can be assigned to different levels and used to make separate predictions.

  • Further, the authors suggest that the importance of features may not be strictly related to the size of objects.

  • Based on these ideas, they added an adaptive feature pooling layer that pools features for each proposal from all levels and fuses them into a single integrated map (see the sketch after this list).

  • Allowing small proposals access to the richer context information captured in higher levels, and large proposals access to the low-level features that contain fine details and precise localization, helps the network extract features that are more beneficial for the subsequent prediction tasks.
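
The sketch below illustrates the idea using torchvision's RoIAlign; the fusion point is simplified. In the paper, each proposal's features are pooled from every level and fused with an element-wise operation (max worked best in their experiments) after the first fc/conv layer of the head, whereas here the pooled grids are fused directly for brevity. The function name and the assumed strides are mine.

```python
import torch
from torchvision.ops import roi_align

def adaptive_feature_pool(features, rois, output_size=7):
    """Pool each proposal from every pyramid level and fuse element-wise.

    features: list of (N, C, H_l, W_l) maps [N2..N5], finest first
    rois:     (R, 5) tensor of (batch_idx, x1, y1, x2, y2) in image coordinates
    """
    strides = [4, 8, 16, 32]  # assumed strides of N2..N5 w.r.t. the input image
    pooled = [
        roi_align(feat, rois, output_size, spatial_scale=1.0 / s, sampling_ratio=2)
        for feat, s in zip(features, strides)
    ]
    # Element-wise max keeps, at each grid position, the strongest response
    # across all pyramid levels, so no proposal is tied to a single level.
    return torch.stack(pooled).max(dim=0).values  # (R, C, 7, 7)
```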


   Figure 3. Ratio of Features Pooled from Different Feature Levels



  • Each colored line represents proposals of a certain size (those originally assigned to a designated level in FPN), and the horizontal axis denotes the source of the pooled features.

  • The figure shows how features extracted from different levels are distributed across proposals of different sizes.

  • While the ratios vary, features from all levels coexist in each proposal, indicating that multiple levels of features contribute to proposals of any single scale.


Figure 1.(e) Fully-Connected Fusion


  • Mask R-CNN adopted a tiny Fully Convolutional Network (FCN) to predict masks instead of fully-connected layers (fc layers), based on the idea that mask prediction is a dense pixel-wise segmentation task that should preserve the spatial layout of feature maps rather than flatten them into a vector.

  • However, PANet combines these two structures, utilizing both an FCN and fc layers for instance segmentation, to exploit the distinct advantages each can provide.

  • While an FCN gives pixel-wise predictions with parameters shared across locations, fc layers assign different weights to each location, making them location-sensitive and able to draw on information from the entire proposal.

  • By combining these two properties, the network can adapt to specific spatial locations while also using the global semantic context learned from the entire feature map.


   Figure 4. Mask prediction branch with FF


  • Main Path (tiny FCN)

    • Consists of 4 convolutional layers (each with 256 filters of size 3 x 3) followed by one deconvolutional layer with an upsampling factor of 2.

    • Predicts a binary pixel-wise mask for each class, decoupling the classification task from the instance segmentation task.

  • Shorter Path (fc layer)

    • Initially branches from conv3 of the main path, then passes through 2 convolutional layers (conv4_fc and conv5_fc, both with 3 x 3 filters), with the latter halving the number of channels to reduce computational cost.

    • The output of the final conv layer (conv5_fc) enters a single fc layer that produces a 784 x 1 x 1 output, which is subsequently reshaped to 28 x 28, the same size as the mask predicted by the FCN.

    • The paper explains that using only one fc layer prevents the original spatial pattern from collapsing too much under repeated hidden layers.

  • The two outputs from the paths are aggregated by element-wise addition to produce the final mask prediction (a minimal sketch of this head follows).
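
Putting the two paths together, here is a minimal PyTorch sketch of the fused mask head described above. The conv4_fc/conv5_fc names follow the paper's figure; the 14 x 14 RoI input size and the class count are assumptions (Mask R-CNN's usual settings for COCO).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedMaskHead(nn.Module):
    """Sketch of PANet's mask branch with fully-connected fusion (Figure 4)."""

    def __init__(self, in_channels=256, num_classes=80, roi_size=14):
        super().__init__()
        # Main path: conv1-conv4 (256 filters of 3x3) + deconv (x2 upsampling).
        self.convs = nn.ModuleList(
            nn.Conv2d(in_channels if i == 0 else 256, 256, 3, padding=1)
            for i in range(4)
        )
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask_pred = nn.Conv2d(256, num_classes, 1)  # per-class masks
        # Short path: two convs (the second halves the channels) + one fc layer.
        self.conv4_fc = nn.Conv2d(256, 256, 3, padding=1)
        self.conv5_fc = nn.Conv2d(256, 128, 3, padding=1)
        self.fc = nn.Linear(128 * roi_size * roi_size, 28 * 28)

    def forward(self, x):  # x: (R, 256, 14, 14) pooled RoI features
        for i, conv in enumerate(self.convs):
            x = F.relu(conv(x))
            if i == 2:  # branch off after conv3
                y = F.relu(self.conv4_fc(x))
                y = F.relu(self.conv5_fc(y))
                # One fc layer -> 784-d vector -> class-agnostic 28x28 mask.
                y = self.fc(y.flatten(1)).view(-1, 1, 28, 28)
        x = F.relu(self.deconv(x))  # upsample 14x14 -> 28x28
        masks = self.mask_pred(x)   # (R, num_classes, 28, 28)
        return masks + y            # fuse the two predictions by addition
```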


Performance Comparison of PANet


Component Ablation Studies


   Table 3. Performance in terms of mask AP and box AP ($AP^{bb}$)


  • Abbreviations: MRB is the Mask R-CNN baseline reported in its original paper; RBL is the re-implemented baseline. Starting from RBL, the authors gradually add multi-scale training (MST), multi-GPU synchronized batch normalization (MBN), bottom-up path augmentation (BPA), adaptive feature pooling (AFP), fully-connected fusion (FF), and a heavier head (HHD) for the ablation studies.

  • Every step brings a slight improvement over the previous state, and when all of these new components are combined, performance improves by roughly 4 percentage points on average across all metrics compared to RBL.


Comparison with Other Models on Fine Annotations and COCO Datasets