[Paper Review & Implementation] MobileNet V1, V2, V3 (2017 - 2019)


2023, Jun 12    


MobileNet V1


  • For resource-constrained environments such as mobile devices, it is very important to build computationally efficient networks.
  • MobileNet V1 efficiently trades off accuracy against complexity by adopting depthwise separable convolution filters.


Depthwise Separable Convolution



  1. Depthwise Convolution
    • Identical to a grouped convolution whose cardinality (number of groups) equals the number of input channels.
    • Only performs spatial convolution: each input channel gets its own single filter, and the resulting feature maps are concatenated.
    • Each feature map contains the spatial representation within one channel.
    • This reduces the network complexity, as each filter is only responsible for convolving with its corresponding input channel.
  2. Pointwise Convolution
    • Uses a 1x1 convolution filter to perform channel-wise convolution, linearly combining the input channels into new sets of feature maps.
  • The two convolutions combined perform roughly the same operation as a standard convolution with no significant decrease in accuracy, while dramatically reducing the computational cost (a minimal sketch follows).
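
A minimal PyTorch sketch of this idea (an illustration, not the exact layer stack from the paper; the BatchNorm-ReLU pattern after each convolution follows MobileNet V1's usage):

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups == in_channels)
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, stride=stride, padding=1,
                      groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True))
        # pointwise: 1x1 convolution that linearly combines the channels
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 56, 56])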


Comparison of Computational Cost


  


  • H, W : input shape per channel
  • C : the number of input channels
  • F : the number of output features
  • K : kernel size of each filter
  • standard convolution
    • H x W x K x K x C x F
  • depthwise separable convolution
    • H x W x K x K x C + H x W x C x F
  • ratio of depthwise separable to standard convolution cost
    • (H x W x K x K x C + H x W x C x F) / (H x W x K x K x C x F) = 1/F + 1/K^2 (checked numerically below)
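
A quick numeric check of this ratio, using illustrative sizes (H = W = 14, C = F = 144, K = 3 are arbitrary example values, not taken from the paper):

H, W, C, F, K = 14, 14, 144, 144, 3

standard  = H * W * K * K * C * F                # 36,578,304 multiplications
separable = H * W * K * K * C + H * W * C * F    #  4,318,272 multiplications

print(separable / standard)     # 0.1180...
print(1 / F + 1 / K ** 2)       # 0.1180... (matches 1/F + 1/K^2)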


Trade-Off : Accuracy vs Complexity


  


  1. Width Multiplier
    • $\alpha \in (0, 1]$
    • a simple hyperparameter $\large \alpha$ that regulates the width of the network
    • multiplied with the number of channels at each layer
  2. Resolution Multiplier
    • $\rho \in (0, 1]$
    • adjusts the resolution (size) of the input
    • affects the size of every feature map (the representation)
    • no change in the number of parameters
  • A faster but weaker model can be made by simply adjusting these two hyperparameters; setting $\large \alpha$ and $\large \rho$ to 1 recovers the baseline (a small sketch of the effect follows).
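
A small sketch of how the two multipliers would scale the cost of a depthwise separable layer (an assumed illustration; the channel-rounding used in the actual networks is ignored here):

def dw_separable_cost(H, W, C, F, K, alpha=1.0, rho=1.0):
    # alpha thins the channels, rho shrinks the input (and hence feature map) resolution
    H, W = int(rho * H), int(rho * W)
    C, F = int(alpha * C), int(alpha * F)
    return H * W * K * K * C + H * W * C * F

print(dw_separable_cost(14, 14, 144, 144, 3))                        # baseline (alpha = rho = 1)
print(dw_separable_cost(14, 14, 144, 144, 3, alpha=0.5, rho=0.75))   # thinner and lower-resolution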


  


MobileNet Architecture


  


MobileNet V1 Comparison to Other Models


  


  • Achieves higher accuracy with far fewer parameters, indicating efficient use of resources in the MobileNet architecture.




MobileNet V2


ReLU Causes Manifold Collapse


  • While ReLU is effective in promoting non-linearity and alleviating the vanishing gradient problem, it discards the information encoded in negative values by setting every non-positive activation to zero.

  • This loss becomes significant when ReLU is applied to data that has been compressed in dimensionality.

  • Manifold in a Low-Dimensional Subspace of the Activation Space

    • Assuming that the manifold of interest in a neural network is embedded in a low-dimensional subspace, ReLU can preserve that input manifold because, wherever it does not zero the input, it acts as a linear transformation (multiply by 1 if positive, by 0 if negative).
    • But this preservation is only possible if the dimensionality of the activation space is much higher than that of the manifold. Otherwise, deleting all of the information embedded in negative values with ReLU results in significant distortion of the manifold (a collapsed manifold).

   Figure 1. ReLU transformation of low-dimensional manifolds into n-dimensional output

   

  • Input : initial input in which a low-dimensional (2-D) manifold is embedded
  • output dimension n increases from 2 to 30
  • Operations
    1. embed the input into an n-dimensional space with a random matrix $\large T$ (2 x n)
    2. apply the ReLU transformation
    3. restore the data to its original dimensionality with $\large T^{-1}$
    4. plot the data to see whether the ReLU transformation preserved the manifold or not
  • The plots show that expanding into a higher dimensionality before applying ReLU preserves the input manifold better. In other words, an input manifold embedded in a much lower-dimensional subspace of the activation space can be preserved through the ReLU transformation (a rough reproduction of the experiment is sketched below).
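
A rough numpy reproduction of this experiment (an assumed sketch, not the authors' code; the spiral input and the use of the pseudo-inverse for $T^{-1}$ are my choices):

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 3 * np.pi, 500)
x = np.stack([theta * np.cos(theta), theta * np.sin(theta)], axis=1)   # 500 x 2 spiral manifold

for n in (2, 3, 5, 15, 30):
    T = np.random.randn(2, n)           # random 2 x n embedding matrix
    z = np.maximum(x @ T, 0)            # ReLU in the n-dimensional activation space
    x_rec = z @ np.linalg.pinv(T)       # restore to 2-D with the pseudo-inverse of T
    plt.plot(x_rec[:, 0], x_rec[:, 1], label=f"n={n}")

plt.legend()
plt.show()                              # larger n preserves the spiral much better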


Inverted Residual Block


  • To tackle this issue, the authors designed a block called the “Inverted Residual Bottleneck”: the input first passes through a 1x1 expansion layer before the ReLU activation, is then spatially filtered with a lightweight 3x3 depthwise convolution, and is finally projected back down by a 1x1 pointwise convolution with linear activation (no ReLU). The skip connection is added to the output of this pointwise projection, and no ReLU is applied after the addition. This is the exact opposite of how a typical bottleneck structure operates, which narrows first and expands back at the end.


   Figure 3. Comparison between original residual block in V1 and inverted residual block in V2

   


   


  • This enables implementing non-linearity with ReLU without losing the input manifold information.

  • The linear activation adopted here plays a crucial role, as it prevents the non-linearity from destroying too much information, resulting in better model performance.

   Figure 6. The impact of non-linearities and various types of shortcut connections

   


   Table 1. Architecture of the Inverted Bottleneck Residual block used in MobileNet V2

   

  • Can see that MobileNet V2 has a clear computational advantage over the other networks, with the lowest maximum memory requirement.

  • ReLU6 = min(max(x, 0), 6)
    • clamps the maximum value at 6
    • increases non-linearity
    • improves computational stability and allows efficient use of memory resources
  • The residual connection is only used for blocks whose input shape equals the output shape (a minimal sketch of the block follows).
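
A minimal sketch of a V2-style inverted residual block under these rules (an assumed simplification of the paper's block, using ReLU6 and a linear 1x1 projection):

import torch.nn as nn

class InvertedResidualV2(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_channels * expand_ratio
        # skip connection only when spatial size and channel count are unchanged
        self.use_skip = stride == 1 and in_channels == out_channels
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 1, bias=False),       # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                # 3x3 depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_channels, 1, bias=False),      # linear 1x1 projection (no ReLU)
            nn.BatchNorm2d(out_channels))

    def forward(self, x):
        return x + self.block(x) if self.use_skip else self.block(x)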


Computational Advantage of MobileNet V2 over V1


  • Computational cost of a depthwise (DW) vs pointwise (PW) layer
    • example : a 3x3 convolution applied to a 64-channel input producing 128 output feature maps
    • DW : 3 x 3 x 64 multiplications per output position
    • PW : 1 x 1 x 64 x 128 multiplications per output position
    • Typically, PW requires far more computation than DW.
    • The new inverted residual block takes advantage of this: the cheap DW layer runs on the expanded (wide) representation, while the expensive PW layers map to and from the narrow bottleneck.
  • Comparison : V1 vs V2
  • Example

   

  • Inverted Residual Block (V2) : (1 x 1 x 24 x 144) + (3 x 3 x 144) + (1 x 1 x 144 x 24) = 8,208
  • Residual Block (V1) : (3 x 3 x 144) + (144 x 144) = 22,032

  • The inverted residual block of MobileNet V2 needs roughly 1/3 of the computation required by the MobileNet V1 block, as re-checked below.
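
The two per-output-position multiplication counts from the example above can be re-derived directly (the channel sizes 24 and 144 come from the text; this only checks the arithmetic):

# V2 inverted residual block: 1x1 expansion, 3x3 depthwise, 1x1 projection
pw_expand  = 1 * 1 * 24 * 144    # 3,456
dw         = 3 * 3 * 144         # 1,296
pw_project = 1 * 1 * 144 * 24    # 3,456
print(pw_expand + dw + pw_project)    # 8,208

# V1-style block on 144 channels: 3x3 depthwise followed by 144 -> 144 pointwise
print(3 * 3 * 144 + 144 * 144)        # 22,032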

   Table 3. The max number of channels/memory (in Kb) for different network architectures

     

   Table 4. Performance on ImageNet for different architectures

     


MobileNet V2 Architecture





MobileNet V3


Adding Squeeze and Excitation Layer

  • Squeeze-Excitation (SE) module
    • learns the interdependent importance between features (channels)

    • Squeeze : global average pooling (GAP) into 1x1xC (channels)

    • Excitation : two FC layers that produce a stronger representation for important features or patterns

   Squeeze-Excitation

     

   MobileNet V2 + SE module

     


  • The SE module is interposed between the 3x3 depthwise convolution and the 1x1 pointwise convolution.


Use of New Non-Linearities : Hard Sigmoid & Hard Swish


  • Sigmoid = $\large \frac{1}{1\,+\,e^{-x}}$

  • Swish = $\large x\times\,sigmoid(x)$

  • Hard Sigmoid = $\large \frac{ReLU6\,(x\,+\,3)}{6}$

  • Hard Swish = $\large x\times\, \frac{ReLU6\,(x\,+\,3)}{6}$
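
A quick numeric comparison of the smooth and hard variants using torch.nn.functional.relu6 (the sample values are arbitrary):

import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, steps=7)

sigmoid      = torch.sigmoid(x)
hard_sigmoid = F.relu6(x + 3) / 6
swish        = x * torch.sigmoid(x)
hard_swish   = x * F.relu6(x + 3) / 6

# each hard variant tracks its smooth counterpart closely on this range
print(torch.stack([x, sigmoid, hard_sigmoid, swish, hard_swish], dim=1))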

  Figure 6

  

  • computationally stable, with no possible precision error caused by different implementations of an approximate sigmoid function

  • ReLU6, being a piecewise linear function unlike sigmoid, significantly reduces memory use with no discernible difference in accuracy


Redesigning Expensive Layers


  Figure 5. : Comparison of the original last stage and the efficient last stage

  


  • Last feature extraction layer
    • original : a 1x1 conv - BN - H-swish stage that expands the features from 320 to 1280 on a 7x7-resolution input
      • computations : 7 x 7 x 320 x 1280
      • greatly increases the computation and thus the latency
    • modified : move the feature extraction stage past the final average pooling layer
      • applies 1x1 conv - H-swish to a 1x1x960 input (instead of the 7x7 spatial resolution) and creates an output with 1280 features
      • computations : 1 x 1 x 960 x 1280
    • The efficient last stage reduces latency by 7 milliseconds (11% of the running time) and reduces the number of operations by 30 million MAdds, with almost no loss of accuracy (a small sketch of the two orderings follows this list).
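
A small sketch of the two orderings (an assumed illustration; the 960 -> 1280 channel sizes follow the modified stage described above, and the only point is where the average pooling sits):

import torch
import torch.nn as nn

x = torch.randn(1, 960, 7, 7)

# original ordering: 1x1 conv on the full 7x7 map, then average pooling
original  = nn.Sequential(nn.Conv2d(960, 1280, 1), nn.Hardswish(), nn.AdaptiveAvgPool2d(1))

# efficient ordering: average pooling first, so the 1x1 conv runs on a 1x1x960 tensor
efficient = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(960, 1280, 1), nn.Hardswish())

print(original(x).shape, efficient(x).shape)   # both torch.Size([1, 1280, 1, 1])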


  • Initial layer : replace ReLU6 with the H-swish non-linearity

    • the different non-linearity reduces the redundancy in the first set of filters

    • As a result, the number of filters can be reduced to 16 while maintaining the same accuracy as 32 filters with ReLU or swish. This saves an additional 2 milliseconds and 10 million MAdds.


MobileNet V3 Architecture (Small & Large)


  • Found by 1. platform-aware NAS for block-wise search and 2. NetAdapt for layer-wise search
    • select the proposal that maximizes the ratio of accuracy change to latency change (maximizes the trade-off slope : $Δ$ accuracy / $Δ$ latency)

    • proposal types

      • reduce the size of any expansion layer
      • reduce the bottleneck in all blocks that share the same bottleneck size, so the residual connections are kept

  Table 1. : Specification for MobileNetV3-Large

  


  Table 2. : Specification for MobileNetV3-Small

  


  • All channels are divisible by 8 for computational efficiency




Implementation with PyTorch


import torch
import torch.nn as nn
from torchinfo import summary   # assumed source of the summary() calls used below


def get_divisible(v, divider):
    # round v to the nearest multiple of `divider`, but never drop more than ~10% below v
    new_v = (int(v + divider/2)//divider)*divider
    if new_v < 0.9*v:
        new_v += divider
    return max(new_v, divider)
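
A quick check of the rounding helper (the input values are arbitrary examples):

print(get_divisible(144 * 0.75, 8))   # 108 rounds up to 112
print(get_divisible(35, 8))           # 35 rounds down to 32 (still within 10% of the original)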


class Squeeze_Excite(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d((1,1))
        self.excitation = nn.Sequential(nn.Linear(channels, channels//r),
                                        nn.ReLU(inplace=True),
                                        nn.Linear(channels//r, channels),
                                        nn.Hardsigmoid(inplace=True))

    def forward(self, x):
        se = self.squeeze(x)
        se = se.reshape(se.shape[0], -1)
        se = self.excitation(se)
        # scale each channel by its learned weight (avoid the in-place multiply on the input)
        return x * se.unsqueeze(2).unsqueeze(3)


class InvResBlock(nn.Module):
    def __init__(self, in_channels, k, exp_channels, out_channels, se, nl, s):
        super().__init__()
        self.se = se
        self.skip_connect = True if s == 1 and in_channels == out_channels else False
        non_linearity = nn.ReLU(inplace=True) if nl=='relu' else nn.Hardswish(inplace=True)

        layers = []
        if in_channels != exp_channels:
            layers += [nn.Sequential(nn.Conv2d(in_channels, exp_channels, 1, bias=False),
                                     nn.BatchNorm2d(exp_channels, momentum=0.99),
                                     non_linearity)]

        layers += [nn.Sequential(nn.Conv2d(exp_channels, exp_channels, k, stride=s, padding=(k-1)//2, groups=exp_channels, bias=False),
                                 nn.BatchNorm2d(exp_channels, momentum=0.99),
                                 non_linearity)]
        if self.se:
            layers += [Squeeze_Excite(exp_channels)]
        layers += [nn.Sequential(nn.Conv2d(exp_channels, out_channels, 1, bias=False),
                                 nn.BatchNorm2d(out_channels, momentum=0.99))]

        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        residual_x = self.layers(x)
        if self.skip_connect:
            return x + residual_x
        return residual_x


class MobileNetV3(nn.Module):
    def __init__(self, in_channels, cfgs, num_classes, zero_init_residual=True, width_exp=1.):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1, stride=2, bias=False),
                                    nn.BatchNorm2d(16, momentum=0.99),
                                    nn.Hardswish(inplace=True))
        in_channels = 16
        layers = []
        fc_out = cfgs[-1]
        for k, exp_channels, out_channels, se, nl, s in cfgs[:-2]:
            layers += [InvResBlock(in_channels, k, get_divisible(exp_channels, 8), get_divisible(out_channels*width_exp, 8), se, nl, s)]
            in_channels = get_divisible(out_channels*width_exp, 8)

        self.blocks = nn.Sequential(*layers)

        k, _, out_channels, self.se, nl, s = cfgs[-2]
        non_linearity = nn.ReLU(inplace=True) if nl=='relu' else nn.Hardswish(inplace=True)

        if self.se:
            self.se_block = Squeeze_Excite(out_channels)
        self.last_conv = nn.Sequential(nn.Conv2d(in_channels, out_channels, k, bias=False),
                                       nn.BatchNorm2d(out_channels),
                                       non_linearity)
        self.gap = nn.AdaptiveAvgPool2d((1,1))
        self.classifier = nn.Sequential(nn.Linear(out_channels, fc_out),
                                        nn.Hardswish(inplace=True),
                                        nn.Dropout(p=0.2, inplace=True),
                                        nn.Linear(fc_out, num_classes))

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 1e-2)
                nn.init.zeros_(m.bias)


    def forward(self, x):
        x = self.conv1(x)
        x = self.blocks(x)
        x = self.last_conv(x)
        if self.se:
            x = self.se_block(x)
        x = self.gap(x)
        x = torch.flatten(x, start_dim=1)
        out = self.classifier(x)
        return out


class Config():
    def __init__(self):
        pass

    def large(self):
        # kernel, exp, out, se, nl, s
        cfgs = [(3, 16, 16, False, 'relu', 1),
                (3, 64, 24, False, 'relu', 2),
                (3, 72, 24, False, 'relu', 1),
                (5, 72, 40, True, 'relu', 2),
                (5, 120, 40, True, 'relu', 1),
                (5, 120, 40, True, 'relu', 1),
                (3, 240, 80, False, 'hs', 2),
                (3, 200, 80, False, 'hs', 1),
                (3, 184, 80, False, 'hs', 1),
                (3, 184, 80, False, 'hs', 1),
                (3, 480, 112, True, 'hs', 1),
                (3, 672, 112, True, 'hs', 1),
                (5, 672, 160, True, 'hs', 2),
                (5, 960, 160, True, 'hs', 1),
                (5, 960, 160, True, 'hs', 1),
                (1, None, 960, False, 'hs', 1),
                1280]

        return cfgs

    def small(self):
        cfgs = [(3, 16, 16, True, 'relu', 2),
                (3, 72, 24, False, 'relu', 2),
                (3, 88, 24, False, 'relu', 1),
                (5, 96, 40, True, 'hs', 2),
                (5, 240, 40, True, 'hs', 1),
                (5, 240, 40, True, 'hs', 1),
                (5, 120, 48, True, 'hs', 1),
                (5, 144, 48, True, 'hs', 1),
                (5, 288, 96, True, 'hs', 2),
                (5, 576, 96, True, 'hs', 1),
                (5, 576, 96, True, 'hs', 1),
                (1, None, 576, True, 'hs', 1),
                1024]

        return cfgs


Model Summary


configs = Config()
cfgs_small = configs.small()
cfgs_large = configs.large()


large_model = MobileNetV3(3, cfgs_large, 1000)
summary(large_model, input_size=(2, 3, 224, 224), device='cpu')


====================================================================================================
Layer (type:depth-idx)                             Output Shape              Param #
====================================================================================================
MobileNetV3                                        [2, 1000]                 --
├─Sequential: 1-1                                  [2, 16, 112, 112]         --
│    └─Conv2d: 2-1                                 [2, 16, 112, 112]         432
│    └─BatchNorm2d: 2-2                            [2, 16, 112, 112]         32
│    └─Hardswish: 2-3                              [2, 16, 112, 112]         --
├─Sequential: 1-2                                  [2, 160, 7, 7]            --
│    └─InvResBlock: 2-4                            [2, 16, 112, 112]         --
│    │    └─Sequential: 3-1                        [2, 16, 112, 112]         464
│    └─InvResBlock: 2-5                            [2, 24, 56, 56]           --
│    │    └─Sequential: 3-2                        [2, 24, 56, 56]           3,440
│    └─InvResBlock: 2-6                            [2, 24, 56, 56]           --
│    │    └─Sequential: 3-3                        [2, 24, 56, 56]           4,440
│    └─InvResBlock: 2-7                            [2, 40, 28, 28]           --
│    │    └─Sequential: 3-4                        [2, 40, 28, 28]           9,458
│    └─InvResBlock: 2-8                            [2, 40, 28, 28]           --
│    │    └─Sequential: 3-5                        [2, 40, 28, 28]           20,510
│    └─InvResBlock: 2-9                            [2, 40, 28, 28]           --
│    │    └─Sequential: 3-6                        [2, 40, 28, 28]           20,510
│    └─InvResBlock: 2-10                           [2, 80, 14, 14]           --
│    │    └─Sequential: 3-7                        [2, 80, 14, 14]           32,080
│    └─InvResBlock: 2-11                           [2, 80, 14, 14]           --
│    │    └─Sequential: 3-8                        [2, 80, 14, 14]           34,760
│    └─InvResBlock: 2-12                           [2, 80, 14, 14]           --
│    │    └─Sequential: 3-9                        [2, 80, 14, 14]           31,992
│    └─InvResBlock: 2-13                           [2, 80, 14, 14]           --
│    │    └─Sequential: 3-10                       [2, 80, 14, 14]           31,992
│    └─InvResBlock: 2-14                           [2, 112, 14, 14]          --
│    │    └─Sequential: 3-11                       [2, 112, 14, 14]          214,424
│    └─InvResBlock: 2-15                           [2, 112, 14, 14]          --
│    │    └─Sequential: 3-12                       [2, 112, 14, 14]          386,120
│    └─InvResBlock: 2-16                           [2, 160, 7, 7]            --
│    │    └─Sequential: 3-13                       [2, 160, 7, 7]            429,224
│    └─InvResBlock: 2-17                           [2, 160, 7, 7]            --
│    │    └─Sequential: 3-14                       [2, 160, 7, 7]            797,360
│    └─InvResBlock: 2-18                           [2, 160, 7, 7]            --
│    │    └─Sequential: 3-15                       [2, 160, 7, 7]            797,360
├─Sequential: 1-3                                  [2, 960, 7, 7]            --
│    └─Conv2d: 2-19                                [2, 960, 7, 7]            153,600
│    └─BatchNorm2d: 2-20                           [2, 960, 7, 7]            1,920
│    └─Hardswish: 2-21                             [2, 960, 7, 7]            --
├─AdaptiveAvgPool2d: 1-4                           [2, 960, 1, 1]            --
├─Sequential: 1-5                                  [2, 1000]                 --
│    └─Linear: 2-22                                [2, 1280]                 1,230,080
│    └─Hardswish: 2-23                             [2, 1280]                 --
│    └─Dropout: 2-24                               [2, 1280]                 --
│    └─Linear: 2-25                                [2, 1000]                 1,281,000
====================================================================================================
Total params: 5,481,198
Trainable params: 5,481,198
Non-trainable params: 0
Total mult-adds (M): 433.24
====================================================================================================
Input size (MB): 1.20
Forward/backward pass size (MB): 140.91
Params size (MB): 21.92
Estimated Total Size (MB): 164.04
====================================================================================================


small_model = MobileNetV3(3, cfgs_small, 1000)
summary(small_model, input_size=(2, 3, 224, 224), device='cpu')


====================================================================================================
Layer (type:depth-idx)                             Output Shape              Param #
====================================================================================================
MobileNetV3                                        [2, 1000]                 --
├─Sequential: 1-1                                  [2, 16, 112, 112]         --
│    └─Conv2d: 2-1                                 [2, 16, 112, 112]         432
│    └─BatchNorm2d: 2-2                            [2, 16, 112, 112]         32
│    └─Hardswish: 2-3                              [2, 16, 112, 112]         --
├─Sequential: 1-2                                  [2, 96, 7, 7]             --
│    └─InvResBlock: 2-4                            [2, 16, 56, 56]           --
│    │    └─Sequential: 3-1                        [2, 16, 56, 56]           612
│    └─InvResBlock: 2-5                            [2, 24, 28, 28]           --
│    │    └─Sequential: 3-2                        [2, 24, 28, 28]           3,864
│    └─InvResBlock: 2-6                            [2, 24, 28, 28]           --
│    │    └─Sequential: 3-3                        [2, 24, 28, 28]           5,416
│    └─InvResBlock: 2-7                            [2, 40, 14, 14]           --
│    │    └─Sequential: 3-4                        [2, 40, 14, 14]           13,736
│    └─InvResBlock: 2-8                            [2, 40, 14, 14]           --
│    │    └─Sequential: 3-5                        [2, 40, 14, 14]           55,340
│    └─InvResBlock: 2-9                            [2, 40, 14, 14]           --
│    │    └─Sequential: 3-6                        [2, 40, 14, 14]           55,340
│    └─InvResBlock: 2-10                           [2, 48, 14, 14]           --
│    │    └─Sequential: 3-7                        [2, 48, 14, 14]           21,486
│    └─InvResBlock: 2-11                           [2, 48, 14, 14]           --
│    │    └─Sequential: 3-8                        [2, 48, 14, 14]           28,644
│    └─InvResBlock: 2-12                           [2, 96, 7, 7]             --
│    │    └─Sequential: 3-9                        [2, 96, 7, 7]             91,848
│    └─InvResBlock: 2-13                           [2, 96, 7, 7]             --
│    │    └─Sequential: 3-10                       [2, 96, 7, 7]             294,096
│    └─InvResBlock: 2-14                           [2, 96, 7, 7]             --
│    │    └─Sequential: 3-11                       [2, 96, 7, 7]             294,096
├─Sequential: 1-3                                  [2, 576, 7, 7]            --
│    └─Conv2d: 2-15                                [2, 576, 7, 7]            55,296
│    └─BatchNorm2d: 2-16                           [2, 576, 7, 7]            1,152
│    └─Hardswish: 2-17                             [2, 576, 7, 7]            --
├─Squeeze_Excite: 1-4                              [2, 576, 7, 7]            --
│    └─AdaptiveAvgPool2d: 2-18                     [2, 576, 1, 1]            --
│    └─Sequential: 2-19                            [2, 576]                  --
│    │    └─Linear: 3-12                           [2, 144]                  83,088
│    │    └─ReLU: 3-13                             [2, 144]                  --
│    │    └─Linear: 3-14                           [2, 576]                  83,520
│    │    └─Hardsigmoid: 3-15                      [2, 576]                  --
├─AdaptiveAvgPool2d: 1-5                           [2, 576, 1, 1]            --
├─Sequential: 1-6                                  [2, 1000]                 --
│    └─Linear: 2-20                                [2, 1024]                 590,848
│    └─Hardswish: 2-21                             [2, 1024]                 --
│    └─Dropout: 2-22                               [2, 1024]                 --
│    └─Linear: 2-23                                [2, 1000]                 1,025,000
====================================================================================================
Total params: 2,703,846
Trainable params: 2,703,846
Non-trainable params: 0
Total mult-adds (M): 113.38
====================================================================================================
Input size (MB): 1.20
Forward/backward pass size (MB): 45.30
Params size (MB): 10.82
Estimated Total Size (MB): 57.32
====================================================================================================