[Paper Review & Implementation] MobileNet V1, V2, V3 (2017 - 2019)
Outlines
References
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Andrew G. Howard et al., 2017
- MobileNetV2: Inverted Residuals and Linear Bottlenecks, Mark Sandler et al., 2018
- Searching for MobileNetV3, Andrew Howard et al., 2019
- https://hyukppen.modoo.at/?link=5db82s6p
MobileNet V1
- For resource-constrained environments such as mobile devices, it is very important to build computationally efficient networks.
- MobileNet V1 trades off accuracy against computational cost efficiently by adopting depthwise separable convolutions.
Depthwise Separable Convolution
- Depthwise Convolution
- Identical to a grouped convolution whose number of groups equals the number of input channels.
- Performs only spatial convolution: each input channel gets its own single filter, and the resulting feature maps are concatenated.
- Each feature map therefore contains the spatial representation of exactly one channel.
- This reduces network complexity, as each filter is only responsible for convolving with its corresponding input channel.
- Pointwise (1x1) Convolution
- Uses a 1x1 convolution filter to perform channel-wise mixing, linearly combining the input channels into a new set of feature maps.
- The two convolutions combined perform roughly the same operation as a standard convolution with no significant decrease in accuracy, but at a dramatically reduced computational cost.
Comparison of Computational Cost
- H, W : input shape per channel
- C : the number of input channels
- F : the number of output features
- K : kernel size of each filter
- standard convolution
- H x W x K x K x C x F
- depthwise separable convolution
- H x W x K x K x C + H x W x C x F
- (H x W x K x K x C + H x W x C x F) / (H x W x K x K x C x F)
- = 1/F + 1/K^2 (verified in the sketch below)
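As a quick check of the cost ratio above, here is a minimal sketch where H, W, C, F and K are arbitrary example values:
H, W, C, F, K = 14, 14, 256, 512, 3
standard  = H * W * K * K * C * F               # standard convolution
separable = H * W * K * K * C + H * W * C * F   # depthwise + pointwise
print(separable / standard)   # ~0.113
print(1 / F + 1 / K ** 2)     # 1/F + 1/K^2 gives the same value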
Trade-Off : Accuracy vs Complexity
- Width Multipliers
- $\alpha \in (0, 1]$
- simple hyperparameter $\large \alpha$ to regulate the width of the networks
- multiplied by the number of channels at each layer
- Resolution Multipliers
- $\rho \in (0, 1]$
- adjust the resolution (size of input)
- reduces the spatial resolution of every internal feature map, and thus the computation
- no change in the number of parameters
- These two hyperparameters make it easy to build faster but weaker models. Setting $\large \alpha$ and $\large \rho$ to 1 recovers the baseline MobileNet (see the sketch below).
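A minimal sketch of a V1-style depthwise separable block with the width multiplier applied to its channel counts; the channel sizes and $\large \alpha$ value are illustrative, and the resolution multiplier $\large \rho$ simply corresponds to feeding a smaller input:
import torch
import torch.nn as nn

def ds_conv(in_ch, out_ch, stride=1, alpha=1.0):
    # the width multiplier thins both the input and output channel counts
    in_ch, out_ch = int(in_ch * alpha), int(out_ch * alpha)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                                         # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

layer = ds_conv(64, 128, alpha=0.5)   # 64 -> 32 and 128 -> 64 channels
x = torch.randn(1, 32, 64, 64)        # with rho < 1 this feature map would simply be smaller
print(layer(x).shape)                 # torch.Size([1, 64, 64, 64])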
MobileNet Architecture
MobileNet V1 Comparison to Other Models
- Achieves higher accuracy with far fewer parameters, indicating efficient use of resources in the MobileNet architecture.
MobileNet V2
ReLU Causes Manifold Collapse
- While ReLU is effective at introducing non-linearity and alleviating the vanishing gradient problem, it discards the information encoded in negative values by setting all non-positive activations to zero.
- This loss becomes significant when ReLU is applied to data that has already been compressed to a low dimensionality.
- Manifold in a Low-Dimensional Subspace of the Activation Space
- Assuming the manifold of interest in a neural network is embedded in a low-dimensional subspace of the activation space, ReLU can preserve that input manifold, since on the surviving coordinates it acts as a linear transformation (multiply by 1 if positive, by 0 if negative).
- However, this preservation is only possible when the dimensionality of the activation space is much higher than that of the manifold. Otherwise, deleting the information embedded in the negative values results in significant distortion of the manifold (manifold collapse).
Figure 1. ReLU transformation of low-dimensional manifolds into n-dimensional output
- Input : initial input where low-dimensional manifold is embedded
- output dimension ‘n’ increases from 2 to 30
- Operations
- embed the input into an n-dimensional space with a random matrix $\large T$ (2 x n)
- followed by a ReLU transformation
- restore the data to its original dimensionality with $\large T^{-1}$ (pseudo-inverse)
- plot the data to see whether the ReLU transformation preserved the manifold or not
- We can observe that expanding into a higher dimensionality before applying ReLU preserves the input manifold better. This means an input manifold embedded in a much lower-dimensional subspace of the activation space can survive the ReLU transformation (reproduced in the sketch below).
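A minimal sketch reproducing this experiment (not the paper's original code): embed a 2-D spiral into n dimensions with a random matrix $\large T$, apply ReLU, and project back with the pseudo-inverse.
import numpy as np

def relu_roundtrip(points_2d, n, seed=0):
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(2, n))            # random embedding: 2 -> n dimensions
    embedded = points_2d @ T
    activated = np.maximum(embedded, 0.0)  # ReLU zeroes out every negative coordinate
    return activated @ np.linalg.pinv(T)   # project back to 2-D

t = np.linspace(0, 4 * np.pi, 300)
spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)   # the input manifold

for n in (2, 5, 15, 30):
    recovered = relu_roundtrip(spiral, n)
    err = np.mean(np.linalg.norm(recovered - spiral, axis=1))
    print(f"n={n:2d}  mean reconstruction error {err:.3f}")  # error generally shrinks as n grows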
Inverted Residual Block
- To tackle this issue, the authors designed a block called the "inverted residual bottleneck": the input first passes through a 1x1 expansion layer before the ReLU activation, is then spatially filtered with a lightweight 3x3 depthwise convolution, and is finally projected back by a 1x1 pointwise convolution with a linear activation (no ReLU). The skip connection is added to the output of this pointwise projection, and no ReLU is applied after the addition. This is the exact opposite of the classic residual bottleneck (as in ResNet), which narrows and then re-expands the channels.
Figure 3. Comparison between the classic residual block and the inverted residual block used in V2
- This enables the non-linearity of ReLU to be used without losing the input manifold information.
- The linear activation adopted in the projection layer plays a crucial role: it prevents the non-linearity from destroying too much information, resulting in better model performance.
Figure 6. The impact of non-linearities and various types of shortcut connections
Table 1. Architecture of the inverted bottleneck residual block used in MobileNet V2
- MobileNet V2 has a clear computational advantage over other networks, requiring the lowest maximum amount of memory.
- ReLU6 = min(max(x, 0), 6)
- clamps the maximum activation value at 6
- adds an extra non-linearity (the upper clamp)
- improves numerical stability and allows efficient use of memory, especially with low-precision arithmetic
- Residual connections are only used in blocks whose input shape equals the output shape (see the sketch after this list).
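A minimal sketch of a V2-style inverted residual block following these rules; the class name, defaults, and expansion factor of 6 are illustrative choices, not the official implementation:
import torch
import torch.nn as nn

class InvertedResidualV2(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion: ReLU6 is applied in the wide space, so little information is lost
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution for spatial filtering
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection back to the narrow bottleneck (no ReLU here)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

block = InvertedResidualV2(24, 24)                      # stride 1, matching channels -> skip connection kept
print(block(torch.randn(1, 24, 56, 56)).shape)          # torch.Size([1, 24, 56, 56])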
Computational Advantage of MobileNet V2 over V1
- Computational cost of a depthwise (DW) vs a pointwise (PW) layer
- example : a 3x3 convolution on a 64-channel input producing 128 output feature maps (cost per spatial location)
- DW : 3 x 3 x 64
- PW : 1 x 1 x 64 x 128
- Typically, the PW layer requires far more computation than the DW layer.
- The inverted residual block exploits this by widening the cheap DW layer while keeping the expensive PW layers attached to the narrow bottleneck channels.
- Comparison : V1 vs V2
- Example
- Inverted Residual Block (V2) : (1 x 1 x 24 x 144) + (3 x 3 x 144) + (1 x 1 x 144 x 24) = 8,208
- Depthwise Separable Block (V1, 144 channels throughout) : (3 x 3 x 144) + (1 x 1 x 144 x 144) = 22,032
- The inverted residual block of MobileNet V2 therefore requires roughly one third of the computation of the corresponding MobileNet V1 block, as checked in the snippet below.
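A quick check of the per-spatial-location multiply-add counts quoted above (24 and 144 are the example channel sizes from this comparison):
def dw(k, c):          # depthwise: one k x k filter per channel
    return k * k * c

def pw(c_in, c_out):   # pointwise: 1x1 convolution across channels
    return c_in * c_out

v2_block = pw(24, 144) + dw(3, 144) + pw(144, 24)   # expand -> depthwise -> project
v1_block = dw(3, 144) + pw(144, 144)                # depthwise -> pointwise at full width
print(v2_block, v1_block, round(v2_block / v1_block, 2))   # 8208 22032 0.37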
Table 3. The max number of channels/memory (in Kb) for different network architectures
Table 4. Performance on ImageNet for different architectures
MobileNet V2 Architecture
MobileNet V3
Adding Squeeze and Excitation Layer
- Squeeze-Excitation (SE) module
- can learn the interdependent importance of features (channels)
- Squeeze : global average pooling (GAP) into a 1x1xC vector (one value per channel)
- Excitation : a two-step FC layer that produces per-channel weights, giving a stronger representation to important features or patterns
- Squeeze-Excitation
MobileNet V2 + SE module
- The SE module is interposed between the 3x3 depthwise convolution and the 1x1 pointwise projection; a small functional sketch follows.
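A minimal functional sketch of the squeeze-excitation computation. The reduction ratio r = 4 follows MobileNetV3, the weights are random placeholders, and the sigmoid gate is the one from the original SE paper (V3 swaps in a hard sigmoid, covered next). The full nn.Module version appears in the implementation section below.
import torch
import torch.nn.functional as F

def squeeze_excite(x, w1, b1, w2, b2):
    s = x.mean(dim=(2, 3))                    # squeeze: global average pool -> (N, C)
    s = F.relu(F.linear(s, w1, b1))           # excitation step 1: reduce to C/r
    s = torch.sigmoid(F.linear(s, w2, b2))    # excitation step 2: per-channel weight in [0, 1]
    return x * s[:, :, None, None]            # reweight the feature maps channel-wise

x = torch.randn(2, 64, 14, 14)
r = 4
w1, b1 = torch.randn(64 // r, 64), torch.zeros(64 // r)
w2, b2 = torch.randn(64, 64 // r), torch.zeros(64)
print(squeeze_excite(x, w1, b1, w2, b2).shape)   # torch.Size([2, 64, 14, 14])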
Use of New Non-Linearities : Hard Sigmoid & Hard Swish
- Sigmoid = $\large \frac{1}{1\,+\,e^{-x}}$
- Swish = $\large x\times\,sigmoid(x)$
- Hard Sigmoid = $\large \frac{ReLU6\,(x\,+\,3)}{6}$
- Hard Swish = $\large x\times\, \frac{ReLU6\,(x\,+\,3)}{6}$
Figure 6. Sigmoid and swish non-linearities and their hard counterparts
- computationally stable : no precision error is introduced by different implementations of an approximate sigmoid function
- ReLU6, being a piece-wise linear function unlike the sigmoid, significantly reduces memory use with no discernible difference in accuracy (compared against PyTorch's built-ins in the snippet below)
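A small sketch checking the hard approximations above against PyTorch's built-in hardsigmoid / hardswish, which use the same ReLU6-based definition:
import torch
import torch.nn.functional as F

def hard_sigmoid(x):
    return F.relu6(x + 3) / 6

def hard_swish(x):
    return x * F.relu6(x + 3) / 6

x = torch.linspace(-6, 6, steps=101)
print(torch.allclose(hard_sigmoid(x), F.hardsigmoid(x)))   # True
print(torch.allclose(hard_swish(x), F.hardswish(x)))       # True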
Redesigning Expensive Layers
Figure 5. Comparison of the original last stage and the efficient last stage
- Last feature extraction layer
- original : a 1x1 conv - BN - H-swish stage that expands the features from 320 to 1280 channels on the 7x7-resolution input
- computations : 7 x 7 x 320 x 1280
- greatly increases the computation and thus the latency
- modified : move this feature-expansion stage past the final average pooling layer
- the 1x1 conv - H-swish is then applied to a 1x1x960 input (instead of a 7x7 feature map) and produces 1280 output features
- computations : 1 x 1 x 960 x 1280
- The efficient last stage reduces latency by 7 milliseconds, which is 11% of the running time, and removes 30 million MAdds with almost no loss of accuracy (see the rough arithmetic below).
- Replace ReLU6 with the H-swish non-linearity
- Using this different non-linearity helps reduce redundancy among the filters of the expensive initial layer
- As a result, the number of filters in the first convolution layer can be reduced to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 2 milliseconds and 10 million MAdds.
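Rough multiply-add counts for the 1x1 expansion in the two last-stage designs quoted above (this covers only the expansion layer, not the full 30M MAdds saving):
original  = 7 * 7 * 320 * 1280   # expansion applied on the 7x7 feature map
efficient = 1 * 1 * 960 * 1280   # expansion applied after global average pooling
print(f"{original / 1e6:.1f}M vs {efficient / 1e6:.1f}M MAdds")   # 20.1M vs 1.2M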
MobileNet V3 Architecture (Small & Large)
- Found using (1) platform-aware NAS for block-wise search and (2) NetAdapt for layer-wise search
- NetAdapt keeps the proposal that maximizes the ratio of accuracy change to latency change (maximizes the trade-off slope : $Δ$ accuracy / $Δ$ latency); a toy illustration of this selection rule follows the list
- proposal types
- reduce the size of any expansion layer
- reduce the bottleneck in all blocks that share the same bottleneck size, so that the residual connections are kept
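A toy illustration of the selection rule above: among candidate proposals, keep the one with the smallest accuracy loss per millisecond of latency saved. The candidate names and numbers here are made up for illustration only:
candidates = {
    "shrink expansion layer in block 12": (-0.20, -4.0),   # (accuracy change in %, latency change in ms)
    "shrink expansion layer in block 14": (-0.30, -5.0),
    "shrink shared 80-channel bottleneck": (-0.05, -2.0),
}
# maximize delta-accuracy per unit of latency saved
best = max(candidates, key=lambda k: candidates[k][0] / abs(candidates[k][1]))
print(best)   # shrink shared 80-channel bottleneck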
Table 1. Specification for MobileNetV3-Large
Table 2. Specification for MobileNetV3-Small
- All channels are divisible by 8 for computational efficiency
Implementation with PyTorch
import torch
import torch.nn as nn
from torchinfo import summary   # assumed source of the summary() calls below

def get_divisible(v, divider):
    # round v to the nearest multiple of divider, never dropping below 90% of v
    new_v = (int(v + divider/2)//divider)*divider
    if new_v < 0.9*v:
        new_v += divider
    return max(new_v, divider)
# Squeeze-Excitation: global average pool, two FC layers with reduction ratio r, hard-sigmoid gate
class Squeeze_Excite(nn.Module):
def __init__(self, channels, r=4):
super().__init__()
self.squeeze = nn.AdaptiveAvgPool2d((1,1))
self.excitation = nn.Sequential(nn.Linear(channels, channels//r),
nn.ReLU(inplace=True),
nn.Linear(channels//r, channels),
nn.Hardsigmoid(inplace=True))
def forward(self, x):
se = self.squeeze(x)
se = se.reshape(se.shape[0], -1)
se = self.excitation(se)
        # scale each channel by its learned weight (avoid the in-place multiply, which can break autograd)
        return x * se.unsqueeze(2).unsqueeze(3)
# Inverted residual block: 1x1 expansion -> k x k depthwise -> optional SE -> 1x1 linear projection
class InvResBlock(nn.Module):
def __init__(self, in_channels, k, exp_channels, out_channels, se, nl, s):
super().__init__()
self.se = se
self.skip_connect = True if s == 1 and in_channels == out_channels else False
non_linearity = nn.ReLU(inplace=True) if nl=='relu' else nn.Hardswish(inplace=True)
layers = []
if in_channels != exp_channels:
layers += [nn.Sequential(nn.Conv2d(in_channels, exp_channels, 1, bias=False),
nn.BatchNorm2d(exp_channels, momentum=0.99),
non_linearity)]
layers += [nn.Sequential(nn.Conv2d(exp_channels, exp_channels, k, stride=s, padding=(k-1)//2, groups=exp_channels, bias=False),
nn.BatchNorm2d(exp_channels, momentum=0.99),
non_linearity)]
if self.se:
layers += [Squeeze_Excite(exp_channels)]
layers += [nn.Sequential(nn.Conv2d(exp_channels, out_channels, 1, bias=False),
nn.BatchNorm2d(out_channels, momentum=0.99))]
self.layers = nn.Sequential(*layers)
def forward(self, x):
residual_x = self.layers(x)
if self.skip_connect:
return x + residual_x
return residual_x
# cfgs: (kernel, exp_channels, out_channels, se, nl, stride) tuples for the bottleneck blocks,
# followed by the last-conv spec and the classifier hidden size (see Config below)
class MobileNetV3(nn.Module):
def __init__(self, in_channels, cfgs, num_classes, zero_init_residual=True, width_exp=1.):
super().__init__()
self.conv1 = nn.Sequential(nn.Conv2d(in_channels, 16, 3, padding=1, stride=2, bias=False),
nn.BatchNorm2d(16, momentum=0.99),
nn.Hardswish(inplace=True))
in_channels = 16
layers = []
fc_out = cfgs[-1]
for k, exp_channels, out_channels, se, nl, s in cfgs[:-2]:
layers += [InvResBlock(in_channels, k, get_divisible(exp_channels, 8), get_divisible(out_channels*width_exp, 8), se, nl, s)]
in_channels = get_divisible(out_channels*width_exp, 8)
self.blocks = nn.Sequential(*layers)
k, _, out_channels, self.se, nl, s = cfgs[-2]
non_linearity = nn.ReLU(inplace=True) if nl=='relu' else nn.Hardswish(inplace=True)
if self.se:
self.se_block = Squeeze_Excite(out_channels)
self.last_conv = nn.Sequential(nn.Conv2d(in_channels, out_channels, k, bias=False),
nn.BatchNorm2d(out_channels),
non_linearity)
self.gap = nn.AdaptiveAvgPool2d((1,1))
self.classifier = nn.Sequential(nn.Linear(out_channels, fc_out),
nn.Hardswish(inplace=True),
nn.Dropout(p=0.2, inplace=True),
nn.Linear(fc_out, num_classes))
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode="fan_out")
if m.bias is not None:
nn.init.zeros_(m.bias)
elif isinstance(m, nn.BatchNorm2d):
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
elif isinstance(m, nn.Linear):
nn.init.normal_(m.weight, 0, 1e-2)
nn.init.zeros_(m.bias)
def forward(self, x):
x = self.conv1(x)
x = self.blocks(x)
x = self.last_conv(x)
if self.se:
x = self.se_block(x)
x = self.gap(x)
x = torch.flatten(x, start_dim=1)
out = self.classifier(x)
return out
class Config():
def __init__(self):
pass
def large(self):
# kernel, exp, out, se, nl, s
cfgs = [(3, 16, 16, False, 'relu', 1),
(3, 64, 24, False, 'relu', 2),
(3, 72, 24, False, 'relu', 1),
(5, 72, 40, True, 'relu', 2),
(5, 120, 40, True, 'relu', 1),
(5, 120, 40, True, 'relu', 1),
(3, 240, 80, False, 'hs', 2),
(3, 200, 80, False, 'hs', 1),
(3, 184, 80, False, 'hs', 1),
(3, 184, 80, False, 'hs', 1),
(3, 480, 112, True, 'hs', 1),
(3, 672, 112, True, 'hs', 1),
(5, 672, 160, True, 'hs', 2),
(5, 960, 160, True, 'hs', 1),
(5, 960, 160, True, 'hs', 1),
(1, None, 960, False, 'hs', 1),
1280]
return cfgs
def small(self):
cfgs = [(3, 16, 16, True, 'relu', 2),
(3, 72, 24, False, 'relu', 2),
(3, 88, 24, False, 'relu', 1),
(5, 96, 40, True, 'hs', 2),
(5, 240, 40, True, 'hs', 1),
(5, 240, 40, True, 'hs', 1),
(5, 120, 48, True, 'hs', 1),
(5, 144, 48, True, 'hs', 1),
(5, 288, 96, True, 'hs', 2),
(5, 576, 96, True, 'hs', 1),
(5, 576, 96, True, 'hs', 1),
(1, None, 576, True, 'hs', 1),
1024]
return cfgs
Model Summary
configs = Config()
cfgs_small = configs.small()
cfgs_large = configs.large()
large_model = MobileNetV3(3, cfgs_large, 1000)
summary(large_model, input_size=(2, 3, 224, 224), device='cpu')
====================================================================================================
Layer (type:depth-idx) Output Shape Param #
====================================================================================================
MobileNetV3 [2, 1000] --
├─Sequential: 1-1 [2, 16, 112, 112] --
│ └─Conv2d: 2-1 [2, 16, 112, 112] 432
│ └─BatchNorm2d: 2-2 [2, 16, 112, 112] 32
│ └─Hardswish: 2-3 [2, 16, 112, 112] --
├─Sequential: 1-2 [2, 160, 7, 7] --
│ └─InvResBlock: 2-4 [2, 16, 112, 112] --
│ │ └─Sequential: 3-1 [2, 16, 112, 112] 464
│ └─InvResBlock: 2-5 [2, 24, 56, 56] --
│ │ └─Sequential: 3-2 [2, 24, 56, 56] 3,440
│ └─InvResBlock: 2-6 [2, 24, 56, 56] --
│ │ └─Sequential: 3-3 [2, 24, 56, 56] 4,440
│ └─InvResBlock: 2-7 [2, 40, 28, 28] --
│ │ └─Sequential: 3-4 [2, 40, 28, 28] 9,458
│ └─InvResBlock: 2-8 [2, 40, 28, 28] --
│ │ └─Sequential: 3-5 [2, 40, 28, 28] 20,510
│ └─InvResBlock: 2-9 [2, 40, 28, 28] --
│ │ └─Sequential: 3-6 [2, 40, 28, 28] 20,510
│ └─InvResBlock: 2-10 [2, 80, 14, 14] --
│ │ └─Sequential: 3-7 [2, 80, 14, 14] 32,080
│ └─InvResBlock: 2-11 [2, 80, 14, 14] --
│ │ └─Sequential: 3-8 [2, 80, 14, 14] 34,760
│ └─InvResBlock: 2-12 [2, 80, 14, 14] --
│ │ └─Sequential: 3-9 [2, 80, 14, 14] 31,992
│ └─InvResBlock: 2-13 [2, 80, 14, 14] --
│ │ └─Sequential: 3-10 [2, 80, 14, 14] 31,992
│ └─InvResBlock: 2-14 [2, 112, 14, 14] --
│ │ └─Sequential: 3-11 [2, 112, 14, 14] 214,424
│ └─InvResBlock: 2-15 [2, 112, 14, 14] --
│ │ └─Sequential: 3-12 [2, 112, 14, 14] 386,120
│ └─InvResBlock: 2-16 [2, 160, 7, 7] --
│ │ └─Sequential: 3-13 [2, 160, 7, 7] 429,224
│ └─InvResBlock: 2-17 [2, 160, 7, 7] --
│ │ └─Sequential: 3-14 [2, 160, 7, 7] 797,360
│ └─InvResBlock: 2-18 [2, 160, 7, 7] --
│ │ └─Sequential: 3-15 [2, 160, 7, 7] 797,360
├─Sequential: 1-3 [2, 960, 7, 7] --
│ └─Conv2d: 2-19 [2, 960, 7, 7] 153,600
│ └─BatchNorm2d: 2-20 [2, 960, 7, 7] 1,920
│ └─Hardswish: 2-21 [2, 960, 7, 7] --
├─AdaptiveAvgPool2d: 1-4 [2, 960, 1, 1] --
├─Sequential: 1-5 [2, 1000] --
│ └─Linear: 2-22 [2, 1280] 1,230,080
│ └─Hardswish: 2-23 [2, 1280] --
│ └─Dropout: 2-24 [2, 1280] --
│ └─Linear: 2-25 [2, 1000] 1,281,000
====================================================================================================
Total params: 5,481,198
Trainable params: 5,481,198
Non-trainable params: 0
Total mult-adds (M): 433.24
====================================================================================================
Input size (MB): 1.20
Forward/backward pass size (MB): 140.91
Params size (MB): 21.92
Estimated Total Size (MB): 164.04
====================================================================================================
small_model = MobileNetV3(3, cfgs_small, 1000)
summary(small_model, input_size=(2, 3, 224, 224), device='cpu')
====================================================================================================
Layer (type:depth-idx) Output Shape Param #
====================================================================================================
MobileNetV3 [2, 1000] --
├─Sequential: 1-1 [2, 16, 112, 112] --
│ └─Conv2d: 2-1 [2, 16, 112, 112] 432
│ └─BatchNorm2d: 2-2 [2, 16, 112, 112] 32
│ └─Hardswish: 2-3 [2, 16, 112, 112] --
├─Sequential: 1-2 [2, 96, 7, 7] --
│ └─InvResBlock: 2-4 [2, 16, 56, 56] --
│ │ └─Sequential: 3-1 [2, 16, 56, 56] 612
│ └─InvResBlock: 2-5 [2, 24, 28, 28] --
│ │ └─Sequential: 3-2 [2, 24, 28, 28] 3,864
│ └─InvResBlock: 2-6 [2, 24, 28, 28] --
│ │ └─Sequential: 3-3 [2, 24, 28, 28] 5,416
│ └─InvResBlock: 2-7 [2, 40, 14, 14] --
│ │ └─Sequential: 3-4 [2, 40, 14, 14] 13,736
│ └─InvResBlock: 2-8 [2, 40, 14, 14] --
│ │ └─Sequential: 3-5 [2, 40, 14, 14] 55,340
│ └─InvResBlock: 2-9 [2, 40, 14, 14] --
│ │ └─Sequential: 3-6 [2, 40, 14, 14] 55,340
│ └─InvResBlock: 2-10 [2, 48, 14, 14] --
│ │ └─Sequential: 3-7 [2, 48, 14, 14] 21,486
│ └─InvResBlock: 2-11 [2, 48, 14, 14] --
│ │ └─Sequential: 3-8 [2, 48, 14, 14] 28,644
│ └─InvResBlock: 2-12 [2, 96, 7, 7] --
│ │ └─Sequential: 3-9 [2, 96, 7, 7] 91,848
│ └─InvResBlock: 2-13 [2, 96, 7, 7] --
│ │ └─Sequential: 3-10 [2, 96, 7, 7] 294,096
│ └─InvResBlock: 2-14 [2, 96, 7, 7] --
│ │ └─Sequential: 3-11 [2, 96, 7, 7] 294,096
├─Sequential: 1-3 [2, 576, 7, 7] --
│ └─Conv2d: 2-15 [2, 576, 7, 7] 55,296
│ └─BatchNorm2d: 2-16 [2, 576, 7, 7] 1,152
│ └─Hardswish: 2-17 [2, 576, 7, 7] --
├─Squeeze_Excite: 1-4 [2, 576, 7, 7] --
│ └─AdaptiveAvgPool2d: 2-18 [2, 576, 1, 1] --
│ └─Sequential: 2-19 [2, 576] --
│ │ └─Linear: 3-12 [2, 144] 83,088
│ │ └─ReLU: 3-13 [2, 144] --
│ │ └─Linear: 3-14 [2, 576] 83,520
│ │ └─Hardsigmoid: 3-15 [2, 576] --
├─AdaptiveAvgPool2d: 1-5 [2, 576, 1, 1] --
├─Sequential: 1-6 [2, 1000] --
│ └─Linear: 2-20 [2, 1024] 590,848
│ └─Hardswish: 2-21 [2, 1024] --
│ └─Dropout: 2-22 [2, 1024] --
│ └─Linear: 2-23 [2, 1000] 1,025,000
====================================================================================================
Total params: 2,703,846
Trainable params: 2,703,846
Non-trainable params: 0
Total mult-adds (M): 113.38
====================================================================================================
Input size (MB): 1.20
Forward/backward pass size (MB): 45.30
Params size (MB): 10.82
Estimated Total Size (MB): 57.32
====================================================================================================