[Paper Review & Implementation] Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet, 2015)
2023, May 22
References
- Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR 2015
- https://pytorch.org/hub/pytorch_vision_vgg/
- https://hyukppen.modoo.at/?link=5db82s6p
VGGnet Architecture
LRN (Local Response Normalization): does not actually improve performance (the paper's A-LRN configuration performs no better than plain configuration A)
- useful for unbounded activations (e.g. ReLU)
- damps outputs in neighborhoods with uniformly large responses, creating higher contrast in the activation map so that a distinctively large activation stands out within its neighborhood
- rarely used anymore; batch normalization is used instead (a small demo follows this list)
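For intuition (an aside, not from the paper; alpha is set far above PyTorch's 1e-4 default purely to make the effect visible), nn.LocalResponseNorm damps a channel much harder when its neighboring channels fire strongly too:

import torch
import torch.nn as nn

lrn = nn.LocalResponseNorm(size=5, alpha=1.0, beta=0.75, k=1.0)  # exaggerated alpha for visibility

x = torch.zeros(1, 8, 1, 1)
x[0, :, 0, 0] = torch.tensor([5., 5., 5., 5., 5., 0., 0., 5.])   # channels 0-4 fire together; channel 7 fires alone

y = lrn(x)
print(y[0, 2, 0, 0].item())  # ~0.43: damped hard, its whole neighborhood is large
print(y[0, 7, 0, 0].item())  # ~1.30: barely damped, its neighbors are silent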
Repeated 3x3 convolutions
- builds the deepest-possible network while keeping spatial focus: the 3x3 kernel is the smallest receptive field that still captures all directions (left, right, up, down), and stacking it avoids the representational bottleneck that extreme compression with large receptive fields can cause
- increases non-linearity: every extra 3x3 conv layer brings its own ReLU, so a stack of small convolutions yields a more discriminative decision function than a single large convolution
- saves computational resources: factorizing one large convolution into a stack of smaller ones keeps the receptive field while reducing the parameter count (parameters are shared across adjacent positions)
- e.g. instead of one 5x5 convolution, use two 3x3 convolutions with a non-linear activation in between; the parameter saving is worked out right after this list
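A minimal sketch of the saving (the channel width C = 64 is an arbitrary choice for illustration): with C input and C output channels, one 5x5 conv costs 25C^2 weights, while two stacked 3x3 convs cost 18C^2 and cover the same 5x5 receptive field.

import torch.nn as nn

C = 64  # arbitrary channel width, for illustration only

one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
two_3x3 = nn.Sequential(nn.Conv2d(C, C, 3, padding=1, bias=False),
                        nn.ReLU(),                                  # extra non-linearity in between
                        nn.Conv2d(C, C, 3, padding=1, bias=False))

print(sum(p.numel() for p in one_5x5.parameters()))  # 102400 = 25 * C^2
print(sum(p.numel() for p in two_3x3.parameters()))  # 73728  = 2 * 9 * C^2, ~28% fewer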
Implementation with PyTorch
import torch
import torch.nn as nn
from torchinfo import summary

# 'M' = 2x2 max-pool; an int n = 3x3 conv with n output channels;
# a tuple (n, k) = conv with n output channels and kernel size k (the 1x1 convs of configuration C)
configs = {'A' : [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
           'B' : [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
           'C' : [64, 64, 'M', 128, 128, 'M', 256, 256, (256,1), 'M', 512, 512, (512,1), 'M', 512, 512, (512,1), 'M'],
           'D' : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
           'E' : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']}
# input_shape = (N, 3, 224, 224)
class VGGnet(nn.Module):
    def __init__(self, config, bn, num_classes, init_weights=True, p=0.5):
        super().__init__()
        layers = self.build_layers(config, bn)
        self.features = nn.Sequential(*layers)       # -> (512, 7, 7) for a 224x224 input
        self.avgpool = nn.AdaptiveAvgPool2d((7,7))   # force a (7,7) spatial output regardless of input size
        self.fc = nn.Sequential(nn.Linear(512*7*7, 4096),
                                nn.ReLU(),
                                nn.Dropout(p),
                                nn.Linear(4096, 4096),
                                nn.ReLU(),
                                nn.Dropout(p))
        self.classifier = nn.Linear(4096, num_classes)
        if init_weights:
            for m in self.modules():
                if isinstance(m, nn.Conv2d):
                    nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                    if m.bias is not None:
                        nn.init.constant_(m.bias, 0)
                elif isinstance(m, nn.Linear):
                    nn.init.normal_(m.weight, 0, 1e-2)
                    nn.init.constant_(m.bias, 0)
                elif isinstance(m, nn.BatchNorm2d):
                    nn.init.constant_(m.weight, 1)
                    nn.init.constant_(m.bias, 0)
    def forward(self, x):
        x = self.features(x)       # (N, 512, 7, 7)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)    # (N, 512*7*7)
        x = self.fc(x)
        out = self.classifier(x)
        return out
    def build_layers(self, config, bn):
        layers = []
        in_channel = 3
        for v in config:
            if v == 'M':
                layers += [nn.MaxPool2d(2)]    # halve the spatial resolution
                continue
            if isinstance(v, int):
                size, pad = 3, 1               # plain 3x3 conv; padding=1 preserves spatial size
            else:
                v, size = v                    # (channels, kernel_size) entry, i.e. a 1x1 conv in config C
                pad = 0
            layers += [nn.Conv2d(in_channel, v, size, padding=pad)]
            if bn:
                layers += [nn.BatchNorm2d(v)]
            layers += [nn.ReLU()]
            in_channel = v
        return layers
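The paper itself trains without batch normalization, so a paper-faithful VGG-16 is just configuration D with bn=False:

vgg16 = VGGnet(configs['D'], bn=False, num_classes=1000)  # plain VGG-16, no BN, as in the paper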
Model Summary
model = VGGnet(configs['E'], True, 1000)
summary(model, input_size=(2, 3, 224, 224), device='cpu')
==========================================================================================
Layer (type:depth-idx) Output Shape Param #
==========================================================================================
VGGnet [2, 1000] --
├─Sequential: 1-1 [2, 512, 7, 7] --
│ └─Conv2d: 2-1 [2, 64, 224, 224] 1,792
│ └─BatchNorm2d: 2-2 [2, 64, 224, 224] 128
│ └─ReLU: 2-3 [2, 64, 224, 224] --
│ └─Conv2d: 2-4 [2, 64, 224, 224] 36,928
│ └─BatchNorm2d: 2-5 [2, 64, 224, 224] 128
│ └─ReLU: 2-6 [2, 64, 224, 224] --
│ └─MaxPool2d: 2-7 [2, 64, 112, 112] --
│ └─Conv2d: 2-8 [2, 128, 112, 112] 73,856
│ └─BatchNorm2d: 2-9 [2, 128, 112, 112] 256
│ └─ReLU: 2-10 [2, 128, 112, 112] --
│ └─Conv2d: 2-11 [2, 128, 112, 112] 147,584
│ └─BatchNorm2d: 2-12 [2, 128, 112, 112] 256
│ └─ReLU: 2-13 [2, 128, 112, 112] --
│ └─MaxPool2d: 2-14 [2, 128, 56, 56] --
│ └─Conv2d: 2-15 [2, 256, 56, 56] 295,168
│ └─BatchNorm2d: 2-16 [2, 256, 56, 56] 512
│ └─ReLU: 2-17 [2, 256, 56, 56] --
│ └─Conv2d: 2-18 [2, 256, 56, 56] 590,080
│ └─BatchNorm2d: 2-19 [2, 256, 56, 56] 512
│ └─ReLU: 2-20 [2, 256, 56, 56] --
│ └─Conv2d: 2-21 [2, 256, 56, 56] 590,080
│ └─BatchNorm2d: 2-22 [2, 256, 56, 56] 512
│ └─ReLU: 2-23 [2, 256, 56, 56] --
│ └─Conv2d: 2-24 [2, 256, 56, 56] 590,080
│ └─BatchNorm2d: 2-25 [2, 256, 56, 56] 512
│ └─ReLU: 2-26 [2, 256, 56, 56] --
│ └─MaxPool2d: 2-27 [2, 256, 28, 28] --
│ └─Conv2d: 2-28 [2, 512, 28, 28] 1,180,160
│ └─BatchNorm2d: 2-29 [2, 512, 28, 28] 1,024
│ └─ReLU: 2-30 [2, 512, 28, 28] --
│ └─Conv2d: 2-31 [2, 512, 28, 28] 2,359,808
│ └─BatchNorm2d: 2-32 [2, 512, 28, 28] 1,024
│ └─ReLU: 2-33 [2, 512, 28, 28] --
│ └─Conv2d: 2-34 [2, 512, 28, 28] 2,359,808
│ └─BatchNorm2d: 2-35 [2, 512, 28, 28] 1,024
│ └─ReLU: 2-36 [2, 512, 28, 28] --
│ └─Conv2d: 2-37 [2, 512, 28, 28] 2,359,808
│ └─BatchNorm2d: 2-38 [2, 512, 28, 28] 1,024
│ └─ReLU: 2-39 [2, 512, 28, 28] --
│ └─MaxPool2d: 2-40 [2, 512, 14, 14] --
│ └─Conv2d: 2-41 [2, 512, 14, 14] 2,359,808
│ └─BatchNorm2d: 2-42 [2, 512, 14, 14] 1,024
│ └─ReLU: 2-43 [2, 512, 14, 14] --
│ └─Conv2d: 2-44 [2, 512, 14, 14] 2,359,808
│ └─BatchNorm2d: 2-45 [2, 512, 14, 14] 1,024
│ └─ReLU: 2-46 [2, 512, 14, 14] --
│ └─Conv2d: 2-47 [2, 512, 14, 14] 2,359,808
│ └─BatchNorm2d: 2-48 [2, 512, 14, 14] 1,024
│ └─ReLU: 2-49 [2, 512, 14, 14] --
│ └─Conv2d: 2-50 [2, 512, 14, 14] 2,359,808
│ └─BatchNorm2d: 2-51 [2, 512, 14, 14] 1,024
│ └─ReLU: 2-52 [2, 512, 14, 14] --
│ └─MaxPool2d: 2-53 [2, 512, 7, 7] --
├─AdaptiveAvgPool2d: 1-2 [2, 512, 7, 7] --
├─Sequential: 1-3 [2, 4096] --
│ └─Linear: 2-54 [2, 4096] 102,764,544
│ └─ReLU: 2-55 [2, 4096] --
│ └─Dropout: 2-56 [2, 4096] --
│ └─Linear: 2-57 [2, 4096] 16,781,312
│ └─ReLU: 2-58 [2, 4096] --
│ └─Dropout: 2-59 [2, 4096] --
├─Linear: 1-4 [2, 1000] 4,097,000
==========================================================================================
Total params: 143,678,248
Trainable params: 143,678,248
Non-trainable params: 0
Total mult-adds (G): 39.29
==========================================================================================
Input size (MB): 1.20
Forward/backward pass size (MB): 475.41
Params size (MB): 574.71
Estimated Total Size (MB): 1051.33
==========================================================================================
Parameters
for key, val in configs.items():
tmp_model = VGGnet(val, True, 1000)
print(f"ConvNet Configuration {key} Parameters : {sum([p.numel() for p in tmp_model.parameters() if p.requires_grad])}")
ConvNet Configuration A Parameters : 132868840
ConvNet Configuration B Parameters : 133053736
ConvNet Configuration C Parameters : 133647400
ConvNet Configuration D Parameters : 138365992
ConvNet Configuration E Parameters : 143678248
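As a sanity check (assuming torchvision is installed), configuration E with batch norm lines up with torchvision's reference vgg19_bn:

from torchvision.models import vgg19_bn

ref = vgg19_bn()  # randomly initialized reference model
print(sum(p.numel() for p in ref.parameters() if p.requires_grad))  # 143678248, same as configuration E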
Forward Pass
x = torch.randn(2,3,224,224)
out = model(x)
print(out.shape)
out
torch.Size([2, 1000])
tensor([[-0.0761, -0.1180, 1.7652, ..., 1.2305, -1.1635, 0.3651],
[-0.4145, 0.1778, 0.8768, ..., 0.8948, -0.0290, 0.2008]],
grad_fn=<AddmmBackward0>)
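The outputs are raw logits. To read them as class probabilities, as in the PyTorch Hub example linked in the references, apply a softmax over the class dimension (the top-5 indices below are meaningless, of course, since this model is randomly initialized):

probs = torch.softmax(out, dim=1)               # (2, 1000), each row sums to 1
top_p, top_idx = torch.topk(probs, k=5, dim=1)
print(top_idx[0], top_p[0])                     # top-5 class indices and probabilities for the first sample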