[Neural Networks and Deep Learning] Building Deep Neural Network Step by Step

2022, May 07    


- Vectorized Implementation for Propagation

  • As we saw in the shallow neural network we built previously, a vectorized implementation lets us propagate through multiple training examples at a time without an explicit for-loop over examples, which can reduce training time significantly.
  • When implementing vectorization for your model, making sure that the dimensions of the matrices you use are consistent is really important for debugging.
  • Generalization : for layer l with n[l] units and m examples, W[l] is (n[l], n[l-1]), b[l] is (n[l], 1), and Z[l], A[l] are (n[l], m) — see the sketch after this list.
  • Keeping straight the dimensions of the various matrices and vectors you’re working with helps you eliminate whole classes of bugs in your model.
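
Below is a minimal NumPy sketch of one vectorized forward step (the layer sizes and variable names are my own illustration, not from the course), with the dimension checks that help catch those bugs:

```python
import numpy as np

# Hypothetical sizes: n[l-1] = 4 units in the previous layer, n[l] = 3 units here, m = 5 examples
n_prev, n_l, m = 4, 3, 5

A_prev = np.random.randn(n_prev, m)      # A[l-1] stacks all m examples column-wise: (n[l-1], m)
W = np.random.randn(n_l, n_prev) * 0.01  # W[l] must be (n[l], n[l-1])
b = np.zeros((n_l, 1))                   # b[l] is (n[l], 1); broadcasting copies it across the m columns

Z = np.dot(W, A_prev) + b                # one matrix product handles all m examples, no explicit for-loop
assert Z.shape == (n_l, m)               # Z[l] (and A[l]) should come out as (n[l], m)
```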


- Why Deep Representation?

  • We’ve all heard that deep neural networks (ones with many hidden layers) tend to work better than shallow ones.
  • But WHY is that so?
  • In this chapter, let’s go through a couple of examples to gain some intuition for why deep can be better than shallow.


1. Face recognition

  • Suppose you have a face recognition network with 20 hidden layers
  • If you input a picture of a face, the first layer acts roughly as a feature detector or edge detector (covered in depth in the later course on CNNs), grouping pixels together to form edges
  • The next hidden layers might then take those edges, figure out their orientations in the image, and group them together to form small parts of a face
  • As we go deeper into the layers of the model, it puts the different parts of a face together, like an eye or a nose, and can then try to recognize or even detect different types of faces
  • So intuitively, you can think of the earlier layers of the neural network as detecting simple functions, like edges, and the later layers as composing them together so the network can learn more and more complex functions


2. Circuit Theory

  • There are functions you can compute with a small L-layer deep neural network that shallower networks require exponentially more hidden units to compute
  • Suppose you’re trying to compute the exclusive OR (XOR) of n input features (x1 XOR x2 XOR x3 XOR … XOR xn)
    • the depth of the network needed to build an XOR tree over n features is on the order of log n (O(log n))
      • you only need about log2(n) levels of pairwise XORs (technically, you need a couple of units to compute one XOR gate: (x1 + x2)·(not x1 + not x2), i.e. (x1 OR x2) AND (NOT x1 OR NOT x2))
      • but it’s still a relatively small circuit (the complexity stays O(log n))
    • But if you’re not allowed to use a neural network with multiple hidden layers, you need on the order of 2^n hidden units, because a single layer has to enumerate all 2^n possible input combinations (O(2^n)).
      • each feature can take 2 values, so n features give 2^n combinations to cover

    • This shows that deep hidden layers let you compute exactly the same function with far fewer hidden units than a shallow neural network needs (see the sketch after this list)
    • A large number of hidden units means more computation, which significantly lowers the learning efficiency of the algorithm
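
A rough sketch of that counting argument (the function names here are my own, purely for illustration): a tree of pairwise XOR gates reduces n inputs to one output in about log2(n) levels, while a single hidden layer effectively has to distinguish all 2^n input patterns.

```python
from itertools import product

def xor_tree_depth(n):
    """Levels of pairwise XOR gates needed to reduce n inputs to a single output."""
    depth = 0
    while n > 1:
        n = (n + 1) // 2           # each level XORs the remaining signals in pairs
        depth += 1
    return depth                   # ~ceil(log2(n)), i.e. O(log n) depth

def shallow_pattern_count(n):
    """Number of input patterns a one-hidden-layer network has to enumerate."""
    return sum(1 for _ in product([0, 1], repeat=n))   # 2**n

print(xor_tree_depth(8))         # -> 3 levels for 8 features
print(shallow_pattern_count(8))  # -> 256 patterns (O(2^n))
```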


- Basic Building Blocks for DNN : FP & BP

  • repeat forward propagation and backward propagation until the cost converges toward a minimum (sketched below)
    • Forward Propagation
    • Backward Propagation
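
As a minimal sketch of those two building blocks for a single layer l (NumPy, assuming a ReLU activation; the function names are my own simplification):

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Forward block for layer l: Z[l] = W[l] A[l-1] + b[l], A[l] = ReLU(Z[l]).
    Returns A[l] plus a cache of everything the backward block will need."""
    Z = np.dot(W, A_prev) + b
    A = np.maximum(0, Z)                        # ReLU activation
    cache = (A_prev, W, Z)
    return A, cache

def layer_backward(dA, cache):
    """Backward block for layer l: given dA[l] and the cache, compute dW[l], db[l], dA[l-1]."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * (Z > 0)                           # ReLU derivative: 1 where Z > 0, else 0
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db
```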


  • Summary of the whole FP and BP process (see the end-to-end sketch below)
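
Putting the blocks together, one full training run looks roughly like this. It is a toy end-to-end sketch only (the 2-layer architecture, the made-up data, and the learning rate are all illustrative assumptions), meant to show the FP → cost → BP → update cycle:

```python
import numpy as np

np.random.seed(0)

# Toy data: 2 features, 200 examples, binary labels (illustrative only)
X = np.random.randn(2, 200)
Y = (X[0] * X[1] > 0).astype(float).reshape(1, -1)

# Parameters for a small 2 -> 4 -> 1 network
W1, b1 = np.random.randn(4, 2) * 0.1, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.1, np.zeros((1, 1))
lr, m = 0.5, X.shape[1]

for i in range(2000):
    # ---- forward propagation ----
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)                  # ReLU hidden layer
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))              # sigmoid output

    # ---- cost (cross-entropy) ----
    cost = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))

    # ---- backward propagation ----
    dZ2 = A2 - Y
    dW2, db2 = dZ2 @ A1.T / m, dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1, db1 = dZ1 @ X.T / m, dZ1.sum(axis=1, keepdims=True) / m

    # ---- gradient descent update ----
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2

print(f"final cost: {cost:.4f}")
```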


- Parameters vs Hyper-Parameters

  • Parameters : w[l], b[l]
    • these parameters are learned through the training process of the DNN, e.g. gradient descent
  • Hyperparameters : learning rate (α), # of iterations, # of hidden layers, hidden unit size for each layer, choice of activation function, momentum, minibatch size, regularization parameters … etc.
    • hyperparameters are not something the learning algorithm can learn for you
    • they are something you have to choose empirically by trying out appropriate combinations of values (see the sketch below)
    • empirical process : a fancy way of saying that you try out a lot of things and figure out what works best, much like running experiments
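
A minimal sketch of that empirical loop (the candidate values and the train_and_evaluate helper below are placeholders I made up, not something from the course):

```python
from itertools import product

# Candidate hyperparameter values to try (illustrative only)
learning_rates = [0.3, 0.1, 0.03, 0.01]
hidden_units = [5, 10, 20]

def train_and_evaluate(lr, n_hidden):
    """Placeholder for 'train the model with these hyperparameters and return dev-set accuracy'.
    Here it just returns a fake score so the loop runs end to end."""
    return 1.0 - abs(lr - 0.1) - 0.001 * abs(n_hidden - 10)

best = None
for lr, n_h in product(learning_rates, hidden_units):
    dev_accuracy = train_and_evaluate(lr, n_h)        # "try it out and see"
    if best is None or dev_accuracy > best[0]:
        best = (dev_accuracy, lr, n_h)

print("best (dev accuracy, learning rate, hidden units):", best)
```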