2022, Jun 05    

Building Deep Neural Network : Step by Step

  • We previously built a shallow planar classifier (with 1 hidden layer). This week, we will build a real “Deep” neural network with as many layers as we want!
  • This practice covers all of the following:
    • Use ReLU for every hidden layer and a sigmoid activation function for the output layer
    • Build multiple hidden layers (more than 1)
    • Implement an easy-to-use neural network class

0. Outline of Practice

  • To build our neural network, we will define several “helper functions”, which will be used later to build a 2-layer neural network and an L-layer neural network
  • Types of helper functions that will be defined
    • Initialize Parameters
    • Forward Propagation (linear, ReLU)
    • Compute Cost
    • Backward Propagation (linear, ReLU)
    • Update Parameters (Gradient Descent)
  • Summary of model
    • Activation functions: ReLU for the hidden layers (L-1 layers) and sigmoid for the output layer

1. Load Packages

  • TestCases : test cases to assess the correctness of your functions, got this from here
  • Activation functions (ReLU, Sigmoid) and their derivatives with respect to Z (for back-propagation)
  • TestCases

import numpy as np

def linear_forward_test_case():
    # Fixed example values, kept for reference only (the random draws below are what is returned)
    # X = np.array([[-1.02387576, 1.12397796],
    #               [-1.62328545, 0.64667545],
    #               [-1.74314104, -0.59664964]])
    # W = np.array([[ 0.74505627, 1.97611078, -1.24412333]])
    # b = np.array([[1]])
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    return A, W, b

def linear_activation_forward_test_case():
    # Fixed example values, kept for reference only (the random draws below are what is returned)
    # X = np.array([[-1.02387576, 1.12397796],
    #               [-1.62328545, 0.64667545],
    #               [-1.74314104, -0.59664964]])
    # W = np.array([[ 0.74505627, 1.97611078, -1.24412333]])
    # b = 5
    A_prev = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    return A_prev, W, b

def L_model_forward_test_case():
    # Fixed example values, kept for reference only (the random draws below are what is returned)
    # X = np.array([[-1.02387576, 1.12397796],
    #               [-1.62328545, 0.64667545],
    #               [-1.74314104, -0.59664964]])
    # parameters = {'W1': np.array([[ 1.62434536, -0.61175641, -0.52817175],
    #                               [-1.07296862,  0.86540763, -2.3015387 ]]),
    #               'W2': np.array([[ 1.74481176, -0.7612069 ]]),
    #               'b1': np.array([[ 0.],
    #                               [ 0.]]),
    #               'b2': np.array([[ 0.]])}
    X = np.random.randn(4,2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    return X, parameters

def compute_cost_test_case():
    Y = np.asarray([[1, 1, 1]])
    aL = np.array([[.8,.9,0.4]])
    return Y, aL

def linear_backward_test_case():
    dZ = np.random.randn(2,2)
    A = np.random.randn(3,2)
    W = np.random.randn(2,3)
    b = np.random.randn(2,1)
    linear_cache = (A, W, b)
    return dZ, linear_cache

def linear_activation_backward_test_case():
    dA = np.random.randn(1,2)
    A = np.random.randn(3,2)
    W = np.random.randn(1,3)
    b = np.random.randn(1,1)
    Z = np.random.randn(1,2)
    linear_cache = (A, W, b)
    activation_cache = Z
    linear_activation_cache = (linear_cache, activation_cache)
    return dA, linear_activation_cache

def L_model_backward_test_case():
    AL = np.random.randn(1, 2)
    Y = np.array([[1, 0]])

    A1 = np.random.randn(4,2)
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    Z1 = np.random.randn(3,2)
    linear_cache_activation_1 = ((A1, W1, b1), Z1)

    A2 = np.random.randn(3,2)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    Z2 = np.random.randn(1,2)
    linear_cache_activation_2 = ( (A2, W2, b2), Z2)

    caches = (linear_cache_activation_1, linear_cache_activation_2)

    return AL, Y, caches

def update_parameters_test_case():
    W1 = np.random.randn(3,4)
    b1 = np.random.randn(3,1)
    W2 = np.random.randn(1,3)
    b2 = np.random.randn(1,1)
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    dW1 = np.random.randn(3,4)
    db1 = np.random.randn(3,1)
    dW2 = np.random.randn(1,3)
    db2 = np.random.randn(1,1)
    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}
    return parameters, grads

  • Activation Function
def sigmoid(Z):
    """Sigmoid activation function for the output layer. Returns A and Z (Z is kept as the cache for backprop)."""
    A = 1/(1+np.exp(-Z))
    return A, Z

def relu(Z):
    """ReLU activation: returns Z where Z > 0, else 0. Also returns Z as the cache."""
    A = np.maximum(0, Z)
    assert(A.shape == Z.shape)
    return A, Z

def relu_bp(dA, Z):
    """Backprop through a single ReLU unit: dZ = dA where Z > 0, and 0 where Z <= 0."""
    dZ = np.array(dA, copy=True)    # copy so that dA itself is not modified
    assert(dZ.shape == Z.shape)
    dZ[Z <= 0] = 0     # the ReLU gradient is 1 for Z > 0 and is taken as 0 for Z <= 0
    return dZ

def sigmoid_bp(dA, Z):
    """Backprop through a single sigmoid unit: dZ = dA * A * (1 - A)."""
    A = 1/(1 + np.exp(-Z))    # fixed: the argument is Z (upper case), not z
    dZ = dA*A*(1-A)
    assert (dZ.shape == Z.shape)
    return dZ
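
One way to sanity-check the two backward formulas is to compare them with a finite-difference estimate of the derivative. The sketch below is self-contained: sigmoid_demo, relu_demo and numeric_grad are hypothetical stand-ins written for this check and are not part of the practice code, and the upstream gradient dA is assumed to be all ones so that dZ equals the derivative of the activation itself.

```python
import numpy as np

def sigmoid_demo(Z):
    return 1 / (1 + np.exp(-Z))

def relu_demo(Z):
    return np.maximum(0, Z)

def numeric_grad(f, Z, eps=1e-6):
    # element-wise central difference
    return (f(Z + eps) - f(Z - eps)) / (2 * eps)

Z = np.array([[-2.0, -0.5, 0.5, 2.0]])   # avoids Z = 0, where ReLU has a kink
dA = np.ones_like(Z)                      # assumed upstream gradient

A = sigmoid_demo(Z)
dZ_sigmoid = dA * A * (1 - A)             # same formula as the sigmoid backward step
dZ_relu = np.where(Z > 0, dA, 0.0)        # same rule as the ReLU backward step

sigmoid_err = np.max(np.abs(dZ_sigmoid - numeric_grad(sigmoid_demo, Z)))
relu_err = np.max(np.abs(dZ_relu - numeric_grad(relu_demo, Z)))
print(sigmoid_err, relu_err)              # both should be tiny
```

If either error were large, the corresponding backward formula (or its sign) would be the first thing to inspect.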

2. Random Initialization

  • In this section, we will define 2 helper functions: the first initializes the parameters of a 2-layer model, and the second extends this initialization to L layers

2.1 Two-Layer Neural Network

  • The model’s structure is: LINEAR (Wx + b) -> RELU (Activation function) -> LINEAR (Wx + b) -> SIGMOID (Activation function).
  • Use np.random.randn(shape)*0.01 with the correct shape for random initialization of weight matrices (W).
  • Use zero initialization for the biases (b). Use np.zeros(shape=())
def init_params(nx, nh, ny):
    """
    nx : size of the input layer
    nh : size of the hidden layer
    ny : size of the output layer
    Returns params with W1 : (nh, nx), b1 : (nh, 1), W2 : (ny, nh), b2 : (ny, 1)
    """
    # randn (standard normal) as described above; rand would give only positive weights
    W1 = np.random.randn(nh, nx)*0.01
    b1 = np.zeros(shape=(nh, 1))
    W2 = np.random.randn(ny, nh)*0.01
    b2 = np.zeros(shape=(ny, 1))
    assert(W1.shape == (nh, nx))
    assert(b1.shape == (nh, 1))
    assert(W2.shape == (ny, nh))
    assert(b2.shape == (ny, 1))
    params = {"W1" : W1,
              "b1" : b1,
              "W2" : W2,
              "b2" : b2}
    return params

params = init_params(4, 5, 2)
for key, val in params.items():
    print("{0} : {1}".format(key, val))
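
Why are the weights random while the biases can be zero? A short illustration, using a hypothetical input and hypothetical layer sizes: if W1 were all zeros as well, every hidden unit would compute exactly the same value, so the units could never learn different features. Small random weights break this symmetry. The helper name hidden_Z is made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only so the demo is repeatable
X = rng.standard_normal((4, 3))          # hypothetical input : nx = 4, m = 3

def hidden_Z(W1, b1, X):
    # linear part of the hidden layer : Z1 = W1 X + b1
    return np.dot(W1, X) + b1

b1 = np.zeros((5, 1))
Z_zero = hidden_Z(np.zeros((5, 4)), b1, X)
Z_rand = hidden_Z(rng.standard_normal((5, 4)) * 0.01, b1, X)

# Are all hidden units (rows) identical?
same_rows_zero = bool(np.all(Z_zero == Z_zero[0]))
same_rows_rand = bool(np.all(Z_rand == Z_rand[0]))
print(same_rows_zero, same_rows_rand)
```

With zero weights the rows are identical, and they would also receive identical gradients, which is why only the biases are initialized to zero.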


2.2 L-layer Neural Network

  • The initialization process for a deep L-layer network is more involved than for the shallow model, because it has to keep track of the dimensions of the weight matrices and bias vectors of every layer

  • So we will use a for-loop to randomly initialize the parameters of each layer with the right dimensions

def init_params_L(dims):
    """
    dims : list that contains the size of every layer in the network, input layer included
    params : python dict containing randomized initial parameters (W1, b1, W2, b2, ... , W[L-1], b[L-1]),
             where L = len(dims) and W[i] has shape (dims[i], dims[i-1])
    """
    params = dict()
    L = len(dims)    # includes the input layer, so there are L-1 layers with parameters
    for i in range(1, L):
        params["W{0}".format(i)] = np.random.randn(dims[i], dims[i-1])*0.01   # randn, as in the 2-layer case
        params["b{0}".format(i)] = np.zeros(shape=(dims[i], 1))
        assert(params["W{0}".format(i)].shape == (dims[i], dims[i-1]))
        assert(params["b{0}".format(i)].shape == (dims[i], 1))
    return params

dims = [3, 4, 5, 2]    # nx : 3, nh1 : 4, nh2 : 5, ny (output layer) : 2
params = init_params_L(dims)

for key, val in params.items():
    print("{0} :\n {1}".format(key, val))
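
A quick way to double-check the shapes produced for a given dims list is to count the parameters by hand: layer i contributes dims[i]*dims[i-1] weights and dims[i] biases. The helper name count_params is hypothetical and only serves this check.

```python
def count_params(dims):
    # dims[0] is the input size; layers 1..len(dims)-1 each have a W and a b
    total = 0
    for i in range(1, len(dims)):
        total += dims[i] * dims[i - 1]   # W[i] : (dims[i], dims[i-1])
        total += dims[i]                 # b[i] : (dims[i], 1)
    return total

print(count_params([3, 4, 5, 2]))   # (12 + 4) + (20 + 5) + (10 + 2) = 53
```

Summing the sizes of the arrays in an initialized params dict should give the same number.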


3. Forward Propagation

  • We have now initialized all of the parameters of the L-layer model.
  • As the next step, we will implement the forward propagation modules, which consist of 2 processes.
    • linear propagation : calculates Z[i] = W[i]*A[i-1] + b[i]
      • np.dot(W, A) + b
    • linear-activation propagation : A[i] = Act_Func(Z[i])
      • RELU(Z) : Z if Z >= 0, else 0
      • Sigmoid(Z) : 1/(1 + np.exp(-Z))
  • Finally, we will define a new helper function that applies linear-activation propagation to every layer of our deep L-layer model at once

3.1 Linear Propagation

def linear_fp(A, W, b):
    """
    A : output of previous layer (n[i-1], m)
    W : weight matrix of current layer (n[i], n[i-1])
    b : bias matrix of current layer (n[i], 1)
    Z : result of linear propagation = W*A + b
    cache : tuple (A, W, b), stored for back-propagation
    """
    Z = np.dot(W, A) + b    # b is broadcast across the m columns
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    return Z, cache

A, W, b = linear_forward_test_case()   # see 1. Packages 
# A : (3, 2) 
# W : (1, 3)
# b : (1, 1)

Z, cache = linear_fp(A, W, b)
# expected Z shape : (1, 2)

print("Z : {0}".format(Z))
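
To see what np.dot(W, A) + b does across several examples at once, here is a small hand-checkable case with made-up numbers. The bias of shape (n[i], 1) is broadcast over the m columns, one column per training example.

```python
import numpy as np

W = np.array([[1.0, 2.0]])        # (1, 2) : one unit, two inputs
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # (2, 2) : two inputs, m = 2 examples
b = np.array([[0.5]])             # (1, 1) : broadcast across both examples

Z = np.dot(W, A) + b
print(Z)          # [[1.5 2.5]]
print(Z.shape)    # (1, 2)
```

The first column is 1*1 + 2*0 + 0.5 = 1.5 and the second is 1*0 + 2*1 + 0.5 = 2.5.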


3.2 Linear-Activation Propagation

  • This helper function performs both the linear step and the activation step
  • Here, we will use the previously defined activation functions, sigmoid and relu
def linear_activation_fp(activation, A_prev, W, b):
    """
    Calculates both linear and activation propagation
    A_prev : output of previous layer (n[i-1], m)
    W : weight matrix of current layer (n[i], n[i-1])
    b : bias matrix of current layer (n[i], 1)
    A : output of current layer (n[i], m)
    """
    Z, linear_cache = linear_fp(A_prev, W, b)   # linear_cache : A_prev, W, b
    activation_cache = Z
    if activation == "relu":
        A, _ = relu(Z)
    elif activation == "sigmoid":
        A, _ = sigmoid(Z)
    else:
        raise ValueError("activation must be 'relu' or 'sigmoid'")
    assert(Z.shape == (W.shape[0], A_prev.shape[1]))
    assert(A.shape == (W.shape[0], A_prev.shape[1]))
    return A, linear_cache, activation_cache

A_prev, W, b = linear_activation_forward_test_case()

A, lin_cache, act_cache = linear_activation_fp("relu", A_prev, W, b)
print("--- ReLu Activation ---\nA : {0}\nZ (activation_cache) :\n {1}".format(A, act_cache))


A, lin_cache, act_cache = linear_activation_fp("sigmoid", A_prev, W, b)
print("--- Sigmoid Activation ---\nA : {0}\nZ (activation_cache) :\n {1}".format(A, act_cache))


3.3 Forward Propagation for L-Layer model

  • Finally, we can apply the previously defined linear_activation_fp function to every layer of our deep model at once using a for-loop
  • As the activation function, we will use relu for layers 1~L-1 and sigmoid for layer L, which is our final output layer
  • Through this process, we will also store the caches (A_prev, W, b and Z) from every layer in one list named “caches” (the results of fp for all L layers)
def L_model_fp(X, params):
    """
    Implement linear-activation forward propagation for L-layer model
    Layer 1~L-1 : relu
    Layer L : sigmoid
    X : training examples (nx, m)
    params : initialized params containing W1, b1 ~ W[L], b[L]
    AL : final output from layer L
    caches : list of caches from every layer
             each cache has the form (linear_cache(A_prev, W, b), activation_cache(Z))
             index 0 ~ L-2 : relu activation
             index L-1 : sigmoid activation
    """
    caches = []
    L = len(params)//2    # each layer has one W and one b
    A_prev = X
    for i in range(1, L+1):
        W, b = params["W{0}".format(i)], params["b{0}".format(i)]
        if i == L:
            AL, lin_cache, act_cache = linear_activation_fp("sigmoid", A_prev, W, b)
            caches.append((lin_cache, act_cache))
        else:
            A, lin_cache, act_cache = linear_activation_fp("relu", A_prev, W, b)
            caches.append((lin_cache, act_cache))
            A_prev = A
    assert(AL.shape == (1, X.shape[1]))
    return AL, caches

X, params = L_model_forward_test_case()  # X : (4, 2) / 2 layers
AL, caches = L_model_fp(X, params)

print("Final Output AL : {0}".format(AL))


for i, (lin, act) in enumerate(caches):
    print("-- Cache from Layer {0} --".format(i+1))
    print("A[{0}] :\n{2}\nW[{1}] :\n{3}\nb[{1}] :\n{4}".format(i, i+1, lin[0], lin[1], lin[2]))
    print("Z[{0}] :\n{1}".format(i+1, act))

print("Length of Caches : {}".format(len(caches)))
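
The layer-by-layer loop can also be written in a compact, self-contained form to trace how the activation shape changes from layer to layer. The function name forward_shapes and the layer sizes are assumptions made for illustration; the activations follow the scheme described above (relu for hidden layers, sigmoid for the output layer).

```python
import numpy as np

def forward_shapes(X, params):
    # Returns the final output and the shape of A after each layer
    L = len(params) // 2
    A = X
    shapes = []
    for i in range(1, L + 1):
        Z = np.dot(params["W%d" % i], A) + params["b%d" % i]
        A = 1 / (1 + np.exp(-Z)) if i == L else np.maximum(0, Z)
        shapes.append(A.shape)
    return A, shapes

rng = np.random.default_rng(1)
dims = [4, 3, 1]                      # hypothetical: 4 inputs, 3 hidden units, 1 output
params = {}
for i in range(1, len(dims)):
    params["W%d" % i] = rng.standard_normal((dims[i], dims[i - 1])) * 0.01
    params["b%d" % i] = np.zeros((dims[i], 1))

AL, shapes = forward_shapes(rng.standard_normal((4, 2)), params)
print(shapes)     # [(3, 2), (1, 2)]
```

Each shape is (n[i], m): the number of units in the layer by the number of examples, which is exactly what the asserts in the helper functions check.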


4. Cost Function

  • The cost function is the cross-entropy cost (the same one we have used all along): J = -(1/m) * Σ [ Y*log(AL) + (1-Y)*log(1-AL) ], summed over the m examples

  • Let’s write the helper function that computes the cost in Python

def compute_cost(AL, Y):
    """cost : cross-entropy cost"""
    m = Y.shape[1]
    cost = (-1/m)*np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1-Y, np.log(1-AL)))   # element-wise multiplication
    cost = np.squeeze(cost)    # make sure that cost is a scalar, not a matrix : eliminates axes whose size is 1
    assert(cost.shape == ())
    return cost

Y, AL = compute_cost_test_case()

print("Cost for test case : {}".format(compute_cost(AL, Y)))
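
One practical caveat: if AL ever contains exactly 0 or 1, np.log returns -inf and the cost becomes inf or nan. A common workaround, shown here as a sketch and not as part of the helper above, is to clip AL into [eps, 1-eps] first. The name compute_cost_clipped and the value of eps are assumptions.

```python
import numpy as np

def compute_cost_clipped(AL, Y, eps=1e-12):
    # cross-entropy cost with AL clipped away from 0 and 1
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)
    cost = (-1 / m) * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return float(cost)

Y = np.array([[1, 1, 1]])
AL = np.array([[0.8, 0.9, 0.4]])
cost_regular = compute_cost_clipped(AL, Y)
print(cost_regular)                       # about 0.4149

# stays finite even when a prediction is exactly 0 or 1
cost_extreme = compute_cost_clipped(np.array([[1.0, 0.0, 0.4]]), Y)
print(cost_extreme)
```

For ordinary predictions the clipping changes nothing; it only matters in the saturated case.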


5. Backward Propagation

  • We have now built nearly all of the helper functions: initializing parameters, forward propagation and computing the cost function
  • The last one left is backward propagation, which computes the gradients used to update the parameters (W[l], b[l]) until the model reaches the global optimum (or at least gets close to it)
  • Here’s the simplified diagram of backward propagation for an L-layer model (2 layers in the example)

  • There are largely three steps in propagating backward
    • LINEAR : dW[l], db[l], dA[l-1]
    • LINEAR -> ACTIVATION : dZ[l]
      • derivative of the ReLU function for layers 1~L-1
      • derivative of the sigmoid function for layer L (output)
    • L-MODEL : repeat the two steps above for every layer, from layer L back to layer 1

    • note that dZ[l] is needed to calculate dW[l], db[l], dA[l-1], so dZ[l] must be computed before dW[l], db[l], dA[l-1]
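
Written out, the quantities in these steps follow from Z[l] = W[l]A[l-1] + b[l] and the chain rule. With m training examples, the standard results are:

```latex
\begin{aligned}
dZ^{[l]}   &= dA^{[l]} \odot g^{[l]\prime}\!\left(Z^{[l]}\right) \\
dW^{[l]}   &= \frac{1}{m}\, dZ^{[l]} \, A^{[l-1]\,T} \\
db^{[l]}   &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dA^{[l-1]} &= W^{[l]\,T} \, dZ^{[l]}
\end{aligned}
```

Here g[l] is the activation function of layer l (ReLU or sigmoid) and ⊙ is element-wise multiplication. In this convention dA and dZ are gradients of the per-example loss, and the 1/m averaging enters only in dW and db.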

5.1 Linear Backward Propagation

  • The linear bp function uses dZ[l] and Z[l] = W[l]*A[l-1] + b[l] to compute the gradients with respect to W[l], A[l-1], b[l]
  • Make sure that each gradient keeps the same dimensions as its original matrix
def linear_bp(dZ, cache):
    """
    Implement linear back-propagation for a single layer
    dZ (n_cur, m) : gradient of cost with respect to Z (linear output),
                    obtained from linear-activation backward
    cache : (A_prev, W, b) stored during forward propagation
    dA_prev (n_prev, m), dW (n_cur, n_prev), db (n_cur, 1) : gradient of cost with respect to A_prev, W, b respectively
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m    # (n_cur, m) x (m, n_prev) = (n_cur, n_prev)
    # sum over the m examples (axis=1) and keep the result 2-D : (n_cur, 1), same shape as b.
    # Squeezing db would break len(db) when n_cur == 1 and the in-place update of b later on.
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)   # (n_prev, n_cur) x (n_cur, m) = (n_prev, m)
    assert(dA_prev.shape == A_prev.shape)
    assert(dW.shape == W.shape)
    assert(db.shape == b.shape)
    return dA_prev, dW, db

dZ, linear_cache = linear_backward_test_case()   

# dZ : (2, 2) / linear_cache - A_prev : (3, 2), W : (2, 3), b : (2, 1)

dA_prev, dW, db = linear_bp(dZ, linear_cache)

print("dA_prev :\n{0}".format(dA_prev))
print("dW :\n{0}".format(dW))
print("db :\n{0}".format(db))
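
A finite-difference check is a handy way to confirm the dW and db formulas. The sketch below is self-contained and uses a hypothetical objective J = sum(G * Z) / m, where G is a fixed random matrix playing the role of dZ under the convention used here (the 1/m factor is applied inside dW and db). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 2))     # (n_prev, m)
W = rng.standard_normal((2, 3))          # (n_cur, n_prev)
b = rng.standard_normal((2, 1))          # (n_cur, 1)
G = rng.standard_normal((2, 2))          # plays the role of dZ : (n_cur, m)
m = A_prev.shape[1]

def J():
    # hypothetical objective, linear in Z = W A_prev + b
    return np.sum(G * (np.dot(W, A_prev) + b)) / m

# analytic gradients, same formulas as the linear backward step
dW = np.dot(G, A_prev.T) / m
db = np.sum(G, axis=1, keepdims=True) / m

def numeric(f, x, eps=1e-6):
    # central differences, perturbing one entry of x at a time (in place)
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = f()
        x[idx] = old - eps
        minus = f()
        x[idx] = old
        g[idx] = (plus - minus) / (2 * eps)
    return g

dW_err = np.max(np.abs(dW - numeric(J, W)))
db_err = np.max(np.abs(db - numeric(J, b)))
print(dW_err, db_err)                    # both should be tiny
```

Note that db keeps the shape (n_cur, 1) here, so it lines up with b element by element.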


5.2 Linear-Activation Backward Propagation

  • We’ve built the linear backward propagation helper function for dW, dA_prev, db
  • Now, using this linear bp function and the previously defined sigmoid and relu bp functions, we will write the linear-activation backward propagation function, which handles the two types of activation function
    • ReLU for layers 1~L-1 : dZ = dA if Z > 0, else dZ = 0
      • dZ = relu_bp(dA, Z)
    • Sigmoid for layer L : dZ = dA * A(1-A)
      • dZ = sigmoid_bp(dA, Z)
  • The order of back-propagation is LINEAR-ACTIVATION (dZ) -> LINEAR (dA_prev, dW, db)
def linear_activation_bp(activation, dA, cache):
    """
    Implement relu-backward for layers 1~L-1 and sigmoid-backward for layer L (output)
    dA : post-activation gradient of cost with respect to A (A of the current layer)
    cache : tuple of caches (linear_cache, activation_cache) stored during linear-activation forward propagation
    activation : type of activation function at the current layer, which determines the form of dZ
    dW : (n_cur, n_prev)
    dA_prev : (n_prev, m)
    db : gradient of cost with respect to b, as returned by linear_bp
    """
    linear_cache, Z = cache   # linear_cache, activation_cache
    if activation == "relu":
        dZ = relu_bp(dA, Z)
    elif activation == "sigmoid":
        dZ = sigmoid_bp(dA, Z)
    else:
        raise ValueError("activation must be 'relu' or 'sigmoid'")
    dA_prev, dW, db = linear_bp(dZ, linear_cache)

    return dA_prev, dW, db

dA, cache = linear_activation_backward_test_case()   
# dA : (1, 2) / cache : (linear_cache(A, W, b), act_cache(Z))

dA_prev, dW, db = linear_activation_bp("relu", dA, cache)
print("-- Relu Activation --")
print("dA_prev :\n{0}".format(dA_prev))
print("dW :\n{0}".format(dW))
print("db :\n{0}".format(db))


dA_prev, dW, db = linear_activation_bp("sigmoid", dA, cache)
print("-- Sigmoid Activation --")
print("dA_prev :\n{0}".format(dA_prev))
print("dW :\n{0}".format(dW))
print("db :\n{0}".format(db))


5.3 Backward Propagation for L-layer Model

  • Finally, we will implement the backward propagation for the whole network.
  • We will use “caches”, the list of caches from all layers that we obtained during forward propagation
  • The image below shows the simplified diagram of the backward pass

  • Before starting L-layer back-propagation, we need to calculate dA[L], which is the initial input of back-propagation
  • dA[L] is the derivative of the cost with respect to the final forward-propagation output A[L]
    • dA[L] = - (np.divide(Y, AL) - np.divide(1-Y, 1-AL))
    • you can easily prove this equation by taking the partial derivative of our cross-entropy cost function with respect to AL
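
For completeness, the derivation takes one line. Taking the loss of a single example (the 1/m averaging is applied later, inside dW and db) and writing a for A[L]:

```latex
\mathcal{L}(a, y) = -\bigl( y \log a + (1 - y) \log (1 - a) \bigr)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial a}
  = -\frac{y}{a} + \frac{1 - y}{1 - a}
  = -\left( \frac{y}{a} - \frac{1 - y}{1 - a} \right)
```

This is the expression for dA[L] given above, applied element-wise to AL and Y.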
def L_model_bp(AL, Y, caches):
    """
    Implement backward propagation :
    [LINEAR-ACTIVATION (sigmoid)] -> [LINEAR] -> ([LINEAR-ACTIVATION (relu)] -> [LINEAR]) * (L-1)
    AL : final post-activation output of forward propagation (1, m), the starting point of bp
    Y : true label (1, m), required here to derive dAL = -Y/AL + (1-Y)/(1-AL)
    caches : A_prev, W, b (linear_cache), Z (activation_cache) from every layer, stored during forward propagation
    grads : python dictionary with gradients of all parameters (dW[1], db[1] ... dW[L], db[L])
    """
    L = len(caches)
    dAL = -(np.divide(Y, AL) - np.divide(1-Y, 1-AL))   # (1, m)
    grads = dict()
    dA = dAL
    for i in range(1, L+1):
        if i == 1:    # output layer (layer L) : sigmoid
            dA_prev, dW, db = linear_activation_bp('sigmoid', dA, caches[L-i])
        else:         # hidden layers L-1 ~ 1 : relu
            dA_prev, dW, db = linear_activation_bp('relu', dA, caches[L-i])
        grads["dA{0}".format(L-i)] = dA_prev
        grads["dW{0}".format(L-(i-1))] = dW
        grads["db{0}".format(L-(i-1))] = db
        dA = dA_prev
    return grads

AL, Y, caches = L_model_backward_test_case()   
# 2 Layer
# m : 2
# unit size of layer 1 : 3
# unit size of layer 2 : 1

grads = L_model_bp(AL, Y, caches)
L = len(caches)

for i in range(1, L+1):
    print("-- Layer {0} --".format(i))
    if i == 1:
        print("dX :\n{0}".format(grads["dA{0}".format(i-1)]))
    else : 
        print("dA{0} :\n{1}".format(i-1, grads["dA{0}".format(i-1)]))
    print("dW{0} :\n{1}".format(i, grads["dW{0}".format(i)]))
    print("db{0} :\n{1}".format(i, grads["db{0}".format(i)]))    
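
A useful consistency check on the output layer: combining dA[L] with the sigmoid derivative gives dZ[L] = dA[L] * AL * (1 - AL), which simplifies algebraically to AL - Y. The snippet below verifies this numerically on made-up values; it is an illustration and not part of the helper functions.

```python
import numpy as np

Z = np.array([[-1.0, 0.3, 2.0]])   # hypothetical pre-activations of the output layer
Y = np.array([[0, 1, 1]])          # hypothetical labels

AL = 1 / (1 + np.exp(-Z))
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZL = dAL * AL * (1 - AL)          # what the sigmoid backward step computes

matches = bool(np.allclose(dZL, AL - Y))
print(matches)                     # True
```

The simplified form AL - Y also avoids dividing by AL or 1 - AL, which helps when the output saturates near 0 or 1.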


6. Update Parameters

  • Now we’re almost done. The only thing left is a function that updates the parameters with the gradient values from grads, the dict of gradients of each parameter that we got from the L_model_bp function
  • This step is called “Gradient Descent”: we repeatedly move each parameter against its gradient of the cost until the model gets close to an optimum (the gradients approach zero)
  • We also need to set a proper learning rate α to control the speed of learning, so that our algorithm converges instead of diverging

def update_params(params, grads, lr):
    """
    Update parameters using gradient descent
    params : python dict containing your parameters
    grads : python dict containing gradients of all parameters
    lr : learning rate α
    """
    L = len(params)//2    # each layer has one W and one b
    for i in range(1, L+1):
        params["W{0}".format(i)] -= lr*grads["dW{0}".format(i)]
        params["b{0}".format(i)] -= lr*grads["db{0}".format(i)]
    return params
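
The effect of the learning rate is easiest to see on a toy problem. The sketch below runs the same update rule, w -= lr * dw, on the one-dimensional cost J(w) = w^2 (so dw = 2w). This cost is a stand-in chosen for illustration, not the network above, and the function name descend is hypothetical.

```python
def descend(lr, steps=20, w=1.0):
    # gradient descent on J(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w -= lr * 2 * w
    return w

w_small = descend(lr=0.1)    # each step multiplies w by 0.8, so w shrinks toward 0
w_large = descend(lr=1.1)    # each step multiplies w by -1.2, so |w| keeps growing
print(abs(w_small) < 0.1, abs(w_large) > 1.0)   # True True
```

The same rule with a slightly too large step overshoots the minimum on every update, which is the divergence the learning rate has to prevent.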

  • We have now written, step by step, all the functions required to build a deep L-layer model (no matter how big it is!)
  • In the next practice, we will put all these functions together to build two types of models:
    • 2-layer neural network
    • L-layer neural network
  • We will use these two models to classify cat vs non-cat images (as we did with the logistic regression classifier) and compare the performance of the two models