[Neural Networks and Deep Learning] Practice : Planar data classification with one hidden layer

2022, Apr 23    


Planar data classification with one hidden layer

  • for this session, let’s develop a planar data classifier using a shallow neural network with only one hidden layer
  • all references come from here!
  • This chapter covers the following:
    • Implement a binary classification neural network with a single hidden layer
    • Use non-linear activation functions such as tanh (hidden layer) and sigmoid (output layer)
    • Compute the cross-entropy loss
    • Implement forward and backward propagation to optimize the weights
    • Test model performance with different hidden unit sizes and datasets


1. Prepare required packages & Load Dataset

import unittest
!pip install testcase --trusted-host pypi.org --trusted-host files.pythonhosted.org
!pip install scikit-learn
# planar_utils.py

def plot_decision_boundary(model, X, Y, title):
    # Set min and max values and give it some padding
    x1_min, x1_max = X[0, :].min() - 1, X[0, :].max() + 1
    x2_min, x2_max = X[1, :].min() - 1, X[1, :].max() + 1
    h = 0.01
    
    # Generate a grid of points with distance h between them
    x1x1, x2x2 = np.meshgrid(np.arange(x1_min, x1_max, h), np.arange(x2_min, x2_max, h))
    
    # Predict the function value for the whole grid
    Z = model(np.c_[x1x1.ravel(), x2x2.ravel()])
    Z = Z.reshape(x1x1.shape)
    
    # Plot the contour and training examples
    plt.contourf(x1x1, x2x2, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.scatter(X[0, :], X[1, :], c=Y, cmap=plt.cm.Spectral, edgecolor='black')
    plt.title(title, fontsize=15)
    
def sigmoid(x):
    """
    Compute the sigmoid of x
    Arguments:
    x -- A scalar or numpy array of any size.
    Return:
    s -- sigmoid(x)
    """
    
    s = 1/(1+np.exp(-x))
    return s


def load_planar_dataset():
    np.random.seed(1)
    m = 400         # number of examples
    N = int(m/2)       # number of points per class
    D = 2       # dimensionality (2 input features, x1 and x2)
    X = np.zeros((m,D))       # data matrix where each row is a single example
    Y = np.zeros((m,1), dtype='uint8')       # labels vector (0 for red, 1 for blue)
    a = 4       # maximum ray of the flower

    for j in range(2):
        ix = range(N*j,N*(j+1))
        t = np.linspace(j*3.12,(j+1)*3.12,N) + np.random.randn(N)*0.2       # theta
        r = a*np.sin(4*t) + np.random.randn(N)*0.2       # radius
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        Y[ix] = j

    X = X.T
    Y = Y.T

    return X, Y


def load_extra_datasets():  
    N = 200
    noisy_circles = sklearn.datasets.make_circles(n_samples=N, factor=.5, noise=.3)
    noisy_moons = sklearn.datasets.make_moons(n_samples=N, noise=.2)
    blobs = sklearn.datasets.make_blobs(n_samples=N, random_state=5, n_features=2, centers=6)
    gaussian_quantiles = sklearn.datasets.make_gaussian_quantiles(mean=None, cov=0.5, n_samples=N, n_features=2, n_classes=2, shuffle=True, random_state=None)
    no_structure = np.random.rand(N, 2), np.random.rand(N, 2)

    return noisy_circles, noisy_moons, blobs, gaussian_quantiles, no_structure

import numpy as np 
import matplotlib.pyplot as plt
from testcase import *
import sklearn, sklearn.datasets, sklearn.linear_model

%matplotlib inline 

np.random.seed(1)  
  • testcase : provides some test examples to assess the correctness of your functions
  • np.random.seed(x) : fixes the state of NumPy's pseudo-random number generator. Seeding with the same value (here 1) before generating numbers produces the same "random" numbers on every run, which makes the results reproducible (see the short example below).
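As a quick illustration of seeding (a minimal sketch, not part of the original notebook; the values shown are what NumPy's legacy generator returns for seed 1):

np.random.seed(1)
a = np.random.randn(3)    # array([ 1.62434536, -0.61175641, -0.52817175])
np.random.seed(1)
b = np.random.randn(3)    # identical draws, because the seed was reset
print(np.allclose(a, b))  # True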
Load Dataset

X, Y = load_planar_dataset()
print(X.shape)   # 2x400 matrix
print(Y.shape)   # 1x400 vector
print(Y.shape[1])   # training size

plt.figure(figsize=(8,6))
plt.scatter(X[0, :], X[1, :], c=Y, s=40, edgecolor='black', cmap=plt.cm.Spectral)     
plt.colorbar(label = 'color')     # red for 0, Blue for 1 

image

image


2. Simple Logistic Regression Classifier

  • Before jumping into the shallow-neural-network classifier, let’s first build a relatively simple classifier with a logistic regression model
  • This way you can compare the performance of the two different algorithms
  • Use the convenient sklearn package to import a logistic regression classifier
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
lr_clf = sklearn.linear_model.LogisticRegressionCV()
lr_clf.fit(X.T, Y.T)    # X.T : (D, m)

image

# plot decision boundary for separating classes (0, 1)

plot_decision_boundary(lambda x : lr_clf.predict(x), X, Y, "Logistic Regression")   

image

# Prediction Accuracy 
from sklearn.metrics import accuracy_score

LR_result = lr_clf.predict(X.T)
print("Accuracy with Logistic Regression Classifier : {0}".format(accuracy_score(Y[0,:], LR_result)))


# instead of using the sklearn library, you can also calculate accuracy with the code below
# the dot products count the examples where Y and the prediction agree (both 1 or both 0)

print(float((np.dot(Y, LR_result) + np.dot(1 - Y,1 - LR_result)) / float(Y.size)))   

image

  • The performance of the logistic regression classifier, measured as accuracy, was not great: only 47%.
  • This result implies that the planar dataset is not linearly separable
  • So you definitely need another algorithm; hopefully a neural network with a single hidden layer will work better


3. Define Neural Network Model Structure

  • Now, let’s finally make Neural Network model with one hidden layer to predict classes for planar dataset
  • Here is the representation of our model


Gradient Descent Loop

  • Implement forward propagation –> predict
  • Compute loss
  • Implement backward propagation to get the gradients
  • Update parameters (gradient descent)
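In code, that loop looks roughly like the sketch below; it is a preview of the nn_model() function built in section 4, using the helper functions (fp, compute_cost, bp, update_params) defined there:

# schematic gradient descent loop (full version in nn_model() below)
for i in range(num_iter):
    cache = fp(X, params)                   # 1. forward propagation
    cost = compute_cost(cache['A2'], Y)     # 2. compute cross-entropy cost
    grads = bp(params, cache, X, Y)         # 3. backward propagation (gradients)
    params = update_params(params, grads)   # 4. gradient descent update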


1) Forward Propagation

  • 400 examples with two features, x1 and x2
  • The single hidden layer contains 4 hidden units, all using the same activation function, tanh
  • The activation function of the output layer is sigmoid, since the output should be a probability that is thresholded at 0.5 to return either 0 or 1; the forward-pass equations follow below
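For one example x^(i), the forward pass computes the following (σ denotes the sigmoid function; this matches the fp() implementation in section 4):

$$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}, \qquad a^{[1](i)} = \tanh\big(z^{[1](i)}\big)$$
$$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}, \qquad \hat{y}^{(i)} = a^{[2](i)} = \sigma\big(z^{[2](i)}\big)$$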


2) Computing Cost

  • The cost function of the NN is the same cross-entropy cost used for logistic regression (written out below)
  • cross-entropy is preferred over squared error because, with a sigmoid output, it makes the logistic regression cost convex and easier to optimize
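Written out, the cross-entropy cost over all m examples is:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \Big[\, y^{(i)} \log a^{[2](i)} + \big(1 - y^{(i)}\big) \log\big(1 - a^{[2](i)}\big) \Big]$$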


3) Back-Propagation for Gradient Descent

  • w = w - α * dJ/dw (α : learning rate)
  • For gradient descent, you need the partial derivative of the cost L with respect to the parameter of interest
  • So, to compute the partial derivative of L with respect to w (dW), you first compute it with respect to a (da), then with respect to z (dz), and finally with respect to w
  • The same applies to parameter b
    • dL/dw (dW) = dL/da (da) * da/dz (dz) * dz/dw
    • this is how back-propagation works through the sigmoid activation function
  • Derivatives of the activation functions : da/dz (dz)
    • the derivatives of the two activation functions used in this model are given below
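The two derivatives needed here (sigmoid at the output layer, tanh at the hidden layer) are:

$$\sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big), \qquad \tanh'(z) = 1 - \tanh^2(z)$$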


Summary of Gradient Descent for Our Model

  • we have one input layer (400 examples with 2 features) : x (2, 400)
  • hidden layer (4 nodes with ‘tanh’ activation function) : W[1] (4, 2), Z[1], b[1] –> a[1]
  • one output layer (one node with sigmoid activation function) : W[2], Z[2], b[2] –> a[2]
    • g[1]’(z[1]) here is 1 - a[1]^2 (since g(z) is tanh(z))
    • in NumPy : 1 - np.power(A1, 2) (the full gradient equations are written out below)
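Putting the chain rule pieces together gives the gradients that bp() in section 4 implements (∘ denotes the element-wise product):

$$dZ^{[2]} = A^{[2]} - Y, \qquad dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]\top}, \qquad db^{[2]} = \frac{1}{m} \sum_{i} dZ^{[2](i)}$$
$$dZ^{[1]} = \big(W^{[2]\top} dZ^{[2]}\big) \circ \big(1 - (A^{[1]})^{2}\big), \qquad dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{\top}, \qquad db^{[1]} = \frac{1}{m} \sum_{i} dZ^{[1](i)}$$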


4. Build Model

  • We’ve just defined the model structure, decision boundary, cost function and the gradient descent loop!
  • Let’s build a function named ‘nn_model()’ to realize the desired neural network


Set the sizes of each layer : input, hidden and output


def layer_sizes(X, Y):
    nx = X.shape[0]    # (2, 400) -> 2 : size of input layer
    nh = 4             # 4 hidden units (hard-coded)
    ny = Y.shape[0]    # (1, 400) -> 1 : size of output layer
    
    return (nx, nh, ny)


Initialize model parameters randomly

def initialize_params(nx, nh, ny):
    np.random.seed(2)     # set random state so that our outcomes have same value even with randomized initialization
    
    W1 = np.random.randn(nh, nx)*0.01 # from input to hidden layer (4, 2)
    b1 = np.zeros((nh, 1))     # (4, 1)
    W2 = np.random.randn(ny, nh)*0.01   # (1, 4)
    b2 = np.zeros((ny, 1))     # (1, 1)
    
    # confirm that each param has the right shape
    assert(W1.shape==(nh, nx) and b1.shape==(nh, 1) and W2.shape==(ny, nh) and b2.shape==(ny, 1))
    
    params = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
    
    return params
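A quick sanity check of the parameter shapes (an illustrative snippet, not from the original notebook, using the layer sizes of this model):

params = initialize_params(2, 4, 1)
for name, value in params.items():
    print(name, value.shape)   # W1 (4, 2), b1 (4, 1), W2 (1, 4), b2 (1, 1)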


Forward Propagation

def fp(X, params):
    W1, b1, W2, b2 = (params['W1'], params['b1'], params['W2'], params['b2']) 
    
    Z1 = np.dot(W1, X) + b1  # (4, 2) x (2, 400) = (4, 400) + b(4, 1) broadcasting
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2  # (1, 4) x (4, 400) = (1, 400) + b(1, 1) broadcasting
    A2 = sigmoid(Z2)          # probability that x is 1
    
    assert(A2.shape == (1, X.shape[1]))
    
    cache = {'Z1' : Z1, 'A1' : A1, 'Z2' : Z2, 'A2' : A2}
           
    return cache


Calculate Cost

# now calculate the cost (amount of deviation of A2 from Y)
# cost function:
#   J = -(1/m) * sum_{i=1..m} [ y(i)*log(a[2](i)) + (1 - y(i))*log(1 - a[2](i)) ]

def compute_cost(A2, Y):
    m = Y.shape[1]
    
    tmp = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1-Y)
    cost = -np.sum(tmp)/m
    cost = float(np.squeeze(cost))    # remove axis whose size is 1 
    
    assert(isinstance(cost, float))
    return cost
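As a sanity check (an illustrative example, not part of the original notebook): a prediction of 0.5 for every example should give a cost of -log(0.5) ≈ 0.693.

Y_check = np.array([[1, 0, 1]])
A2_check = np.array([[0.5, 0.5, 0.5]])
print(compute_cost(A2_check, Y_check))   # ≈ 0.6931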


Compute gradient descent

# now let's compute gradient descent of neural network with back-propagation

def bp(params, cache, X, Y):
    """
    Returns:
    grads -- gradients with respect to the parameters (dW1, db1, dW2, db2)
    """
    m = X.shape[1]   # 400
    
    W1, b1, W2, b2 = (params['W1'], params['b1'], params['W2'], params['b2'])     
    Z1, A1, Z2, A2 = (cache['Z1'], cache['A1'], cache['Z2'], cache['A2'])
    
    dZ2 = A2 - Y       # (1, 400)
    dW2 = (1/m)*np.dot(dZ2, A1.T)      # (1, 400) x (400, 4) --> (1, 4)
    db2 = (1/m)*np.sum(dZ2, axis=1, keepdims=True)       # (1, 1)
    
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2))   # (4, 1) x (1, 400) * (4, 400) --> (4, 400)
    dW1 = (1/m)*np.dot(dZ1, X.T)        # (4, 400) * (400, 2)  --> (4, 2)
    db1 = (1/m)*np.sum(dZ1, axis=1, keepdims=True)    # (4, 1)
    
    grads = {'dW1': dW1, 'dW2': dW2, 'db1' : db1, 'db2': db2}
    
    return grads


Summary -


Update Parameters

def update_params(params, grads, lr = 1.2):
    W1, b1, W2, b2 = (params['W1'], params['b1'], params['W2'], params['b2'])     
    dW1, db1, dW2, db2 = (grads['dW1'], grads['db1'], grads['dW2'], grads['db2'])  
    
    W1 -= lr*dW1   # (4, 2)
    W2 -= lr*dW2    # (1, 4)
    b1 -= lr*db1    # (4, 1)
    b2 -= lr*db2    # (1, 1)
    
    updated_params = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
    
    return updated_params


Build NN model

def nn_model(X, Y, nh, num_iter, print_cost):
    """
    Arguments:
    nh -- size of the hidden layer
    num_iter -- number of iterations in the gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by model (fp -> compute cost & bp -> update)
    """
    
    np.random.seed(3)
    nx, _, ny = layer_sizes(X, Y)
    
    params = initialize_params(nx, nh, ny)
    min_cost = float('inf')
    learning_curve = []
    
    for i in range(num_iter):    
        cache = fp(X, params)
        A2 = cache['A2']
        cost = compute_cost(A2, Y)
        grads = bp(params, cache, X, Y)
        params = update_params(params, grads)
        
        if cost < min_cost:
            min_cost = cost
            best_params = params
        
        if print_cost and i%10 == 0:
            learning_curve.append(cost)
            
            if i%1000 == 0:
                print("Cost after iterations {0} : {1}".format(i, cost))
                
    #   print(cost)
    
    return params, best_params, min_cost, learning_curve
    
  • Now, finally, we’ve built our NN model!
  • From now on, we will predict the classes of the examples (either 0 or 1) using our model
  • Decision rule : as shown below, predict class 1 when a[2] > 0.5, otherwise class 0
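Formally, the decision rule is:

$$y^{(i)}_{\text{prediction}} = \begin{cases} 1 & \text{if } a^{[2](i)} > 0.5 \\ 0 & \text{otherwise} \end{cases}$$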


Predict classes of X

def predict(best_params, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing your parameters
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """
    
    cache = fp(X, best_params)
    A2 = cache['A2']    # predicted values  (not classes)
    
    preds = A2 > 0.5   # (1, 400)
    
    return preds
# Result 
params, best_params, min_cost, learning_curve = nn_model(X, Y, 4, 10000, 1)
# plt.figure(figsize=(8, 5))
# plt.axis([0, 1000, 0, max(learning_curve)])        # [xmin, xmax, ymin, ymax]
plt.plot(learning_curve)
plt.xlabel("Iterations")
plt.ylabel("Cost")
plt.title("Learning Curve (Cost by Iterations)", size=15)


image

image


# x : comes with shape (400, 2)
plt.figure(figsize=(8, 6))
plot_decision_boundary(lambda x : predict(best_params, x.T), X, Y, "Neural Network Model with a Single Hidden Layer")

image

def accuracy_score(preds, Y, nh):
    m = Y.shape[1]
    error = float((np.dot(1-Y, preds.T) + np.dot(Y, 1-preds.T))/m)   
    # same as np.sum(np.multiply(1-Y, preds) + np.multiply(Y, 1-preds))/m
    print("Accuracy with Neural Network with 1 Hidden Layer with {0} Units: {1} %".format(nh, (1-error)*100))
    
    return (1-error)*100
_, best_params, _, _ = nn_model(X, Y, 4, 10000, 0)
preds = predict(best_params, X)
m = X.shape[1]

accuracy_score(preds, Y, 4)


image

  • With our NN model (1 hidden layer with 4 units), we’ve achieved 90% accuracy
  • Previously, the accuracy of the Logistic Regression classifier was only 47%
  • Now you can see that a NN with just one hidden layer can outperform logistic regression
  • Our NN model has learnt the leaf patterns of the flower, which shows NN can learn even highly non-linear decision boundaries, unlike logistic regression.


5. Compare Accuracy of NN with Different Unit Sizes of Hidden Layer

plt.figure(figsize=(12, 24))
h_sizes = [1, 4, 8, 12, 16, 20, 50]

for i, nh in enumerate(h_sizes):
    plt.subplot(4, 2, i+1)
    _, best_params, _, _ = nn_model(X, Y, nh, 10000, 0)
    preds = predict(best_params, X)
    plot_decision_boundary(lambda x : predict(best_params, x.T), X, Y, "Hidden Units {0}".format(nh))
    accuracy_score(preds, Y, nh)


image

image


Interpretations :

  • The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
  • The best hidden layer size seems to be around n_h = 8.
  • Indeed, values greater than 8 seem to incur noticeable overfitting, as shown in the decision boundary contour plots.
  • You will also learn later about regularization, which lets you use very large models (such as n_h = 50) without much overfitting.


6. Performance on Other Datasets

  • Now, let’s test our model’s performance on four other datasets
  • The unit size of the single hidden layer will be fixed at 8, which was found above to be the best size for avoiding overfitting
datasets = dict()
datasets['noisy_circles'], datasets['noisy_moons'], datasets['blobs'], datasets['gaussian_quantiles'], _ = load_extra_datasets()
def accuracy_score_2(preds, Y, d):
    m = Y.shape[1]
    error = float((np.dot(1-Y, preds.T) + np.dot(Y, 1-preds.T))/m)   
    # same as np.sum(np.multiply(1-Y, preds) + np.multiply(Y, 1-preds))/m
    print("Accuracy for Dataset <{0}> with Unit Size 8 : {1}%".format(d, (1-error)*100))
    
    return (1-error)*100
plt.figure(figsize=(12, 24))

for i, d in enumerate(datasets.keys()):
    plt.subplot(3, 2, i+1)
    X, Y = datasets[d]
    # print(X.shape, Y.shape)
    X, Y = X.T, Y.reshape(1, Y.shape[0])
    if d == 'blobs':
        Y = Y%2
    
    _, best_params, _, _ = nn_model(X, Y, 8, 10000, 0)
    preds = predict(best_params, X)
    plot_decision_boundary(lambda x : predict(best_params, x.T), X, Y, "Datasets : {0} with Unit size 8".format(d))
    accuracy_score_2(preds, Y, d)

image

image

  • The performance of our NN model varies quite a bit across datasets
  • Still, we can clearly see that our model can learn highly non-linear, complex decision boundaries with reasonably good accuracy