[Neural Networks and Deep Learning] Practice : Planar data classification with one hidden layer
2022, Apr 23
Planar data classification with one hidden layer
- In this session, let’s develop a planar data classifier using a shallow neural network with only 1 hidden layer
- all references come from here!
- This chapter covers the following:
- Implement a binary classification neural network with a single hidden layer
- Use a non-linear activation function such as tanh or sigmoid
- Compute the cross-entropy loss
- Implement forward and backward propagation to optimize the weights
- Test Model Performance with Different Hidden Unit Size and Datasets
1. Prepare required packages & Load Dataset
import unittest
!pip install testcase --trusted-host pypi.org --trusted-host files.pythonhosted.org
!pip install scikit-learn
- How to use plt.contourf() to draw a contour plot (see the short sketch below)
- np.c_ : stacks 1-D arrays as columns into a 2-D array
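- A minimal, self-contained sketch (not from the original notebook) of the grid trick used in plot_decision_boundary below: np.c_ pairs up the flattened grid coordinates so a model can score every point, and plt.contourf() colors the grid by the predicted class. The toy "model" here is purely illustrative.

import numpy as np
import matplotlib.pyplot as plt

xx, yy = np.meshgrid(np.arange(-1, 1, 0.01), np.arange(-1, 1, 0.01))
grid_points = np.c_[xx.ravel(), yy.ravel()]  # shape (n_points, 2): one (x1, x2) pair per row
toy_scores = (grid_points[:, 0] * grid_points[:, 1] > 0).astype(int)  # toy "model" for illustration only
plt.contourf(xx, yy, toy_scores.reshape(xx.shape), cmap=plt.cm.Spectral)
plt.show()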
# planar_utils.py
def plot_decision_boundary(model, X, Y, title):
    # Set min and max values and give it some padding
    x1_min, x1_max = X[0, :].min() - 1, X[0, :].max() + 1
    x2_min, x2_max = X[1, :].min() - 1, X[1, :].max() + 1
    h = 0.01
    # Generate a grid of points with distance h between them
    x1x1, x2x2 = np.meshgrid(np.arange(x1_min, x1_max, h), np.arange(x2_min, x2_max, h))
    # Predict the function value for the whole grid
    Z = model(np.c_[x1x1.ravel(), x2x2.ravel()])
    Z = Z.reshape(x1x1.shape)
    # Plot the contour and training examples
    plt.contourf(x1x1, x2x2, Z, cmap=plt.cm.Spectral)
    plt.ylabel('x2')
    plt.xlabel('x1')
    plt.scatter(X[0, :], X[1, :], c=Y, cmap=plt.cm.Spectral, edgecolor='black')
    plt.title(title, fontsize=15)
def sigmoid(x):
    """
    Compute the sigmoid of x

    Arguments:
    x -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(x)
    """
    s = 1/(1+np.exp(-x))
    return s
def load_planar_dataset():
    np.random.seed(1)
    m = 400 # number of examples
    N = int(m/2) # number of points per class
    D = 2 # dimensionality (2 input features per example)
    X = np.zeros((m,D)) # data matrix where each row is a single example
    Y = np.zeros((m,1), dtype='uint8') # labels vector (0 for red, 1 for blue)
    a = 4 # maximum ray of the flower

    for j in range(2):
        ix = range(N*j,N*(j+1))
        t = np.linspace(j*3.12,(j+1)*3.12,N) + np.random.randn(N)*0.2 # theta
        r = a*np.sin(4*t) + np.random.randn(N)*0.2 # radius
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        Y[ix] = j

    X = X.T
    Y = Y.T
    return X, Y
def load_extra_datasets():
    N = 200
    noisy_circles = sklearn.datasets.make_circles(n_samples=N, factor=.5, noise=.3)
    noisy_moons = sklearn.datasets.make_moons(n_samples=N, noise=.2)
    blobs = sklearn.datasets.make_blobs(n_samples=N, random_state=5, n_features=2, centers=6)
    gaussian_quantiles = sklearn.datasets.make_gaussian_quantiles(mean=None, cov=0.5, n_samples=N, n_features=2, n_classes=2, shuffle=True, random_state=None)
    no_structure = np.random.rand(N, 2), np.random.rand(N, 2)
    return noisy_circles, noisy_moons, blobs, gaussian_quantiles, no_structure
import numpy as np
import matplotlib.pyplot as plt
from testcase import *
import sklearn, sklearn.datasets, sklearn.linear_model
%matplotlib inline
np.random.seed(1)
- testCases : provides some test examples to assess the correctness of your functions
- np.random.seed(x) : seeds NumPy's pseudo-random number generator. Seeding with the same value (e.g., 1) makes every subsequent sequence of random numbers reproducible, so repeated runs produce identical results. A small check is sketched below.
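- A quick sketch (not from the original notebook) illustrating the reproducibility point:

import numpy as np

np.random.seed(1)
first = np.random.randn(3)
np.random.seed(1)
second = np.random.randn(3)
print(np.array_equal(first, second))  # True: the same seed reproduces the same draws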
## Load Dataset
X, Y = load_planar_dataset()
print(X.shape) # 2x400 matrix
print(Y.shape) # 1x400 vector
print(Y.shape[1]) # training size
plt.figure(figsize=(8,6))
plt.scatter(X[0, :], X[1, :], c=Y, s=40, edgecolor='black', cmap=plt.cm.Spectral)
plt.colorbar(label = 'color') # red for 0, Blue for 1
2. Simple Logistic Regression Classifier
- Before jumping right into the shallow neural network, let’s first build a relatively simple classifier with a logistic regression model
- This lets you compare the performance of the two algorithms
- Use the convenient sklearn package to import a logistic regression classifier
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
lr_clf = sklearn.linear_model.LogisticRegressionCV()
lr_clf.fit(X.T, Y.T) # X.T : (D, m)
# plot decision boundary for separating classes (0, 1)
plot_decision_boundary(lambda x : lr_clf.predict(x), X, Y, "Logistic Regression")
# Prediction Accuracy
from sklearn.metrics import accuracy_score
LR_result = lr_clf.predict(X.T)
print("Accuracy with Logistic Regression Classifier : {0}".format(accuracy_score(Y[0,:], LR_result)))
# instead of using the sklearn library, you can also calculate accuracy with the code below
# the two dot products count the examples where label and prediction are both 1, and both 0, respectively
print(float((np.dot(Y, LR_result) + np.dot(1 - Y,1 - LR_result)) / float(Y.size)))
- The accuracy of the logistic regression classifier is not great: only 47%.
- This result implies that the planar dataset is not linearly separable
- So you definitely need another algorithm; hopefully a neural network with a single hidden layer will work better
3. Define Neural Network Model Structure
- Now, let’s finally build a neural network model with one hidden layer to predict classes for the planar dataset
- Here is the representation of our model
Gradient Descent Loop
- Implement forward propagation –> predict
- Compute loss
- Implement backward propagation to get the gradients
- Update parameters (gradient descent)
1) Forward Propagation
- 400 examples with two features, x1 and x2
- The single hidden layer contains 4 hidden units, each with the same tanh activation function
- The activation function of the output layer is sigmoid, as it should return either 0 or 1 (threshold 0.5); the equations are written out below
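- For reference, the forward-propagation equations for one example x^(i), matching the fp() implementation later in this post:

$$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$$
$$a^{[1](i)} = \tanh\big(z^{[1](i)}\big)$$
$$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$$
$$a^{[2](i)} = \hat{y}^{(i)} = \sigma\big(z^{[2](i)}\big)$$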
2) Computing Cost
- The cost function of this NN is the same cross-entropy cost used for logistic regression (written out below)
- Cross-entropy is preferred over squared error because, with a sigmoid output, it gives a much better-behaved optimization problem (in the logistic-regression case it makes the cost convex)
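- The cross-entropy cost over the m training examples:

$$J = -\frac{1}{m}\sum_{i=1}^{m}\Big( y^{(i)}\log a^{[2](i)} + \big(1-y^{(i)}\big)\log\big(1-a^{[2](i)}\big)\Big)$$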
3) Back-Propagation for Gradient Descent
- w := w - α * dJ/dw (α : learning rate), applied to every parameter (W1, b1, W2, b2)
- For gradient descent, you need the partial derivative of the loss L with respect to the parameter of interest
- So, to compute the partial derivative of L with respect to w (dw), you first compute it with respect to a (da), then with respect to z (dz), and finally with respect to w
- The same applies to parameter b
- dL/dw (dW) = dL/da (da) * da/dz (dz) * dz/dw
- How to back-propagate through the sigmoid activation function
- Derivative of the activation functions : da/dz (dz)
- The derivatives of the two activation functions used in this post are written out below
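- The two standard derivative results needed here, stated in terms of the activation a = g(z):

$$\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big) = a(1-a), \qquad \tanh'(z) = 1-\tanh^{2}(z) = 1-a^{2}$$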
Summary of Gradient Descent for Our Model
- we have one input layer (400 examples with 2 features) : X (2, 400)
- one hidden layer (4 units with ‘tanh’ activation function) : W[1] (4, 2), b[1] (4, 1), Z[1] -> a[1]
- one output layer (one unit with sigmoid activation function) : W[2] (1, 4), b[2] (1, 1), Z[2] -> a[2]
- g[1]’(z[1]) here is 1 - a[1]^2 (as g[1](z) is tanh(z))
- in code : 1 - np.power(A1, 2)
- all of the gradient formulas used in the back-propagation step are collected below
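- Summary of the vectorized gradients (these match the bp() implementation in the next section):

$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} {A^{[1]}}^{T}, \qquad db^{[2]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[2](i)}$$
$$dZ^{[1]} = {W^{[2]}}^{T} dZ^{[2]} \ast \big(1 - (A^{[1]})^{2}\big)$$
$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}, \qquad db^{[1]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[1](i)}$$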
4. Build Model
- We’ve just defined the model structure, decision boundary, cost function and the loop for gradient descent!
- Let’s build a function named ‘nn_model()’ that assembles the desired neural network
Set sizes of each layer, input, hidden and output
def layer_sizes(X, Y):
    nx = X.shape[0] # (2, 400) -> 2 : size of input layer
    nh = 4          # number of hidden units
    ny = Y.shape[0] # (1, 400) -> 1 : size of output layer
    return (nx, nh, ny)
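A quick sanity check (assuming X and Y are the planar dataset loaded above):

print(layer_sizes(X, Y))  # (2, 4, 1) : 2 input features, 4 hidden units, 1 output unit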
Initialize model parameters randomly
def initialize_params(nx, nh, ny):
    np.random.seed(2) # set random state so that our outcomes have the same value even with randomized initialization
    W1 = np.random.randn(nh, nx)*0.01 # from input to hidden layer (4, 2)
    b1 = np.zeros((nh, 1)) # (4, 1)
    W2 = np.random.randn(ny, nh)*0.01 # (1, 4)
    b2 = np.zeros((ny, 1)) # (1, 1)

    # confirm that each param has the right shape
    assert(W1.shape==(nh, nx) and b1.shape==(nh, 1) and W2.shape==(ny, nh) and b2.shape==(ny, 1))

    params = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
    return params
Forward Propagation
def fp(X, params):
    W1, b1, W2, b2 = (params['W1'], params['b1'], params['W2'], params['b2'])

    Z1 = np.dot(W1, X) + b1 # (4, 2) x (2, 400) = (4, 400) + b1 (4, 1) broadcasting
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2 # (1, 4) x (4, 400) = (1, 400) + b2 (1, 1) broadcasting
    A2 = sigmoid(Z2) # probability that x is 1

    assert(A2.shape == (1, X.shape[1]))

    cache = {'Z1' : Z1, 'A1' : A1, 'Z2' : Z2, 'A2' : A2}
    return cache
Calculate Cost
# now calculate the cost (amount of deviation of A2 from Y)
# cost function : J = -(1/m) * sum_{i=1..m} [ y(i)*log(a[2](i)) + (1-y(i))*log(1-a[2](i)) ]
def compute_cost(A2, Y):
    m = Y.shape[1]
    tmp = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1-Y)
    cost = -np.sum(tmp)/m
    cost = float(np.squeeze(cost)) # remove axes whose size is 1
    assert(isinstance(cost, float))
    return cost
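A small sanity check (not part of the original notebook): with small random weights, A2 stays close to sigmoid(0) = 0.5 for every example, so the initial cost should be roughly log(2) ≈ 0.693. This assumes X, Y and the functions above are already defined.

nx, nh, ny = layer_sizes(X, Y)
init_params = initialize_params(nx, nh, ny)
print(compute_cost(fp(X, init_params)['A2'], Y))  # expected to be close to 0.693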
Compute gradient descent
# now let's compute gradient descent of neural network with back-propagation
def bp(params, cache, X, Y):
    """
    Returns:
    grads -- gradients with respect to the different parameters (dW1, dW2, db1, db2)
    """
    m = X.shape[1] # 400
    W1, b1, W2, b2 = (params['W1'], params['b1'], params['W2'], params['b2'])
    Z1, A1, Z2, A2 = (cache['Z1'], cache['A1'], cache['Z2'], cache['A2'])

    dZ2 = A2 - Y # (1, 400)
    dW2 = (1/m)*np.dot(dZ2, A1.T) # (1, 400) x (400, 4) -> (1, 4)
    db2 = (1/m)*np.sum(dZ2, axis=1, keepdims=True) # (1, 1)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - np.power(A1, 2)) # (4, 1) x (1, 400) * (4, 400) -> (4, 400)
    dW1 = (1/m)*np.dot(dZ1, X.T) # (4, 400) x (400, 2) -> (4, 2)
    db1 = (1/m)*np.sum(dZ1, axis=1, keepdims=True) # (4, 1)

    grads = {'dW1': dW1, 'dW2': dW2, 'db1' : db1, 'db2': db2}
    return grads
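A minimal numerical gradient check, sketched here as an optional extra (it is not part of the original notebook, and grad_check_one_entry is just an illustrative name): perturb one entry of W1 and compare the finite-difference slope of the cost with the analytic gradient returned by bp(). It assumes X, Y and the functions defined above.

def grad_check_one_entry(X, Y, i=0, j=0, eps=1e-5):
    nx, nh, ny = layer_sizes(X, Y)
    params = initialize_params(nx, nh, ny)
    # analytic gradient at the unperturbed parameters
    grads = bp(params, fp(X, params), X, Y)

    # cost at W1[i, j] + eps and W1[i, j] - eps
    params['W1'][i, j] += eps
    cost_plus = compute_cost(fp(X, params)['A2'], Y)
    params['W1'][i, j] -= 2 * eps
    cost_minus = compute_cost(fp(X, params)['A2'], Y)
    params['W1'][i, j] += eps  # restore the original value

    numeric = (cost_plus - cost_minus) / (2 * eps)
    analytic = grads['dW1'][i, j]
    print(numeric, analytic)  # the two values should agree to several decimal places

grad_check_one_entry(X, Y)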
Summary : forward propagation, the cost, and the gradients are all in place; the last piece of the gradient descent loop is the parameter update.
Update Parameters
def update_params(params, grads, lr = 1.2):
    W1, b1, W2, b2 = (params['W1'], params['b1'], params['W2'], params['b2'])
    dW1, db1, dW2, db2 = (grads['dW1'], grads['db1'], grads['dW2'], grads['db2'])

    # note: the in-place updates (-=) also modify the arrays stored in params
    W1 -= lr*dW1 # (4, 2)
    W2 -= lr*dW2 # (1, 4)
    b1 -= lr*db1 # (4, 1)
    b2 -= lr*db2 # (1, 1)

    updated_params = {'W1' : W1, 'b1' : b1, 'W2' : W2, 'b2' : b2}
    return updated_params
Build NN model
def nn_model(X, Y, nh, num_iter, print_cost):
    """
    Arguments:
    num_iter -- Number of iterations in the gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    params, best_params, min_cost, learning_curve -- parameters learnt by the model (fp -> compute cost & bp -> update)
    """
    np.random.seed(3)
    nx, _, ny = layer_sizes(X, Y)
    params = initialize_params(nx, nh, ny)

    min_cost = float('inf')
    learning_curve = []
    for i in range(num_iter):
        cache = fp(X, params)
        A2 = cache['A2']
        cost = compute_cost(A2, Y)
        grads = bp(params, cache, X, Y)
        params = update_params(params, grads)
        if cost < min_cost:
            min_cost = cost
            # copy the arrays: update_params modifies them in place, so storing a
            # plain reference would not actually keep the best parameters
            best_params = {k: v.copy() for k, v in params.items()}
        if print_cost and i%10 == 0:
            learning_curve.append(cost)
            if i%1000 == 0:
                print("Cost after iteration {0} : {1}".format(i, cost))
                # print(cost)
    return params, best_params, min_cost, learning_curve
- Now, finally, we’ve made our NN model!
- From now on, we will predict the classes of examples (either 0 or 1) using our model
- Decision Rule : predict class 1 when the output activation exceeds 0.5, otherwise class 0 (see the formula below)
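- The decision rule implemented by predict() below:

$$\hat{y}^{(i)} = \begin{cases} 1 & \text{if } a^{[2](i)} > 0.5 \\ 0 & \text{otherwise} \end{cases}$$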
Predict classes of X
def predict(best_params, X):
    """
    Using the learned parameters, predicts a class for each example in X

    Arguments:
    best_params -- python dictionary containing your parameters
    X -- input data of size (n_x, m)

    Returns:
    preds -- vector of predictions of our model (red: 0 / blue: 1)
    """
    cache = fp(X, best_params)
    A2 = cache['A2'] # predicted probabilities (not classes)
    preds = A2 > 0.5 # (1, 400)
    return preds
# Result
params, best_params, min_cost, learning_curve = nn_model(X, Y, 4, 10000, 1)
# plt.figure(figsize=(8, 5))
# plt.axis([0, 1000, 0, max(learning_curve)]) # [xmin, xmax, ymin, ymax]
plt.plot(learning_curve)
plt.xlabel("Iterations")
plt.ylabel("Cost")
plt.title("Learning Curve (Cost by Iterations)", size=15)
# x passed in by plot_decision_boundary has shape (n_points, 2), hence the transpose in the lambda below
plt.figure(figsize=(8, 6))
plot_decision_boundary(lambda x : predict(best_params, x.T), X, Y, "Neural Network Model with a Single Hidden Layer")
# note: this redefines the name accuracy_score imported from sklearn.metrics above
def accuracy_score(preds, Y, nh):
    m = Y.shape[1]
    error = float((np.dot(1-Y, preds.T) + np.dot(Y, 1-preds.T))/m)
    # same as np.sum(np.multiply(1-Y, preds) + np.multiply(Y, 1-preds))/m
    print("Accuracy with Neural Network with 1 Hidden Layer with {0} Units: {1} %".format(nh, (1-error)*100))
    return (1-error)*100
_, best_params, _, _ = nn_model(X, Y, 4, 10000, 0)
preds = predict(best_params, X)
m = X.shape[1]
accuracy_score(preds, Y, 4)
- With our NN model (1 hidden layer with 4 units), we’ve achieved 90% accuracy
- Previously, the accuracy of the Logistic Regression classifier was only 47%
- So a NN with only one hidden layer can clearly outperform logistic regression on this data
- Our NN model has learnt the leaf patterns of the flower, which shows NN can learn even highly non-linear decision boundaries, unlike logistic regression.
5. Compare Accuracy of NN with Different Unit Sizes of Hidden Layer
plt.figure(figsize=(12, 24))
h_sizes = [1, 4, 8, 12, 16, 20, 50]
for i, nh in enumerate(h_sizes):
    plt.subplot(4, 2, i+1)
    _, best_params, _, _ = nn_model(X, Y, nh, 10000, 0)
    preds = predict(best_params, X)
    plot_decision_boundary(lambda x : predict(best_params, x.T), X, Y, "Hidden Units {0}".format(nh))
    accuracy_score(preds, Y, nh)
Interpretations :
- The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
- The best hidden layer size seems to be around nh = 8.
- Indeed, values greater than 8 seem to incur noticeable overfitting, as shown in the decision boundary contour plots.
- You will also learn later about regularization, which lets you use very large models (such as n_h = 50) without much overfitting.
6. Performance on Other Datasets
- Now, let’s test our model’s performance on 4 other datasets
- The unit size of the single hidden layer will be fixed at 8, which we found above to be the best size for avoiding overfitting
datasets = dict()
datasets['noisy_circles'], datasets['noisy_moons'], datasets['blobs'], datasets['gaussian_quantiles'], _ = load_extra_datasets()
def accuracy_score_2(preds, Y, d):
    m = Y.shape[1]
    error = float((np.dot(1-Y, preds.T) + np.dot(Y, 1-preds.T))/m)
    # same as np.sum(np.multiply(1-Y, preds) + np.multiply(Y, 1-preds))/m
    print("Accuracy for Dataset <{0}> with Unit Size 8 : {1}%".format(d, (1-error)*100))
    return (1-error)*100
plt.figure(figsize=(12, 24))
for i, d in enumerate(datasets.keys()):
    plt.subplot(3, 2, i+1)
    X, Y = datasets[d]
    # print(X.shape, Y.shape)
    X, Y = X.T, Y.reshape(1, Y.shape[0])
    if d == 'blobs':
        Y = Y%2
    _, best_params, _, _ = nn_model(X, Y, 8, 10000, 0)
    preds = predict(best_params, X)
    plot_decision_boundary(lambda x : predict(best_params, x.T), X, Y, "Datasets : {0} with Unit size 8".format(d))
    accuracy_score_2(preds, Y, d)
- The performance of our NN model differs quite a bit across datasets
- Still, it is clear that our model can learn highly non-linear, complex decision boundaries with reasonably good accuracy