[Neural Networks and Deep Learning] Basics of Neural Network Programming

2022, Apr 20    


Binary Classification


  • Example : Cat Classifier
    • with a given image, you can represent it as three 64 x 64 matrices holding the Red, Green, and Blue pixel intensity values
    • now unroll all three matrices into a single feature vector x of size (64 x 64 x 3, 1) = (12288, 1)


  • Notation
    • stack each training example as a column of X and each label as a column of Y


  • X.shape = (nx, m)
  • nx : the length of a single example x(i), i.e. the number of R, G, B values unrolled (here 64 x 64 x 3 = 12288)
  • Y = [y(1), y(2), …, y(m)], so Y.shape = (1, m) (see the sketch below)
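
A minimal numpy sketch of this stacking, using randomly generated stand-in images (the names m, images, and labels are just for illustration):

import numpy as np

m = 5                                   # number of (hypothetical) training examples
images = np.random.rand(m, 64, 64, 3)   # stand-in for real 64 x 64 RGB images
labels = np.array([1, 0, 1, 1, 0])      # 1 = cat, 0 = non-cat

X = images.reshape(m, -1).T             # unroll each image and stack as columns: (12288, m) = (nx, m)
Y = labels.reshape(1, m)                # (1, m)

print(X.shape, Y.shape)                 # (12288, 5) (1, 5)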


Logistic Regression as a Neural Network


  • Binary output : the output y is always either 1 or 0
    • you want an estimate of the probability that y = 1 for a given x
  • In Linear Regression
    • you would get the output from the equation wᵀx + b
    • w : (nx, 1) vector of weights, one per feature / b : a real number (the intercept)
    • BUT with this linear function you can't get what you want, the chance that y = 1 for the given example (a value between 0 and 1)
  • In Logistic Regression
    • Instead, you can pass that value through the sigmoid function, which gives an output between 0 and 1
  • here, z equals the value obtained from the linear part, wᵀx + b
  • as z increases toward +∞, g(z) converges to 1; as z decreases toward -∞, g(z) converges to 0
  • all g(z) values lie between 0 and 1
  • when z = 0, g(z) = 0.5
  • Also, you can alternatively define x0 = 1 and set w0 = b, so that b is absorbed into the wᵀx term
    • here's the outcome : hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx)) (see the sigmoid sketch below)
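
A minimal numpy sketch of the sigmoid described above:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))      # 0.5
print(sigmoid(10))     # close to 1
print(sigmoid(-10))    # close to 0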


Logistic Regression Cost Function


  • Training set of m training examples; each example x(i) is an (n+1)-length column vector (with x0 = 1)
  • Given the training set, how do we choose/fit θ?
    • The cost function of linear regression was the squared error, J(θ) = 1/2m · Σ ( hθ(x(i)) - y(i) )²
  • Instead of writing out the squared error term, we can write
  • cost( hθ(x(i)), y(i) ) = 1/2 ( hθ(x(i)) - y(i) )²
  • which evaluates the cost for an individual example using the same measure as in linear regression
    • We can then redefine J(θ) as J(θ) = 1/m · Σ cost( hθ(x(i)), y(i) )

which, appropriately, is the average of the individual costs over the training data (i.e. the same form as linear regression)

  • This is the cost you want the learning algorithm to pay if the prediction is hθ(x) and the actual outcome is y
  • Issue : if we use this squared-error function for logistic regression, J(θ) is non-convex in the parameters
    • non-convex function : wavy - has some ‘valleys’ (local minima) that aren’t as deep as the overall deepest ‘valley’ (global minimum).
    • Optimization algorithms can get stuck in the local minimum, and it can be hard to tell when this happens.
  • A convex logistic regression cost function
    • To get around this we need a different, convex Cost() function which means we can apply gradient descent
  • This is our logistic regression cost function
    • This is the penalty the algorithm pays
    • Plot the function (see the plotting sketch below)
    1. Plot y = 1
      • the cost evaluates to -log( hθ(x) )
    2. Plot y = 0
      • the cost evaluates to -log( 1 - hθ(x) )
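
A small plotting sketch of the two branches (assuming matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt

h = np.linspace(0.001, 0.999, 200)                       # possible values of h_theta(x)

plt.plot(h, -np.log(h), label="y = 1 : -log(h)")         # cost when the true label is 1
plt.plot(h, -np.log(1 - h), label="y = 0 : -log(1 - h)") # cost when the true label is 0
plt.xlabel("h_theta(x)")
plt.ylabel("cost")
plt.legend()
plt.show()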


Combined Cost Function of LR


  • Instead of separating the cost function into two parts depending on the value of y (0 or 1),
  • we can compress it into one cost function, which makes it more convenient to write out.

    • cost( hθ(x), y ) = -y·log( hθ(x) ) - (1-y)·log( 1 - hθ(x) )
    • y can only be either 0 or 1
    • when y = 0, only the -log( 1 - hθ(x) ) term remains, which is exactly the same as the original case
    • when y = 1, only the -log( hθ(x) ) term remains
  • now! you finally have a convex cost function with a single global optimum (a numpy sketch follows below)
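
A minimal numpy sketch of this combined (cross-entropy) cost; the names A (predictions) and Y (labels) are just illustrative:

import numpy as np

def cross_entropy_cost(A, Y):
    # A : predictions hθ(x) for each example, shape (1, m); Y : true labels, shape (1, m)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

A = np.array([[0.9, 0.2, 0.7]])   # hypothetical predictions
Y = np.array([[1,   0,   1  ]])   # matching true labels
print(cross_entropy_cost(A, Y))   # small cost, since the predictions agree with the labels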


Optimizing Cost Function w/ Gradient Descent


  • Interestingly, the derivative of J(θ) for logistic regression has exactly the same form as that of linear regression (the proof of this statement is covered later)
  • First, initialize all the weights (w1 ~ wn) to 0, including w0 (the intercept, b)
  • then repeat the update wj := wj - α·∂J/∂wj for every parameter simultaneously, until convergence
  • Representation of the process of descending to the global optimum
  • BUT! this implementation has a serious weakness, which is an explicit double for-loop
  • within each gradient-descent step, the first for-loop runs over the m training examples
  • and the second for-loop runs over all the n features
  • these explicit for-loops can severely slow down training on a large dataset
  • So, instead of this, you need to learn "Vectorization", with which you can get rid of these explicit for-loops (see the loop-based sketch below, and the vectorized version later on)
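
A sketch of what one gradient step looks like with the explicit for-loops (hypothetical shapes: X is (n, m), Y is (1, m), w is (n, 1), b is a scalar):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_with_loops(X, Y, w, b, alpha):
    # one gradient-descent step written with explicit for-loops
    n, m = X.shape
    dw = np.zeros((n, 1))
    db = 0.0
    for i in range(m):                    # loop over the m training examples
        z = b
        for j in range(n):                # loop over the n features
            z += w[j, 0] * X[j, i]
        a = sigmoid(z)
        dz = a - Y[0, i]
        for j in range(n):                # accumulate the per-feature gradients
            dw[j, 0] += X[j, i] * dz
        db += dz
    dw /= m
    db /= m
    return w - alpha * dw, b - alpha * db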


– Proof : Getting Derivative of LR Cost Function –


  • Remember that hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))
  • Step 1 : take the derivative of g(z) = 1 / (1 + e^(-z)), which gives g'(z) = g(z)·(1 - g(z))
  • Step 2 : take the partial derivative of J(θ) with respect to θj, which gives the same form as linear regression (see the derivation sketch below)
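
A compact version of the two steps, written out in LaTeX (standard derivation, for reference):

\[
g(z) = \frac{1}{1 + e^{-z}}, \qquad
g'(z) = \frac{e^{-z}}{(1 + e^{-z})^{2}} = g(z)\bigl(1 - g(z)\bigr)
\]
\[
\frac{\partial}{\partial \theta_j}
\Bigl[ -y \log h_\theta(x) - (1 - y)\log\bigl(1 - h_\theta(x)\bigr) \Bigr]
= \bigl(h_\theta(x) - y\bigr)\, x_j
\]
\[
\frac{\partial J(\theta)}{\partial \theta_j}
= \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
\]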


Computation Graph


  • Previously, I figured out the partial derivative of J (dJ/dθ) by using the Chain Rule
    • Chain Rule : propagate derivatives backward, from the final output variable (here, v) step by step back to the starting variable (here, a), multiplying the local derivatives along the way (see the sketch below)
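
A minimal numeric sketch of a computation graph and its backward pass, assuming the usual course example J = 3(a + b·c) with intermediate nodes u = b·c and v = a + u:

# forward pass
a, b, c = 5.0, 3.0, 2.0
u = b * c                   # u = 6
v = a + u                   # v = 11
J = 3 * v                   # J = 33

# backward pass (chain rule), from J back to a, b, c
dJ_dv = 3.0                 # dJ/dv
dJ_du = dJ_dv * 1.0         # dv/du = 1
dJ_da = dJ_dv * 1.0         # dv/da = 1
dJ_db = dJ_du * c           # du/db = c
dJ_dc = dJ_du * b           # du/dc = b

print(dJ_da, dJ_db, dJ_dc)  # 3.0 6.0 9.0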


Vectorization with Python


  • vectorization can save you a great amount of time by removing explicit for-loops from your algorithm!
    • let's see if that's true with some Python code

import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
vec = np.dot(a, b)    # inner product of the 1-D vectors a and b (vectorized)
toc = time.time()

print(vec)
print("Vectorized Version : {0}ms".format(1000*(toc-tic)))

tick = time.time()
skr = 0
for i in range(1000000):    # same inner product with an explicit for-loop
    skr += a[i]*b[i]

tock = time.time()

print(skr)
print("Scalar Version : {0}ms".format(1000*(tock-tick)))
249812.28927442286
Vectorized Version : 2.006053924560547ms
249812.28927442944
Scalar Version : 1888.9873027801514ms
  • the results of both algorithms are the same
  • BUT the explicit for-loop version takes roughly 1000 times longer than np.dot to compute the same inner product of the 1-D vectors a & b

  • There are also numpy functions that apply an exponential or log operation to every element of a matrix/vector
  • np.log(V), np.exp(V) (see the sketch below)
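
A quick sketch of these elementwise operations:

import numpy as np

v = np.array([1.0, 2.0, 3.0])

print(np.exp(v))           # elementwise e**v
print(np.log(v))           # elementwise natural log
print(np.maximum(v, 2.0))  # elementwise maximum
print(v ** 2)              # elementwise square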


Logistic Regression with Vectorization


  • logistic regression with for-loops
    • suppose we have n features
    • and there are m training examples
    • without vectorization, each gradient step needs two for-loops: one over the m examples and one over the n features
  • Vectorizing Logistic Regression
    • with vectorized LR, computing the gradients of the cost in each iteration takes just two lines of code (with dZ = A - Y)
    • db = (1/m) * np.sum(dZ)
    • dw = (1/m) * np.dot(X, dZ.T)
    • you don't need ANY for-loops over the examples or the features
    • but even with vectorized LR, you still need a for-loop over the gradient-descent iterations that minimize the cost (see the sketch below)
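
Putting it together, a minimal sketch of vectorized logistic regression training (assuming X of shape (n, m) and Y of shape (1, m); alpha and iterations are illustrative hyperparameters):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, iterations=1000):
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(iterations):       # the one remaining for-loop: gradient-descent iterations
        Z = np.dot(w.T, X) + b        # (1, m) linear part for all examples at once
        A = sigmoid(Z)                # (1, m) predictions
        dZ = A - Y                    # (1, m) error term
        dw = np.dot(X, dZ.T) / m      # (n, 1) gradient w.r.t. the weights
        db = np.sum(dZ) / m           # scalar gradient w.r.t. the bias
        w -= alpha * dw
        b -= alpha * db
    return w, b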


Broadcasting in Python


  • It refers to how numpy treats arrays of different shapes during arithmetic operations (+, -, *, /), subject to certain compatibility constraints
  • the smaller array is broadcast across the larger array so that the two end up with compatible shapes
  • Broadcasting in python-numpy provides a lot of convenience and flexibility, but it can also cause subtle bugs when misused
  • To use only the strengths of broadcasting and avoid its weak points:
  • it is recommended not to use "rank 1 arrays" like np.random.randn(5), which have a fairly unintuitive shape of (5,)
  • Instead, use an explicit column vector like np.random.randn(5, 1) (see the sketch below)
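
A short sketch of broadcasting and of the rank 1 array pitfall:

import numpy as np

# broadcasting: the scalar and the (1, 3) row are stretched to match the (2, 3) matrix
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(M + 100)                        # the scalar is broadcast to every element
print(M + np.array([[10, 20, 30]]))   # the (1, 3) row is broadcast over both rows

# rank 1 arrays vs. explicit column vectors
a = np.random.randn(5)        # rank 1 array, shape (5,)  -- avoid
print(a.shape, a.T.shape)     # (5,) (5,) : the transpose does nothing

b = np.random.randn(5, 1)     # proper column vector, shape (5, 1)
print(b.shape, b.T.shape)     # (5, 1) (1, 5)
print(np.dot(b.T, b))         # a (1, 1) matrix, as expected

assert b.shape == (5, 1)      # asserting shapes helps catch broadcasting mistakes early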