[Neural Networks and Deep Learning] Basics of Neural Network Programming
2022, Apr 20
Binary Classification
- Example : Cat Classifier
- given an image, you can represent it as three 64 x 64 matrices corresponding to the Red, Green, and Blue pixel intensity values
- then unroll these matrices into a single feature vector x of size (64 x 64 x 3, 1) = (12288, 1)
- Notation
- stack the different training examples into different columns of X and Y
- X.shape = (nx, m)
- nx : the length of x(i), i.e. the size of all the R, G, B matrices unrolled (64 x 64 x 3 = 12288)
- Y = [y(1), y(2), ..., y(m)], with Y.shape = (1, m)
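- a minimal sketch of building X and Y this way (the image array and labels below are random stand-ins, not real data):
import numpy as np

m = 5                                  # number of training examples (arbitrary for this demo)
images = np.random.rand(m, 64, 64, 3)  # stand-in for m cat / non-cat images
labels = np.random.randint(0, 2, m)    # stand-in binary labels

X = images.reshape(m, -1).T            # each column is one unrolled example: shape (12288, m)
Y = labels.reshape(1, m)               # row vector of labels: shape (1, m)

print(X.shape)   # (12288, 5)
print(Y.shape)   # (1, 5)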
Logistic Regression as a Neural Network
- Binary output : the output y is always either 1 or 0
- you want ŷ, an estimate of the probability that y = 1 given x
- In Linear Regression
- you can get the output using the equation ŷ = w^T x + b
- w : (nx, 1) vector of weights, one per feature / b : a real number (the intercept)
- BUT with this linear function you can't get what you want: the probability that y = 1 for the given example (a value between 0 and 1)
- In Logistic Regression
- Instead, you can use the sigmoid function, which produces an output between 0 and 1: ŷ = g(z) = 1 / (1 + e^(-z))
- here, z is the value obtained from the linear part: z = w^T x + b
- as z increases toward +infinity, g(z) converges to 1, whereas as z decreases toward -infinity, g(z) converges to 0
- all g(z) values lie between 0 and 1
- when z equals 0, g(z) is 0.5
- Alternatively, you can add an extra input x0 = 1 and let w0 = b, so that the intercept is absorbed into the w^T x term
- the outcome: ŷ = g(w^T x), with x and w augmented as above
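- a minimal sketch of the sigmoid and a single forward pass (shapes and values are illustrative, not from real data):
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

nx = 12288
w = np.zeros((nx, 1))         # weights
b = 0.0                       # intercept
x = np.random.rand(nx, 1)     # one unrolled image

z = np.dot(w.T, x) + b        # z = w^T x + b, shape (1, 1)
y_hat = sigmoid(z)            # estimate of P(y = 1 | x)
print(y_hat)                  # 0.5 here, since w = 0 gives z = 0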
Logistic Regression Cost Function
- Training set of m training examples; each example x(i) is an (n+1)-length column vector (with x0 = 1)
- Given the training set, how do we choose/fit θ?
- The cost function of linear regression was J(θ) = (1/m) Σ 1/2 (hθ(x(i)) - y(i))^2
- Instead of writing the squared error term directly, we can write
- cost(hθ(x(i)), y(i)) = 1/2 (hθ(x(i)) - y(i))^2
- which evaluates the cost for an individual example using the same measure as used in linear regression
- We can then redefine J(θ) as
- J(θ) = (1/m) Σ cost(hθ(x(i)), y(i))
- which, appropriately, is the sum of all the individual costs over the training data, scaled by 1/m (i.e. the same as linear regression)
- This is the cost you want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
- Issue : if we use this cost with logistic regression's hθ(x) (the sigmoid), J(θ) is a non-convex function of the parameters
- non-convex function : wavy - has some ‘valleys’ (local minima) that aren’t as deep as the overall deepest ‘valley’ (global minimum).
- Optimization algorithms can get stuck in the local minimum, and it can be hard to tell when this happens.
- A convex logistic regression cost function
- To get around this we need a different, convex Cost() function which means we can apply gradient descent
- This is our logistic regression cost function:
- cost(hθ(x), y) = -log(hθ(x)) if y = 1
- cost(hθ(x), y) = -log(1 - hθ(x)) if y = 0
- This is the penalty the algorithm pays when its prediction is hθ(x) and the actual outcome is y
- Plot the function for each case
- plot for y = 1
- the cost evaluates as -log(hθ(x)): it is 0 when hθ(x) = 1 and grows to infinity as hθ(x) approaches 0
- plot for y = 0
- the cost evaluates as -log(1 - hθ(x)): it is 0 when hθ(x) = 0 and grows to infinity as hθ(x) approaches 1
Combined Cost Function of LR
- Instead of separating the cost function into two parts depending on the value of y (0 or 1), we can compress it into one expression, which makes it more convenient to write out the cost
- cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
- y can only be either 0 or 1
- when y = 0, only -log( 1- hθ(x) ) part remains, which is exactly the same as the original one
- when y =1, only -log( hθ(x) ) part remains
- now you can finally get a convex cost function with a single global optimum
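- a minimal sketch of the combined cost in code, using the θ notation above (the function name and toy data are illustrative):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # combined cost: J(θ) = -(1/m) * Σ [ y*log(h) + (1-y)*log(1-h) ]
    # X: (m, n+1) with a leading column of ones, theta: (n+1,), y: (m,)
    m = y.shape[0]
    h = sigmoid(X @ theta)                        # hθ(x) for every example
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# tiny made-up example just to show the call shape
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(logistic_cost(theta, X, y))  # log(2) ≈ 0.693 when theta = 0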
Optimizing Cost Function w/ Gradient Descent
- Interestingly, the derivative of J(θ) for logistic regression has exactly the same form as that of linear regression (a proof of this statement is given below)
- First, you would initialize all the weights (w1 ~ wn) to 0, including w0 (the intercept, b)
- and then repeat until convergence: wj := wj - α * ∂J/∂wj, updating all j simultaneously (α is the learning rate)
- (figure: representation of gradient descent converging to the global optimum)
- BUT this optimization algorithm has a serious weakness: explicit double for-loops (see the sketch below)
- the first for-loop runs the iterations of the algorithm until you reach the global optimum
- second, you need a for-loop over all the features
- these explicit for-loops can severely slow down training on a large dataset
- So, instead of this, you need to learn "Vectorization", with which you can get rid of these explicit for-loops
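- a sketch of one gradient-descent step written with the explicit loops described above (the outer iteration loop would wrap around this; names and shapes are illustrative):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step_with_loops(w, b, X, Y, alpha):
    # one gradient-descent step with explicit loops; X: (nx, m), Y: (1, m), w: (nx, 1)
    nx, m = X.shape
    dw = np.zeros((nx, 1))
    db = 0.0
    for i in range(m):                  # loop over training examples
        z_i = b
        for j in range(nx):             # loop over features
            z_i += w[j, 0] * X[j, i]
        a_i = sigmoid(z_i)
        for j in range(nx):             # accumulate gradients feature by feature
            dw[j, 0] += (a_i - Y[0, i]) * X[j, i]
        db += a_i - Y[0, i]
    w = w - alpha * dw / m
    b = b - alpha * db / m
    return w, b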
– Proof : Getting Derivative of LR Cost Function –
- Remember hθ(x) is g(θ^T x) = 1 / (1 + e^(-θ^T x))
- Step 1 : take the derivative of the sigmoid g(z) = 1 / (1 + e^(-z)), which gives g'(z) = g(z)(1 - g(z))
- Step 2 : take the partial derivative of J(θ) with respect to θj, which gives ∂J/∂θj = (1/m) Σ (hθ(x(i)) - y(i)) xj(i), the same form as in linear regression
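- the two steps written out (this is the standard derivation, shown in LaTeX):
\begin{align*}
g(z) &= \frac{1}{1 + e^{-z}},
\qquad
g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = g(z)\bigl(1 - g(z)\bigr) \\[6pt]
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m}\Bigl[\, y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr] \\
\frac{\partial J}{\partial \theta_j}
  &= -\frac{1}{m}\sum_{i=1}^{m}\Bigl[\frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})}\Bigr]\, h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\, x_j^{(i)} \\
  &= \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
\end{align*}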
Computation Graph
- Previously, I computed the partial derivative of J (dJ/dθ) using the Chain Rule
- Chain Rule : backward propagation of taking derivatives step by step, from the final output variable (here, v) back to the starting variable (here, a); see the sketch below
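- a minimal sketch of a computation graph and its backward pass, assuming the simple graph u = b*c, v = a + u, J = 3*v (the concrete numbers are made up):
# forward pass through the assumed graph
a, b, c = 5.0, 3.0, 2.0
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# backward pass (chain rule), from the final output J back to the inputs
dJ_dv = 3.0                 # dJ/dv
dJ_da = dJ_dv * 1.0         # dv/da = 1  -> dJ/da = 3
dJ_du = dJ_dv * 1.0         # dv/du = 1  -> dJ/du = 3
dJ_db = dJ_du * c           # du/db = c  -> dJ/db = 6
dJ_dc = dJ_du * b           # du/dc = b  -> dJ/dc = 9

print(J, dJ_da, dJ_db, dJ_dc)   # 33.0 3.0 6.0 9.0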
Vectorization with Python
- vectorization can save you a great amount of time by removing explicit for-loops from your algorithm!
- let's see if it's true with Python code
import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# vectorized version
tic = time.time()
vec = np.dot(a, b)  # inner product of the 1-D vectors a and b
toc = time.time()
print(vec)
print("Vectorized Version : {0}ms".format(1000*(toc-tic)))

# explicit for-loop version
tick = time.time()
skr = 0
for i in range(1000000):
    skr += a[i]*b[i]
tock = time.time()
print(skr)
print("Scalar Version : {0}ms".format(1000*(tock-tick)))
249812.28927442286
Vectorized Version : 2.006053924560547ms
249812.28927442944
Scalar Version : 1888.9873027801514ms
- the results of both versions are the same
- BUT the explicit for-loop version takes about 1000 times longer to compute the inner product of the 1-D vectors a and b
- There are also numpy functions that apply an exponential or log operation to every element of a matrix/vector
- np.log(V), np.exp(V)
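- for example (the output values in the comments are approximate):
import numpy as np

v = np.array([1.0, 2.0, 3.0])
print(np.exp(v))             # elementwise exponential: [ 2.718  7.389 20.086]
print(np.log(v))             # elementwise natural log: [0.     0.693  1.099]
print(1 / (1 + np.exp(-v)))  # e.g. an elementwise sigmoid, with no for-loop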
Logistic Regression with Vectorization
- logistic regression with For-Loops
- suppose we have ‘n’ features
- there are ‘m’ samples
- without vectorization, you have to use 2 for-loops to compute the gradients: one over the features (1 to n) and another over the examples (1 to m)
- Vectorizing Logistic Regression
- with vectorized LR, computing the gradients of the cost function in each iteration takes just two lines of code (see the sketch after this list)
- db = (1/m) * np.sum(dZ)
- dw = (1/m) * np.dot(X, dZ.T)
- you don't need ANY for-loops over the examples or the features
- but even with vectorized LR, you still need a for-loop over the gradient-descent iterations that minimize the cost
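- a minimal sketch of the fully vectorized training loop (the function name, learning rate, and iteration count are illustrative choices):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.01, num_iterations=1000):
    # vectorized gradient descent; X: (nx, m) design matrix, Y: (1, m) labels
    nx, m = X.shape
    w = np.zeros((nx, 1))
    b = 0.0
    for _ in range(num_iterations):        # the one remaining for-loop: over iterations
        Z = np.dot(w.T, X) + b             # (1, m)
        A = sigmoid(Z)                     # predictions for all m examples at once
        dZ = A - Y                         # (1, m)
        dw = np.dot(X, dZ.T) / m           # (nx, 1): the "dw" line above
        db = np.sum(dZ) / m                # scalar:  the "db" line above
        w -= alpha * dw
        b -= alpha * db
    return w, b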
Broadcasting in Python
- It refers to how numpy treats arrays with different shapes during arithmetic operations (+, -, *, /), subject to certain compatibility constraints
- the smaller array is broadcast across the larger array so that they end up with compatible shapes
- Even though numpy broadcasting provides benefits such as convenience and flexibility, it can also cause subtle bugs when misused
- To use only the strengths of numpy broadcasting and avoid its weaknesses:
- it is recommended not to use "rank 1 arrays" like np.random.randn(5), whose shape (5,) behaves non-intuitively
- Instead, use an explicit column vector like np.random.randn(5, 1), as in the example below
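- a short example of broadcasting and of the rank-1-array pitfall (the arrays here are arbitrary):
import numpy as np

# broadcasting: the (1, 3) row is stretched across the rows of the (2, 3) matrix
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
row = np.array([[10.0, 20.0, 30.0]])    # shape (1, 3)
print(A + row)                          # [[11. 22. 33.]  [14. 25. 36.]]

# rank 1 array vs. explicit column vector
a = np.random.randn(5)                  # shape (5,): neither a row nor a column vector
print(a.shape, a.T.shape)               # (5,) (5,): transpose does nothing

b = np.random.randn(5, 1)               # explicit column vector
print(b.shape, b.T.shape)               # (5, 1) (1, 5)
print(np.dot(b.T, b).shape)             # (1, 1): a proper inner product
assert b.shape == (5, 1)                # cheap sanity check on the shape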