[Neural Networks and Deep Learning] Basics of Neural Network Programming

2022, Apr 20    


Binary Classification


  • Example : Cat Classifier
    • with a given image, you can represent it as three 64 x 64 matrices holding the Red, Green, and Blue pixel intensity values
    • now unroll all three matrices into a single feature vector x of size (64 x 64 x 3, 1) = (12288, 1)


  • Notation
    • stack each training example as a column of X and each label as a column of Y


  • X.shape = (nx, m)
  • nx : the length of a single example x(i), i.e. the number of R, G, B values unrolled (here 64 x 64 x 3 = 12288)
  • Y = [y(1), y(2), …, y(m)], so Y.shape = (1, m) (see the sketch below)
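
A minimal numpy sketch of this stacking, using randomly generated stand-in images (the names m, images, and labels are just for illustration):

import numpy as np

m = 5                                   # number of (hypothetical) training examples
images = np.random.rand(m, 64, 64, 3)   # stand-in for real 64 x 64 RGB images
labels = np.array([1, 0, 1, 1, 0])      # 1 = cat, 0 = non-cat

X = images.reshape(m, -1).T             # unroll each image and stack as columns: (12288, m) = (nx, m)
Y = labels.reshape(1, m)                # (1, m)

print(X.shape, Y.shape)                 # (12288, 5) (1, 5)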


Logistic Regression as a Neural Network


  • Binary output : the output y is always either 1 or 0
    • you want an estimate of the probability that y = 1 for a given x
  • In Linear Regression
    • you would get the output from the equation wᵀx + b
    • w : (nx, 1) vector of weights, one per feature / b : a real number (the intercept)
    • BUT with this linear function you can't get what you want, the chance that y = 1 for the given example (a value between 0 and 1)
  • In Logistic Regression
    • Instead, you can pass that value through the sigmoid function, which gives an output between 0 and 1
  • here, z equals the value obtained from the linear part, wᵀx + b
  • as z increases toward +∞, g(z) converges to 1; as z decreases toward -∞, g(z) converges to 0
  • all g(z) values lie between 0 and 1
  • when z = 0, g(z) = 0.5
  • Also, you can alternatively define x0 = 1 and set w0 = b, so that b is absorbed into the wᵀx term
    • here's the outcome : hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx)) (see the sigmoid sketch below)
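
A minimal numpy sketch of the sigmoid described above:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))      # 0.5
print(sigmoid(10))     # close to 1
print(sigmoid(-10))    # close to 0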


Logistic Regression Cost Function


  • Training set of m training examples; each example x(i) is an (n+1)-length column vector (with x0 = 1)
  • Given the training set, how do we choose/fit θ?
    • The cost function of linear regression was the squared error, J(θ) = 1/2m · Σ ( hθ(x(i)) - y(i) )²
  • Instead of writing out the squared error term, we can write
  • cost( hθ(x(i)), y(i) ) = 1/2 ( hθ(x(i)) - y(i) )²
  • which evaluates the cost for an individual example using the same measure as in linear regression
    • We can then redefine J(θ) as J(θ) = 1/m · Σ cost( hθ(x(i)), y(i) )

which, appropriately, is the average of the individual costs over the training data (i.e. the same form as linear regression)

  • This is the cost you want the learning algorithm to pay if the prediction is hθ(x) and the actual outcome is y
  • Issue : if we use this squared-error function for logistic regression, J(θ) is non-convex in the parameters
    • non-convex function : wavy - has some ‘valleys’ (local minima) that aren’t as deep as the overall deepest ‘valley’ (global minimum).
    • Optimization algorithms can get stuck in the local minimum, and it can be hard to tell when this happens.
  • A convex logistic regression cost function
    • To get around this we need a different, convex Cost() function which means we can apply gradient descent
  • This is our logistic regression cost function
    • This is the penalty the algorithm pays
    • Plot the function (see the plotting sketch below)
    1. Plot y = 1
      • the cost evaluates to -log( hθ(x) )
    2. Plot y = 0
      • the cost evaluates to -log( 1 - hθ(x) )
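
A small plotting sketch of the two branches (assuming matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt

h = np.linspace(0.001, 0.999, 200)                       # possible values of h_theta(x)

plt.plot(h, -np.log(h), label="y = 1 : -log(h)")         # cost when the true label is 1
plt.plot(h, -np.log(1 - h), label="y = 0 : -log(1 - h)") # cost when the true label is 0
plt.xlabel("h_theta(x)")
plt.ylabel("cost")
plt.legend()
plt.show()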


Combined Cost Function of LR


  • Instead of separating the cost function into two parts depending on the value of y (0 or 1),
  • we can compress it into one cost function, which makes it more convenient to write out.

    • cost( hθ(x), y ) = -y·log( hθ(x) ) - (1-y)·log( 1 - hθ(x) )
    • y can only be either 0 or 1
    • when y = 0, only the -log( 1 - hθ(x) ) term remains, which is exactly the same as the original case
    • when y = 1, only the -log( hθ(x) ) term remains
  • now! you finally have a convex cost function with a single global optimum (a numpy sketch follows below)
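
A minimal numpy sketch of this combined (cross-entropy) cost; the names A (predictions) and Y (labels) are just illustrative:

import numpy as np

def cross_entropy_cost(A, Y):
    # A : predictions hθ(x) for each example, shape (1, m); Y : true labels, shape (1, m)
    m = Y.shape[1]
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

A = np.array([[0.9, 0.2, 0.7]])   # hypothetical predictions
Y = np.array([[1,   0,   1  ]])   # matching true labels
print(cross_entropy_cost(A, Y))   # small cost, since the predictions agree with the labels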


Optimizing Cost Function w/ Gradient Descent


  • Interestingly, the derivative of J(θ) for logistic regression has exactly the same form as that of linear regression (the proof of this statement is covered later)
  • First, initialize all the weights (w1 ~ wn) to 0, including w0 (the intercept, b)
  • then repeat the update wj := wj - α·∂J/∂wj for every parameter simultaneously, until convergence
  • Representation of the process of descending to the global optimum
  • BUT! this implementation has a serious weakness, which is an explicit double for-loop
  • within each gradient-descent step, the first for-loop runs over the m training examples
  • and the second for-loop runs over all the n features
  • these explicit for-loops can severely slow down training on a large dataset
  • So, instead of this, you need to learn "Vectorization", with which you can get rid of these explicit for-loops (see the loop-based sketch below, and the vectorized version later on)
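
A sketch of what one gradient step looks like with the explicit for-loops (hypothetical shapes: X is (n, m), Y is (1, m), w is (n, 1), b is a scalar):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_with_loops(X, Y, w, b, alpha):
    # one gradient-descent step written with explicit for-loops
    n, m = X.shape
    dw = np.zeros((n, 1))
    db = 0.0
    for i in range(m):                    # loop over the m training examples
        z = b
        for j in range(n):                # loop over the n features
            z += w[j, 0] * X[j, i]
        a = sigmoid(z)
        dz = a - Y[0, i]
        for j in range(n):                # accumulate the per-feature gradients
            dw[j, 0] += X[j, i] * dz
        db += dz
    dw /= m
    db /= m
    return w - alpha * dw, b - alpha * db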


– Proof : Getting Derivative of LR Cost Function –


  • Remember that hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))
  • Step 1 : take the derivative of g(z) = 1 / (1 + e^(-z)), which gives g'(z) = g(z)·(1 - g(z))
  • Step 2 : take the partial derivative of J(θ) with respect to θj, which gives the same form as linear regression (see the derivation sketch below)
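
A compact version of the two steps, written out in LaTeX (standard derivation, for reference):

\[
g(z) = \frac{1}{1 + e^{-z}}, \qquad
g'(z) = \frac{e^{-z}}{(1 + e^{-z})^{2}} = g(z)\bigl(1 - g(z)\bigr)
\]
\[
\frac{\partial}{\partial \theta_j}
\Bigl[ -y \log h_\theta(x) - (1 - y)\log\bigl(1 - h_\theta(x)\bigr) \Bigr]
= \bigl(h_\theta(x) - y\bigr)\, x_j
\]
\[
\frac{\partial J(\theta)}{\partial \theta_j}
= \frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}
\]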


Computation Graph


  • Previously, I figured out the partial derivative of J (dJ/dθ) by using the Chain Rule
    • Chain Rule : propagate derivatives backward, from the final output variable (here, v) step by step back to the starting variable (here, a), multiplying the local derivatives along the way (see the sketch below)
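
A minimal numeric sketch of a computation graph and its backward pass, assuming the usual course example J = 3(a + b·c) with intermediate nodes u = b·c and v = a + u:

# forward pass
a, b, c = 5.0, 3.0, 2.0
u = b * c                   # u = 6
v = a + u                   # v = 11
J = 3 * v                   # J = 33

# backward pass (chain rule), from J back to a, b, c
dJ_dv = 3.0                 # dJ/dv
dJ_du = dJ_dv * 1.0         # dv/du = 1
dJ_da = dJ_dv * 1.0         # dv/da = 1
dJ_db = dJ_du * c           # du/db = c
dJ_dc = dJ_du * b           # du/dc = b

print(dJ_da, dJ_db, dJ_dc)  # 3.0 6.0 9.0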


Vectorization with Python


  • vectorization can save you a great amount of time by removing explicit for-loops from your algorithm!
    • let's see if that's true with some Python code

import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
vec = np.dot(a, b)    # inner product of the 1-D vectors a and b (vectorized)
toc = time.time()

print(vec)
print("Vectorized Version : {0}ms".format(1000*(toc-tic)))

tick = time.time()
skr = 0
for i in range(1000000):    # same inner product with an explicit for-loop
    skr += a[i]*b[i]

tock = time.time()

print(skr)
print("Scalar Version : {0}ms".format(1000*(tock-tick)))
249812.28927442286
Vectorized Version : 2.006053924560547ms
249812.28927442944
Scalar Version : 1888.9873027801514ms
  • the results of both algorithms are the same
  • BUT the explicit for-loop version takes roughly 1000 times longer than np.dot to compute the same inner product of the 1-D vectors a & b

  • There are also numpy functions that apply an exponential or log operation to every element of a matrix/vector
  • np.log(V), np.exp(V) (see the sketch below)
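
A quick sketch of these elementwise operations:

import numpy as np

v = np.array([1.0, 2.0, 3.0])

print(np.exp(v))           # elementwise e**v
print(np.log(v))           # elementwise natural log
print(np.maximum(v, 2.0))  # elementwise maximum
print(v ** 2)              # elementwise square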


Logistic Regression with Vectorization


  • logistic regression with for-loops
    • suppose we have n features
    • and there are m training examples
    • without vectorization, each gradient step needs two for-loops: one over the m examples and one over the n features
  • Vectorizing Logistic Regression
    • with vectorized LR, computing the gradients of the cost in each iteration takes just two lines of code (with dZ = A - Y)
    • db = (1/m) * np.sum(dZ)
    • dw = (1/m) * np.dot(X, dZ.T)
    • you don't need ANY for-loops over the examples or the features
    • but even with vectorized LR, you still need a for-loop over the gradient-descent iterations that minimize the cost (see the sketch below)
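
Putting it together, a minimal sketch of vectorized logistic regression training (assuming X of shape (n, m) and Y of shape (1, m); alpha and iterations are illustrative hyperparameters):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, iterations=1000):
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(iterations):       # the one remaining for-loop: gradient-descent iterations
        Z = np.dot(w.T, X) + b        # (1, m) linear part for all examples at once
        A = sigmoid(Z)                # (1, m) predictions
        dZ = A - Y                    # (1, m) error term
        dw = np.dot(X, dZ.T) / m      # (n, 1) gradient w.r.t. the weights
        db = np.sum(dZ) / m           # scalar gradient w.r.t. the bias
        w -= alpha * dw
        b -= alpha * db
    return w, b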


Broadcasting in Python


  • It refers to how numpy treats arrays of different shapes during arithmetic operations (+, -, *, /), subject to certain compatibility constraints
  • the smaller array is broadcast across the larger array so that the two end up with compatible shapes
  • Broadcasting in python-numpy provides a lot of convenience and flexibility, but it can also cause subtle bugs when misused
  • To use only the strengths of broadcasting and avoid its weak points:
  • it is recommended not to use "rank 1 arrays" like np.random.randn(5), which have a fairly unintuitive shape of (5,)
  • Instead, use an explicit column vector like np.random.randn(5, 1) (see the sketch below)
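
A short sketch of broadcasting and of the rank 1 array pitfall:

import numpy as np

# broadcasting: the scalar and the (1, 3) row are stretched to match the (2, 3) matrix
M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(M + 100)                        # the scalar is broadcast to every element
print(M + np.array([[10, 20, 30]]))   # the (1, 3) row is broadcast over both rows

# rank 1 arrays vs. explicit column vectors
a = np.random.randn(5)        # rank 1 array, shape (5,)  -- avoid
print(a.shape, a.T.shape)     # (5,) (5,) : the transpose does nothing

b = np.random.randn(5, 1)     # proper column vector, shape (5, 1)
print(b.shape, b.T.shape)     # (5, 1) (1, 5)
print(np.dot(b.T, b))         # a (1, 1) matrix, as expected

assert b.shape == (5, 1)      # asserting shapes helps catch broadcasting mistakes early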