[Stanford CS229 01] Linear Regression and Gradient Descent
OUTLINE
- MULTIVARIATE LINEAR REGRESSION
- BATCH / STOCHASTIC GRADIENT DESCENT
- NORMAL EQUATION
1. Multivariate Linear Regression
1.1. Multiple Features
- $x^{i}$ : $i$-th input variables (set of features)
- $y^{i}$ : $i$-th output variable (target variable) that we’re trying to predict
- $(x^{i}, y^{i})$ for $i = 1, 2, 3, …, m$ : training dataset
- Hypothesis
$ h_{\theta}(x) = \sum \limits_{j=0}^{n} \theta_{j}x_{j} $
- $x_{j}$ for $j = 1, 2, 3, …, n$ : value of the $j$-th of the n input features
- set $x_{0}$ to 1, so that $\theta_{0}$ plays the role of the intercept term b
- $\theta_{j}$ for $j = 0, 1, …, n$ : the parameters (weights) parameterizing the space of linear functions mapping from x to y
- Matrix Representation of Hypothesis
$\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{bmatrix}, \quad x^{i} = \begin{bmatrix} x^{i}_{0} \\ x^{i}_{1} \\ \vdots \\ x^{i}_{n} \end{bmatrix}$
$h(x^{i}) = \theta^{T}x^{i}$
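A minimal NumPy sketch of the hypothesis (the function name and the explicit intercept handling are illustrative choices, not from the lecture):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, with the intercept term x_0 = 1 prepended.

    theta : (n + 1,) parameter vector; theta[0] is the intercept term b
    x     : (n,) raw feature vector
    """
    x = np.concatenate(([1.0], x))   # set x_0 = 1
    return float(np.dot(theta, x))
```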
1.2. Cost Function
- Trying to minimize the deviations of $h(x)$ from $y$
- Least Mean Square (LMS) algorithm
$ J(\theta) = \frac{1}{2}\sum\limits_{i=1}^{m} (h_{\theta}(x^{i}) - y^{i})^{2} $ (the $\frac{1}{2}$ factor simplifies the derivative below)
- LMS algorithm with Gradient Descent
- the algorithm starts with some initial guess for $\theta_{j}$ (randomized values) and repeatedly updates the parameters using the gradient descent rule
- take the partial derivative of $J(\theta)$ with respect to each parameter, multiply it by the learning rate ($\alpha$), and subtract it from the previous value of the parameter
$ \theta_{j} := \theta_{j} - \alpha\frac{\partial J(\theta)}{\partial \theta_{j}} $ for $j = 0, 1, 2, …, n$
- $\alpha$ (learning rate) : controls the step size of each update; too large a value can overshoot the minimum and diverge, too small a value makes convergence slow
- in practice, try multiple values and pick the one that works best
- repeat updating the parameters at every step of gradient descent
- Partial Derivative of $J(\theta)$ (derived below for a single training example)
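For a single training example $(x, y)$, the chain rule gives (a short derivation, consistent with the $\frac{1}{2}$ factor in $J(\theta)$ above):

$ \frac{\partial}{\partial \theta_{j}} \frac{1}{2}(h_{\theta}(x) - y)^{2} = (h_{\theta}(x) - y) \cdot \frac{\partial}{\partial \theta_{j}} \Big( \sum\limits_{k=0}^{n} \theta_{k}x_{k} - y \Big) = (h_{\theta}(x) - y)\,x_{j} $

which yields the update rule below.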
- $ \theta_{j} := \theta_{j} - \alpha (h_{\theta}(x) - y)x_{j} $ for $j = 0, 1, 2, …, n$
- a larger change is made when the error term ($ h_{\theta}(x) - y $) is larger
- repeat the update until convergence
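A minimal NumPy sketch of this training loop on the full cost $J(\theta)$ (what Section 2 calls batch gradient descent). Names such as `lr` and `n_steps` are illustrative, not from the lecture:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residual = X @ theta - y
    return 0.5 * np.dot(residual, residual)

def batch_gradient_descent(X, y, lr=0.01, n_steps=1000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j until done.

    X : (m, n + 1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) target vector
    """
    theta = np.zeros(X.shape[1])      # or a randomized initial guess
    for _ in range(n_steps):
        residual = X @ theta - y      # error terms h_theta(x_i) - y_i
        grad = X.T @ residual         # dJ/dtheta, summed over all m examples
        theta -= lr * grad            # simultaneous update of every theta_j
    return theta
```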
2. Batch Gradient Descent (BGD) vs Stochastic Gradient Descent (SGD)
- In BGD (1 update per batch):
- the algorithm updates the model parameters after processing the entire training dataset.
- The cost function $ J(\theta) $ is first computed over all the training examples, and then the gradient of the cost function with respect to the parameters is computed.
$ \theta_{j} := \theta_{j} - \alpha \sum\limits_{i=1}^{m} (h_{\theta}(x^{i}) - y^{i})x^{i}_{j} $ for every j
- In SGD (1 update per data point):
- updates the model parameters after processing each individual training example.
- at each iteration, the algorithm randomly selects one training example $(x^{i}, y^{i})$, computes the gradient with respect to that example alone, and then updates the parameters based on that gradient.
$ \theta_{j} := \theta_{j} - \alpha (h_{\theta}(x^{i}) - y^{i})x^{i}_{j} $ for every j
- BGD processes the entire training set at each iteration, which is computationally expensive but uses the exact gradient.
- SGD processes a single training example at a time, so each update is much cheaper and progress can be much faster.
- While SGD has a computational advantage over BGD, it may never converge exactly to the minimum, instead oscillating around it.
- Therefore, BGD can converge to the optimum more accurately and quickly on small datasets, while SGD can converge faster on large datasets (see the sketch below).
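A corresponding SGD sketch, with one update per training example; shuffling each epoch stands in for the random selection described above (all names are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10, seed=0):
    """One parameter update per individual training example.

    X : (m, n + 1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) target vector
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(m):       # visit examples in random order
            error = X[i] @ theta - y[i]    # h_theta(x_i) - y_i for one example
            theta -= lr * error * X[i]     # update immediately, per example
    return theta
```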
Check for Convergence with Stochastic Gradient Descent
- how to check whether SGD has converged to (or at least gotten close to) the minimum
- how to tune the learning rate α to get proper convergence?
- Plot $ J(\theta) $ averaged over the last N examples; the four cases below correspond to the four panels of the plot from the lecture
- decrease learning rate (upper left)
- convergence becomes slower
- but a slightly better final cost is obtained (sometimes negligibly better)
- increase N, e.g. N >= 5000 (upper right)
- takes more time to plot (longer to get a single plotting point)
- but smooths the cost line
- increase N (lower left)
- the line fluctuates too much, preventing you from seeing the actual trend
- if you raise N, you can see what’s actually going on
- decrease learning rate (lower right)
- a rising cost shows that the algorithm fails to converge to the minimum (diverging, failing to find optimal parameters)
- make your learning rate smaller, so that it can converge
- Learning rate (α)
- typically, α is held constant through the entire learning process
- but α can also be slowly decreased over time (if you want the model to converge better)
- $\alpha = \beta \,/\, (\text{iterationNumber} + \gamma)$
- requires additional time to decide what $\beta$ and $\gamma$ should be
- guaranteed to converge somewhere rather than oscillating around the minimum
- SGD can be a good algorithm for massive training sets
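A sketch combining the two diagnostics above: averaging the per-example cost over the last N examples (the quantity you would plot) and the $\alpha = \beta / (\text{iterationNumber} + \gamma)$ schedule. The values of `N`, `beta`, and `gamma` are illustrative:

```python
import numpy as np

def sgd_with_diagnostics(X, y, beta=1.0, gamma=100.0, N=1000, n_epochs=10, seed=0):
    """SGD with a decaying learning rate and an averaged-cost trace for plotting."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    recent, avg_costs = [], []             # avg_costs: one point per N examples
    t = 0                                  # global iteration number
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            recent.append(0.5 * error**2)  # cost on this example, before updating
            alpha = beta / (t + gamma)     # slowly decreasing learning rate
            theta -= alpha * error * X[i]
            t += 1
            if len(recent) == N:
                avg_costs.append(np.mean(recent))
                recent = []
    return theta, avg_costs
```

Plotting `avg_costs` against the plotting-point index reproduces the kinds of curves described in the four cases above.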
3. Normal Equation
- Closed-form solution for the linear regression problem, which can be used to find the optimal parameters of a linear model (only applicable to the linear regression case)
- provides a way to compute the optimal parameter vector $\theta$ directly from the training data by solving an equation, without the need for an iterative optimization algorithm such as gradient descent
- explicitly take the derivatives of the cost function with respect to the $\theta_{j}$’s and solve by setting them to zero
3.1. Matrix Derivatives
3.2. Properties of $\nabla$ and Trace of Matrix
3.3. Least Mean Square Solved with the Normal Equation
- Setting $\nabla_{\theta} J(\theta) = 0$ yields the closed-form solution $\theta = (X^{T}X)^{-1}X^{T}\vec{y}$, where $X$ is the $m \times (n+1)$ design matrix whose rows are the training inputs and $\vec{y}$ is the vector of targets.
- The amount of computation needed to solve the normal equation grows with n (the number of features) as $O(n^{3})$, dominated by inverting $X^{T}X$.
- For datasets with a smaller number of features, solving the normal equation instead of running iterative gradient descent will be efficient.
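A direct sketch of this closed form; solving the linear system is preferred to forming the inverse $(X^{T}X)^{-1}$ explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, assuming X^T X is invertible.

    X : (m, n + 1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) target vector
    """
    return np.linalg.solve(X.T @ X, X.T @ y)  # solve rather than invert
```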