[Stanford CS229 01] Linear Regression and Gradient Descent
OUTLINE
- MULTIVARIATE LINEAR REGRESSION
- BATCH / STOCHASTIC GRADIENT DESCENT
- NORMAL EQUATION
1. Multivariate Linear Regression
1.1. Multiple Features
- $x^{i}$ : $i$-th input variables (set of features)
- $y^{i}$ : $i$-th output variable (target variable) that we’re trying to predict
- $(x^{i}, y^{i})$ for $i = 1, 2, 3, …, m$ : training dataset
- Hypothesis
$ h_{\theta}(x) = \sum \limits_{j=0}^{n} \theta_{j}x_{j} $
- $x_{j}$ for $j = 1, 2, 3, …, n$ : value of the $j$-th of the n input features
- set $x_{0}$ to 1, so that $\theta_{0}$ plays the role of the intercept term b
- $\theta_{j}$ for $j = 0, 1, …, n$ : the parameters (weights) parameterizing the space of linear functions mapping from x to y
- Matrix Representation of Hypothesis
$\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{bmatrix}, \quad x^{i} = \begin{bmatrix} x^{i}_{0} \\ x^{i}_{1} \\ \vdots \\ x^{i}_{n} \end{bmatrix}$
$h(x^{i}) = \theta^{T}x^{i}$
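A minimal NumPy sketch of the hypothesis (the function name and the explicit intercept handling are illustrative choices, not from the lecture):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, with the intercept term x_0 = 1 prepended.

    theta : (n + 1,) parameter vector; theta[0] is the intercept term b
    x     : (n,) raw feature vector
    """
    x = np.concatenate(([1.0], x))   # set x_0 = 1
    return float(np.dot(theta, x))
```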
1.2. Cost Function
- Trying to minimize the deviations of $h(x)$ from $y$
- Least Mean Square (LMS) algorithm
$ J(\theta) = \frac{1}{2}\sum\limits_{i=1}^{m} (h_{\theta}(x^{i}) - y^{i})^{2} $ (the $\frac{1}{2}$ factor simplifies the derivative below)
- LMS algorithm with Gradient Descent
- the algorithm starts with some initial guess for $\theta_{j}$ (randomized values) and repeatedly updates the parameters using the gradient descent rule
- take the partial derivative of $J(\theta)$ with respect to each parameter, multiply it by the learning rate ($\alpha$), and subtract it from the previous value of the parameter
$ \theta_{j} := \theta_{j} - \alpha\frac{\partial J(\theta)}{\partial \theta_{j}} $ for $j = 0, 1, 2, …, n$
- $\alpha$ (learning rate) : controls the step size of each update; too large a value can overshoot the minimum and diverge, too small a value makes convergence slow
- in practice, try multiple values and pick the one that works best
- repeat updating the parameters at every step of gradient descent
- Partial Derivative of $J(\theta)$ (derived below for a single training example)
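For a single training example $(x, y)$, the chain rule gives (a short derivation, consistent with the $\frac{1}{2}$ factor in $J(\theta)$ above):

$ \frac{\partial}{\partial \theta_{j}} \frac{1}{2}(h_{\theta}(x) - y)^{2} = (h_{\theta}(x) - y) \cdot \frac{\partial}{\partial \theta_{j}} \Big( \sum\limits_{k=0}^{n} \theta_{k}x_{k} - y \Big) = (h_{\theta}(x) - y)\,x_{j} $

which yields the update rule below.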
- $ \theta_{j} := \theta_{j} - \alpha (h_{\theta}(x) - y)x_{j} $ for $j = 0, 1, 2, …, n$
- a larger change is made when the error term ($ h_{\theta}(x) - y $) is larger
- repeat the update until convergence
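A minimal NumPy sketch of this training loop on the full cost $J(\theta)$ (what Section 2 calls batch gradient descent). Names such as `lr` and `n_steps` are illustrative, not from the lecture:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residual = X @ theta - y
    return 0.5 * np.dot(residual, residual)

def batch_gradient_descent(X, y, lr=0.01, n_steps=1000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j until done.

    X : (m, n + 1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) target vector
    """
    theta = np.zeros(X.shape[1])      # or a randomized initial guess
    for _ in range(n_steps):
        residual = X @ theta - y      # error terms h_theta(x_i) - y_i
        grad = X.T @ residual         # dJ/dtheta, summed over all m examples
        theta -= lr * grad            # simultaneous update of every theta_j
    return theta
```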
2. Batch Gradient Descent (BGD) vs Stochastic Gradient Descent (SGD)
- In BGD (1 update per batch):
- the algorithm updates the model parameters after processing the entire training dataset.
- The cost function $ J(\theta) $ is first computed over all the training examples, and then the gradient of the cost function with respect to the parameters is computed.
$ \theta_{j} := \theta_{j} - \alpha \sum\limits_{i=1}^{m} (h_{\theta}(x^{i}) - y^{i})x^{i}_{j} $ for every j
- In SGD (1 update per data point):
- updates the model parameters after processing each individual training example.
- at each iteration, the algorithm randomly selects one training example $(x^{i}, y^{i})$, computes the gradient with respect to that example alone, and then updates the parameters based on that gradient.
$ \theta_{j} := \theta_{j} - \alpha (h_{\theta}(x^{i}) - y^{i})x^{i}_{j} $ for every j
- BGD processes the entire training set at each iteration, which is computationally expensive but uses the exact gradient.
- SGD processes a single training example at a time, so each update is much cheaper and progress can be much faster.
- While SGD has a computational advantage over BGD, it may never converge exactly to the minimum, instead oscillating around it.
- Therefore, BGD can converge to the optimum more accurately and quickly on small datasets, while SGD can converge faster on large datasets (see the sketch below).
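A corresponding SGD sketch, with one update per training example; shuffling each epoch stands in for the random selection described above (all names are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10, seed=0):
    """One parameter update per individual training example.

    X : (m, n + 1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) target vector
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(m):       # visit examples in random order
            error = X[i] @ theta - y[i]    # h_theta(x_i) - y_i for one example
            theta -= lr * error * X[i]     # update immediately, per example
    return theta
```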
Check for Convergence with Stochastic Gradient Descent
- how to check whether SGD has converged to (or at least gotten close to) the minimum
- how to tune the learning rate α to get proper convergence?
- Plot $ J(\theta) $ averaged over the last N examples; the four cases below correspond to the four panels of the plot from the lecture
- decrease learning rate (upper left)
- convergence becomes slower
- but a slightly better final cost is obtained (sometimes negligibly better)
- increase N, e.g. N >= 5000 (upper right)
- takes more time to plot (longer to get a single plotting point)
- but smooths the cost line
- increase N (lower left)
- the line fluctuates too much, preventing you from seeing the actual trend
- if you raise N, you can see what’s actually going on
- decrease learning rate (lower right)
- a rising cost shows that the algorithm fails to converge to the minimum (diverging, failing to find optimal parameters)
- make your learning rate smaller, so that it can converge
- Learning rate (α)
- typically, α is held constant through the entire learning process
- but α can also be slowly decreased over time (if you want the model to converge better)
- $\alpha = \beta \,/\, (\text{iterationNumber} + \gamma)$
- requires additional time to decide what $\beta$ and $\gamma$ should be
- guaranteed to converge somewhere rather than oscillating around the minimum
- SGD can be a good algorithm for massive training sets
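A sketch combining the two diagnostics above: averaging the per-example cost over the last N examples (the quantity you would plot) and the $\alpha = \beta / (\text{iterationNumber} + \gamma)$ schedule. The values of `N`, `beta`, and `gamma` are illustrative:

```python
import numpy as np

def sgd_with_diagnostics(X, y, beta=1.0, gamma=100.0, N=1000, n_epochs=10, seed=0):
    """SGD with a decaying learning rate and an averaged-cost trace for plotting."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    recent, avg_costs = [], []             # avg_costs: one point per N examples
    t = 0                                  # global iteration number
    for _ in range(n_epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            recent.append(0.5 * error**2)  # cost on this example, before updating
            alpha = beta / (t + gamma)     # slowly decreasing learning rate
            theta -= alpha * error * X[i]
            t += 1
            if len(recent) == N:
                avg_costs.append(np.mean(recent))
                recent = []
    return theta, avg_costs
```

Plotting `avg_costs` against the plotting-point index reproduces the kinds of curves described in the four cases above.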
3. Normal Equation
- Closed-form solution for the linear regression problem, which can be used to find the optimal parameters of a linear model (only applicable to the linear regression case)
- provides a way to compute the optimal parameter vector $\theta$ directly from the training data by solving an equation, without the need for an iterative optimization algorithm such as gradient descent
- explicitly take the derivatives of the cost function with respect to the $\theta_{j}$’s and solve by setting them to zero
3.1. Matrix Derivatives
3.2. Properties of $\nabla$ and Trace of Matrix
3.3. Least Mean Square Solved with the Normal Equation
- Setting $\nabla_{\theta} J(\theta) = 0$ yields the closed-form solution $\theta = (X^{T}X)^{-1}X^{T}\vec{y}$, where $X$ is the $m \times (n+1)$ design matrix whose rows are the training inputs and $\vec{y}$ is the vector of targets.
- The amount of computation needed to solve the normal equation grows with n (the number of features) as $O(n^{3})$, dominated by inverting $X^{T}X$.
- For datasets with a smaller number of features, solving the normal equation instead of running iterative gradient descent will be efficient.
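A direct sketch of this closed form; solving the linear system is preferred to forming the inverse $(X^{T}X)^{-1}$ explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, assuming X^T X is invertible.

    X : (m, n + 1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) target vector
    """
    return np.linalg.solve(X.T @ X, X.T @ y)  # solve rather than invert
```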