[Stanford CS229 01] Linear Regression and Gradient Descent


2023, Feb 17    


OUTLINE

  1. MULTIVARIATE LINEAR REGRESSION
  2. BATCH / STOCHASTIC GRADIENT DESCENT
  3. NORMAL EQUATION



1. Multivariate Linear Regression


1.1. Multiple Features


  • $x^{i}$ : $i^{th}$ input variable (vector of features)

  • $y^{i}$ : $i^{th}$ output variable (target variable) that we’re trying to predict

  • $(x^{i}, y^{i})$ for $i = 1, 2, 3,…,m$ : training dataset of $m$ examples

  • Hypothesis

          $ h_{\theta}(x) = \sum \limits_{j=0}^{n} \theta_{j}x_{j} $

    • $x_{j}$ for $j = 0, 1, 2,…,n$ : value of the $j^{th}$ feature among the $n$ input features
    • set $x_{0} = 1$, so that $\theta_{0}$ serves as the intercept term (b)
    • $\theta_{j}$ for $j = 0, 1, 2,…,n$ : $j^{th}$ parameter (weight); together the $\theta_{j}$ parameterize the space of linear functions mapping from $x$ to $y$

  • Matrix Representation of Hypothesis

         $\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{bmatrix}$

         $ x^{i} = \begin{bmatrix} x^{i}_{0} \\ x^{i}_{1} \\ \vdots \\ x^{i}_{n} \end{bmatrix}$

         $h(x^{i}) = \theta^{T}x^{i}$
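
For a concrete feel, here is a minimal NumPy sketch of the hypothesis as an inner product (the feature values and parameters below are made up for illustration):

```python
import numpy as np

# Made-up example with n = 2 features plus the intercept entry x_0 = 1.
theta = np.array([0.5, 2.0, -1.0])   # [theta_0, theta_1, theta_2]
x_i = np.array([1.0, 3.0, 4.0])      # [x_0 = 1, x_1, x_2]

h = theta @ x_i                      # h(x^i) = theta^T x^i
print(h)                             # 0.5 + 2.0*3.0 + (-1.0)*4.0 = 2.5
```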


1.2. Cost Function


  • Trying to minimize the deviations of $h(x)$ from $y$
  • Least Mean Squares (LMS) algorithm

     $ J(\theta) = \frac{1}{2}\sum\limits_{i=1}^{m} (h_{\theta}(x^{i}) - y^{i})^{2} $

    • the factor $\frac{1}{2}$ cancels when taking the derivative, keeping the update rule below clean

  • LMS algorithm with Gradient Descent
    • the algorithm starts with some initial guess for $\theta_{j}$ (e.g., randomized values) and repeatedly updates the parameters using the gradient descent rule
    • take the partial derivative of the cost with respect to each parameter, multiply it by the learning rate ($\alpha$), and subtract the result from the previous value of the parameter

         $ \theta_{j} := \theta_{j} - \alpha\frac{\partial J(\theta)}{\partial \theta_{j}} $   for $j = 0, 1, 2,…,n$

    • $\alpha$ (learning rate) : controls the step size of each update; too large a value can overshoot the minimum or diverge, too small a value makes convergence slow
      • try multiple values and pick the best one
    • repeat the update for all parameters simultaneously at every step of gradient descent

    • Taking the partial derivative of $J(\theta)$ for a single training example $(x, y)$ gives the LMS update rule:

    • $ \theta_{j} := \theta_{j} - \alpha\,(h_{\theta}(x) - y)\,x_{j} $   for $j = 0, 1, 2,…,n$
    • a larger change is made when the error term ($h_{\theta}(x) - y$) is larger
    • repeat the update until convergence
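
Putting this together, below is a minimal NumPy sketch of the full-batch LMS loop (the learning rate, iteration count, and zero initialization are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Minimize J(theta) = 1/2 * sum_i (h_theta(x^i) - y^i)^2.

    X : (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) vector of targets
    """
    theta = np.zeros(X.shape[1])         # initial guess (could also be randomized)
    for _ in range(num_iters):
        errors = X @ theta - y           # h_theta(x^i) - y^i for every example i
        theta -= alpha * (X.T @ errors)  # theta_j -= alpha * sum_i error_i * x^i_j
    return theta
```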



2. Batch Gradient Descent (BGD) vs Stochastic Gradient Descent (SGD)


  • In BGD (1 update per batch):
    • the algorithm updates the model parameters after processing the entire training dataset.
    • The cost function $ J(\theta) $ is first computed over all the training examples, and then the gradient of the cost with respect to the parameters is computed from the whole dataset:
           $ \theta_{j} := \theta_{j} - \alpha \sum\limits_{i=1}^{m} (h_{\theta}(x^{i}) - y^{i})x^{i}_{j} $ for every j


  • In SGD (1 update per data point):
    • updates the model parameters after processing each individual training example.
    • for each iteration, the algorithm randomly selects one training example, computes the gradient with respect to that example, and then updates the parameters based on that gradient.


  • BGD processes the entire training set at each iteration, which is computationally expensive but gives an accurate gradient.
  • SGD processes a single training example at a time, so it can make progress much faster.
  • While SGD has a computational advantage over BGD, it may never converge exactly to the global minimum, instead oscillating around it.
  • Therefore, BGD can converge to the optimum more accurately (and, on small datasets, quickly), while SGD tends to converge faster on large datasets; a sketch of SGD follows below.
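
To make the contrast concrete, here is a minimal SGD sketch (again with illustrative hyperparameters): one parameter update per randomly chosen example, rather than one per pass over the data:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10, seed=0):
    """One parameter update per training example (vs. one per full pass in BGD)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(X.shape[0]):    # visit examples in random order
            error = X[i] @ theta - y[i]          # error on this single example
            theta -= alpha * error * X[i]        # update using this example alone
    return theta
```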


Checking for Convergence with Stochastic Gradient Descent


  • how to check whether SGD has converged to the global minimum (or at least gotten close)?
  • how to tune the learning rate α to get proper convergence?

  • Plotting $ J(\theta) $ averaged over the last N examples (computed before each update), four typical cases:
    1. decrease learning rate (upper left)
      • slows the convergence
      • but obtains a slightly better final cost (sometimes negligibly so)
    2. increase N (e.g., >= 5000) (upper right)
      • takes more time to plot (longer to get a single plotting point)
      • but smooths the cost line
    3. increase N (lower left)
      • with a small N the line fluctuates too much, preventing you from seeing the actual trend
      • if you raise N, you can see what’s actually going on
    4. decrease learning rate (lower right)
      • a rising cost line shows that the algorithm fails to converge to the minimum (it is diverging and fails to find optimal parameters)
      • adjust the learning rate to a smaller value so that it can converge



  • Learning rate (α)
    • typically, α is held constant through the entire learning process
    • but α can also be slowly decreased over time (if you want the model to converge better)
      • α = $\beta\,$ / ($\,$iterationNumber$\,$ + $\gamma$)
      • takes additional time to decide what $\beta$ and $\gamma$ should be
      • guarantees that θ converges somewhere rather than oscillating around the minimum
  • SGD can be a good algorithm for massive numbers of training examples (see the sketch below)
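
Below is a minimal sketch combining both diagnostics: the running cost averaged over the last N examples (recorded before each update) and the decaying learning rate. The values of β, γ, N, and the epoch count are hypothetical choices, not values from the lecture:

```python
import numpy as np

def sgd_with_monitoring(X, y, beta=1.0, gamma=50.0, N=1000, num_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    recent_costs, avg_costs = [], []
    t = 0                                           # global iteration counter
    for _ in range(num_epochs):
        for i in rng.permutation(X.shape[0]):
            error = X[i] @ theta - y[i]
            recent_costs.append(0.5 * error ** 2)   # cost BEFORE updating theta
            alpha = beta / (t + gamma)              # slowly decreasing learning rate
            theta -= alpha * error * X[i]
            t += 1
            if t % N == 0:                          # one plotting point every N examples
                avg_costs.append(np.mean(recent_costs))
                recent_costs = []
    return theta, avg_costs                         # plot avg_costs to check convergence
```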




3. Normal Equation

  • Closed-form solution for linear regression problems, which can be used to find the optimal parameters of a linear model (only applicable to the linear regression case).
  • Provides a way to compute the optimal parameter vector $\theta$ directly from the training data by solving an equation, without the need for an iterative optimization algorithm such as gradient descent.
  • Explicitly takes the derivatives of the cost function with respect to the $\theta_{j}$’s and solves by setting them to zero.


3.1. Matrix Derivatives

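For reference, the standard definition from the CS229 notes: for a function $f : \mathbb{R}^{m \times n} \mapsto \mathbb{R}$, the gradient of $f$ with respect to the matrix $A$ is the $m \times n$ matrix of partial derivatives:

     $ \nabla_{A} f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix} $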


3.2. Properties of $\nabla$ and Trace of Matrix

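The identities used in the derivation (standard facts, stated for matrices $A$, $B$, $C$ of compatible shapes and a scalar $a$):

     $ \mathrm{tr}\,AB = \mathrm{tr}\,BA \qquad \mathrm{tr}\,A = \mathrm{tr}\,A^{T} \qquad \mathrm{tr}\,a = a $

     $ \nabla_{A}\,\mathrm{tr}\,AB = B^{T} \qquad \nabla_{A^{T}} f(A) = (\nabla_{A} f(A))^{T} \qquad \nabla_{A}\,\mathrm{tr}\,ABA^{T}C = CAB + C^{T}AB^{T} $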


3.3. Least Mean Squares Solved with the Normal Equation

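With the design matrix $X$ (whose rows are the $x^{i}$) and target vector $\vec{y}$, the cost becomes $ J(\theta) = \frac{1}{2}(X\theta - \vec{y})^{T}(X\theta - \vec{y}) $; taking the gradient using the identities above and setting it to zero yields the normal equation and its closed-form solution:

     $ \nabla_{\theta} J(\theta) = X^{T}X\theta - X^{T}\vec{y} = 0 \quad \Rightarrow \quad \theta = (X^{T}X)^{-1}X^{T}\vec{y} $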

  • The amount of computation needed to solve the normal equation depends on n (the number of parameters): forming and inverting the $n \times n$ matrix $X^{T}X$ costs roughly $O(n^{3})$.
  • For datasets with a smaller number of parameters, solving the normal equation instead of running iterative gradient descent is more efficient.
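
A minimal NumPy sketch of the closed-form solve (this uses `np.linalg.solve` on $X^{T}X\,\theta = X^{T}y$ rather than forming the inverse explicitly, which is the numerically safer equivalent; the tiny dataset is made up):

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y directly (X is the (m, n+1) design matrix)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical tiny dataset: y = 1 + 2*x, with an all-ones intercept column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(normal_equation(X, y))  # ~[1. 2.]
```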