![[Stanford CS229 02] Locally Weighted Regression and Logistic Regression](/assets/img/ml/cs229.png)
[Stanford CS229 02] Locally Weighted Regression and Logistic Regression
- Locally Weighted Regression
- Probabilistic Interpretation (Maximum Log Likelihood)
- Logistic Regression
- Newton’s method
1. Locally Weighted Regression
- Fits a model to a dataset by giving more weight to the training points that are close to the point being predicted.
- Non-parametric learning algorithm: the amount of data and parameters you need to keep grows with the size of the dataset, whereas a parametric learning algorithm has a fixed set of parameters.
1.1. Cost function to minimize
$\normalsize \sum\limits_{i=1}^{m} \omega^{i}(y^{i} - \theta^{T}x^{i})^{2}$ where $\normalsize \omega^{i} = \exp\left(-\frac{(x^{i} - x)^{2}}{2}\right)$
- Weighting function $\omega^{i}$ : used to assign a weight to each training example based on its distance from the point being predicted.
- $x^{i} \,$ : data points that you’re processing
- $x \,$ : point of interest to be predicted
- Automatically gives more weight to the points close to $x$ (max weight = 1)
- Points far from the point of interest fade away with a vanishingly small weight $\omega^{i}$
- Locally fits an almost straight line centered at the point to be predicted.
1.2. $\normalsize \tau$ : bandwidth parameter
$\large \omega^{i} = \exp\left(-\frac{(x^{i} - x)^{2}}{2\tau^{2}}\right)$
- Weight term depends on the choice of $\large \tau$
- It controls how quickly the weight falls off with the distance of a data point from the point to be predicted: a small $\tau$ keeps only the nearest points, a large $\tau$ keeps a wide neighborhood.
- It is called the bandwidth parameter because it determines the width of the locally fitted linear region around the query point.
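The weighting scheme above can be sketched in NumPy as follows. This is a minimal illustration, not the lecture's reference implementation: the names `lwr_weights` and `lwr_predict`, the toy sine data, the chosen `tau` values, and the use of `np.linalg.solve` on the weighted normal equations are all assumptions made for the example.

```python
import numpy as np

def lwr_weights(X, x_query, tau):
    """Weights w_i = exp(-||x_i - x||^2 / (2 tau^2)) for every training row."""
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict at x_query by fitting a weighted least-squares line around it.

    X is an (m, n) design matrix whose first column is all ones (intercept),
    y is an (m,) target vector, x_query is an (n,) row in the same layout.
    """
    w = lwr_weights(X, x_query, tau)
    XtW = X.T * w  # same as X.T @ diag(w), shape (n, m)
    # Weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

# Hypothetical toy data: noise-free samples of a sine curve.
x = np.linspace(0.0, 6.0, 61)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x)
x_query = np.array([1.0, 3.0])  # intercept term, then x = 3.0
prediction = lwr_predict(X, y, x_query, tau=0.3)
```

Note that a new $\theta$ is solved for every query point, which is why the whole training set has to be kept around at prediction time. With a very small `tau` the weighted system can become ill-conditioned, so the sketch may need regularization on real data.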
2. Probabilistic Interpretation of Least Mean Squares
- Convert the problem from minimizing the error term to maximizing the probability of $y^{i}$ given $x^{i}\,$ parameterized by $\theta$
- Model each target as $y^{i} = \theta^{T}x^{i} + \epsilon^{i}$ and assume that the error terms $\epsilon^{i}$ are IID (independently and identically distributed)
- The Gaussian assumption on $\epsilon^{i}$ is usually justified by the Central Limit Theorem (CLT): noise that is the sum of many independent effects is approximately Gaussian
$\normalsize \epsilon^{i} \sim \mathcal{N}(\mu = 0,\,\sigma^{2})$
- This implies that :
- the distribution of $y^{i}$ given $x^{i}\,$ parameterized by $\theta$ follows a Gaussian distribution with mean $\theta^{T}x^{i}$ and variance $\sigma^{2}$
$\normalsize p(y^{i}\, | \,x^{i};\,\theta)\, = \,\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{i}\,-\theta^{T}x^{i})^{2}}{2\sigma^{2}}\right)$
When the data are held fixed and $\theta$ varies, the function $p(y^{i}\, | \,x^{i};\,\theta)$ can be viewed as the likelihood of $\theta$
$\normalsize L(\theta)\,=\,p(y^{i}\, | \,x^{i};\,\theta)$
2.1. Likelihood Function : $ L(\theta)$
As we’ve made an IID assumption, the likelihood for the entire training set can be computed as the product of the probabilities of each $y^{i}$: $\normalsize L(\theta) = \prod\limits_{i=1}^{m} p(y^{i}\, | \,x^{i};\,\theta)$
- Given this likelihood function, our problem turns into finding the $\theta$ that maximizes the probability of $y$ given $x$
As the function $L(\theta)$ contains exponential terms, we can simplify it by taking the log, $\ell(\theta) = \log L(\theta)$, which removes the exponentials and turns the product into a sum.
- Hence, maximizing $\ell(\theta)$ becomes the same as minimizing $\frac{1}{2}\sum\limits_{i=1}^{m}(y^{i}\,-\,\theta^{T}x^{i})^{2}$, which is the squared-error term we’ve seen before.
- To summarize, choosing $\theta$ by least squares on the error term ($\epsilon^{i}$) corresponds to finding the $\theta$ that maximizes the likelihood of the observed $y^{i}$ under the Gaussian noise assumption.
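For reference, the step from the likelihood to the squared error can be written out as follows. This is a sketch that only uses the Gaussian density and the IID assumption stated in this section:

$\normalsize \ell(\theta) = \log \prod\limits_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{i}-\theta^{T}x^{i})^{2}}{2\sigma^{2}}\right) = \sum\limits_{i=1}^{m} \log\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{i}-\theta^{T}x^{i})^{2}}{2\sigma^{2}}\right)\right]$

$\normalsize \ell(\theta) = m\log\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum\limits_{i=1}^{m}(y^{i}-\theta^{T}x^{i})^{2}$

The first term and the factor $\frac{1}{\sigma^{2}}$ do not depend on $\theta$, so the maximizer of $\ell(\theta)$ is the minimizer of the sum of squared errors, whatever the value of $\sigma^{2}$.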
3. Classification with Logistic Regression
- Logistic regression is used for the binary classification in which y takes only two discrete values, 0 and 1.
- LR models the probability that the $y^{i}$ takes on a particular value given the $x^{i} $ parameterized by $\theta$.
The logistic function maps the input values to a value between 0 and 1, representing the probability of $y^{i}$ taking the value 1.
To map the input values ($x$) to a probability, we change the form of the hypothesis function using the sigmoid function, which converts inputs ranging from negative to positive infinity into outputs between 0 and 1.
$ \normalsize h_{\theta}(x) = g(\theta^{T}x) = \large \frac{1}{1+e^{-\theta^{T}x}} $ where $\normalsize \, g(z)\,=\,\frac{1}{1+e^{-z}}$
- $g(z)$ goes toward 1 as $z$ goes to positive infinity and toward 0 as $z$ goes to negative infinity, so its output always lies strictly between 0 and 1
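A small NumPy sketch of $g(z)$ may make its shape concrete. The function name `sigmoid` and the sample inputs are chosen for illustration only:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sample inputs spanning negative to positive values.
z_values = np.linspace(-5.0, 5.0, 11)
g_values = sigmoid(z_values)
```

For inputs of very large magnitude `np.exp` can overflow in floating point, so a production implementation would likely use a numerically safer formulation.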
3.1. Maximum Likelihood Estimator
- To fit the best estimate of $\theta$, we need to define the likelihood function for the logistic classifier, as we did for linear regression.
Probability function: $h_{\theta}(x)$ when $y = 1$ and $1\,-\,h_{\theta}(x)$ when $y = 0$
$ P(y\,=\,1\, | x;\theta\,)\,=\,h_{\theta}(x) $
$ P(y\,=\,0\, | x;\theta\,)\,=\,1\,-\,h_{\theta}(x)$
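Both combined, $ P(y\,|\,x;\theta\,)\,=\,(h_{\theta}(x))^{y}\,(1\,-h_{\theta}(x))^{1-y} $
- Since the data points are IID, the likelihood for the entire dataset equals the product of the probabilities of each $y^{i}$: $L(\theta) = \prod\limits_{i=1}^{m} (h_{\theta}(x^{i}))^{y^{i}}\,(1-h_{\theta}(x^{i}))^{1-y^{i}}$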
Log likelihood for easier optimization:
$\normalsize \ell(\theta)\,=\,\log L(\theta) = \sum\limits_{i=1}^{m}\,y^{i}\,\log(h_{\theta}(x^{i}))\,+\,(1-y^{i})\,\log(1\,-\,h_{\theta}(x^{i}))$
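As a rough illustration, the log likelihood can be evaluated for a given $\theta$ like this. The function name `log_likelihood` and the four-point toy dataset are assumptions for the example, not part of the lecture:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x_i) for every row
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Hypothetical toy data: intercept column plus one feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
ll_zero = log_likelihood(np.zeros(2), X, y)              # every h equals 0.5
ll_better = log_likelihood(np.array([0.0, 1.0]), X, y)   # slope agrees with labels
```

With $\theta = 0$ every prediction is 0.5, so the value is $m\log 0.5$. A $\theta$ whose slope agrees with the labels gives a higher (less negative) log likelihood.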
3.2. Maximization with Gradient Ascent
$\normalsize \theta_{j} := \theta_{j} \,\, + \,\, \alpha\frac{\partial \ell(\theta)}{\partial \theta_{j}}$
- The explicit update, $\theta_{j} := \theta_{j} + \alpha\sum\limits_{i=1}^{m}(y^{i} - h_{\theta}(x^{i}))\,x^{i}_{j}$, looks almost identical to the gradient descent update for linear regression, but the hypothesis function ($h_{\theta}(x)$) is different.
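A minimal batch gradient ascent sketch for this update, using the standard gradient $\sum_{i}(y^{i} - h_{\theta}(x^{i}))\,x^{i}_{j}$ of the logistic log likelihood. The function names, the learning rate, the iteration count, and the six-point dataset are hypothetical choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gradient_ascent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient ascent on the logistic log likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta))  # dl/dtheta_j for every j
        theta = theta + alpha * grad           # ascent: step along the gradient
    return theta

# Hypothetical, deliberately non-separable toy data so the optimum is finite.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta_ga = fit_gradient_ascent(X, y)
grad_at_solution = X.T @ (y - sigmoid(X @ theta_ga))
```

The sign is the only difference from gradient descent: the step is added because the log likelihood is being maximized. On perfectly separable data the parameters would keep growing, so the sketch uses labels that overlap.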
4. Newton’s Algorithm
- Newton’s algorithm, also known as the Newton-Raphson method, is an iterative numerical method for finding the roots of a differentiable function (root : the point where $ f(x)\, = \,0$).
Applied to maximum likelihood, it finds the root of the first derivative of the log likelihood function ($\ell'(\theta)$) using the second derivative: $\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$
- Set the initial $\theta$ to some value (for example random or zero) and approximate the function by the line tangent to it at the current guess of $\theta$
- Solve for the point where that linear function equals zero, and take it as the next guess.
- Repeat the two steps above until $\theta$ converges
- The advantage of Newton’s method is that it typically needs far fewer iterations to converge than gradient ascent.
- But each iteration is more expensive, and the cost grows with the number of parameters to fit, because the matrix of second derivatives (the Hessian) has to be computed and inverted.
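For a vector $\theta$, the update generalizes to $\theta := \theta - H^{-1}\nabla_{\theta}\ell(\theta)$, where $H$ is the Hessian of $\ell$. The sketch below applies it to logistic regression. The function names, the fixed iteration count, and the toy data are assumptions for illustration, and `np.linalg.solve` is used instead of forming the inverse explicitly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, n_iters=10):
    """Newton's method on the logistic log likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                    # first derivative of l(theta)
        hessian = -(X.T * (h * (1.0 - h))) @ X  # second derivative, shape (n, n)
        # theta := theta - H^{-1} grad, solved as a linear system
        theta = theta - np.linalg.solve(hessian, grad)
    return theta

# Hypothetical, non-separable toy data (intercept column plus one feature).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta_newton = fit_newton(X, y)
grad_at_solution = X.T @ (y - sigmoid(X @ theta_newton))
```

Here ten iterations are enough to drive the gradient to essentially zero, whereas plain gradient ascent typically needs many more steps. The price is the $n \times n$ linear solve in every iteration.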