[Stanford CS229 03] Generalized Linear Model (GLM) and Softmax Regression
- Exponential Family
- Generalized Linear Models
- Softmax Regression (Multiclass Classification)
1. Exponential Family
A distribution is said to belong to the exponential family if its probability density function (pdf) or probability mass function (pmf) can be expressed in the following form:
$\large f(y\, |\ x\,;\theta) = b(y)\,e^{(\eta^{T}T(y)\, - \,a(\eta))} $
- y : response variable
- $\eta$ : natural parameter (a function of the distribution's usual parameters; in a GLM it is tied to the input through $\eta = \theta^{T}x$)
- $T(y)$ : sufficient statistic (a function of y, in most cases simply $T(y) = y$)
- $b(y)$ : base measure
- $a(\eta)$ : log partition function (normalizing term that makes the integral over the entire domain equal to 1)
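As a quick sanity check of the form above, here is a minimal Python sketch that evaluates $b(y)\,e^{\eta T(y) - a(\eta)}$ for the Bernoulli and Poisson distributions. The helper name `exp_family_pmf` and the parameter values `phi = 0.3` and `lam = 2.5` are assumptions made for illustration; the choices of $\eta$, $T$, $a$ and $b$ are the standard ones for these two distributions.

```python
import math

def exp_family_pmf(y, eta, b, T, a):
    """Evaluate b(y) * exp(eta * T(y) - a(eta)) for a scalar natural parameter."""
    return b(y) * math.exp(eta * T(y) - a(eta))

# Bernoulli(phi): eta = log(phi / (1 - phi)), T(y) = y,
# a(eta) = log(1 + e^eta), b(y) = 1
phi = 0.3  # illustrative value
eta_bern = math.log(phi / (1 - phi))
bern = [
    exp_family_pmf(y, eta_bern,
                   b=lambda y: 1.0,
                   T=lambda y: y,
                   a=lambda e: math.log1p(math.exp(e)))
    for y in (0, 1)
]

# Poisson(lam): eta = log(lam), T(y) = y, a(eta) = e^eta, b(y) = 1 / y!
lam = 2.5  # illustrative value
eta_pois = math.log(lam)
pois = [
    exp_family_pmf(y, eta_pois,
                   b=lambda y: 1.0 / math.factorial(y),
                   T=lambda y: y,
                   a=math.exp)
    for y in range(5)
]

# The textbook Poisson pmf, for comparison
pois_direct = [lam**y * math.exp(-lam) / math.factorial(y) for y in range(5)]
```

Here `bern` should reproduce the probabilities $1-\phi$ and $\phi$, and `pois` should agree with `pois_direct` up to floating-point error.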
Sufficient Statistics $T(y)$
- a function of the data that holds all the information needed to estimate the parameters of interest in a statistical model
- if a sufficient statistic is available, the parameter of interest can be estimated without using the full data
- for exponential family distributions, it can be read directly off the density as $T(y)$
- a GLM uses only this sufficient statistic when optimizing its parameters
1.1. Probability Distribution within Exponential Family
The exponential family includes a wide range of commonly used probability distributions, such as the normal distribution, Poisson distribution, gamma distribution, and binomial distribution.
Each probability distribution is suited to a particular type of data:
- Gaussian : real numbers
- Bernoulli : binary values (0 or 1)
- Poisson : non-negative integers (counts)
- Gamma or Exponential : positive real numbers
1.1.1. Gaussian Distribution
1.1.2. Bernoulli Distribution
1.1.3. Poisson Distribution
2. Generalized Linear Models (GLMs)
- Extend the linear regression model to handle response data that is not normally distributed, such as binary or discrete count data
- To use a GLM, the response variable (y) is assumed to follow a distribution from the exponential family
- A link function relates the mean of the non-normal response variable y to the linear predictor (x parameterized by $\theta$); its inverse, the response function, maps the linear predictor back to the mean
- A GLM can be trained using maximum likelihood estimation or Bayesian methods, and the parameters of the model are estimated with numerical optimization algorithms.
2.1. Maximum-Likelihood Function of GLMs
2.1.1. Properties of GLM
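- The log likelihood is a concave function of $\eta$ (equivalently, the negative log likelihood is convex) -> any local optimum is the global optimum, so optimization cannot get stuck in a poor local optimum
- $E(T(y)) = \large \frac{\partial a(\eta)}{\partial \eta}$
- $V(T(y)) = \large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}}$ -> positive semi-definite (it is a covariance), which is why $a(\eta)$ is convex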
2.1.2. Mean and Variance of Sufficient Statistics with Derivatives of $a(\eta)$
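- The exponential family density is normalized by the log partition function $a(\eta)$ so that its integral equals 1: $\,\large \int g(\eta)\,b(y)\,e^{\eta^{T}T(y)}dy = 1\,$ (here, $\large g(\eta)\, =\, e^{-a(\eta)} $)
- take the derivative of this integral with respect to $\eta$
- this gives the relation $\,\,\, \large -\frac{\nabla g(\eta)}{g(\eta)}\, =\, \int T(y)g(\eta)b(y)e^{\eta^{T}T(y)}dy \,\, = E(T(y)) \,$, and since $\large -\frac{\nabla g(\eta)}{g(\eta)} = \nabla a(\eta)$, it follows that $\large E(T(y)) = \nabla a(\eta)$
- take the derivative of $E(T(y))$ with respect to $\eta$ once more to get $\large V(T(y)) = \frac{\partial^{2} a(\eta)}{\partial \eta^{2}}$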
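The two identities can be checked numerically. The sketch below uses the Bernoulli distribution as an assumed example, where $a(\eta) = \log(1 + e^{\eta})$ and $T(y) = y$, and compares finite-difference derivatives of $a(\eta)$ against the mean and variance computed directly from the pmf. The function names and the value `eta = 0.8` are illustrative choices.

```python
import math

def a(eta):
    # Log partition function of the Bernoulli distribution (assumed example)
    return math.log1p(math.exp(eta))

def bernoulli_moments(eta):
    # Mean and variance of T(y) = y, computed directly from the pmf
    phi = 1.0 / (1.0 + math.exp(-eta))
    return phi, phi * (1.0 - phi)

def central_differences(f, x, h=1e-4):
    # Finite-difference estimates of f'(x) and f''(x)
    first = (f(x + h) - f(x - h)) / (2 * h)
    second = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)
    return first, second

eta = 0.8  # illustrative value
mean, var = bernoulli_moments(eta)
da, d2a = central_differences(a, eta)
# da should be close to mean, and d2a close to var
```

Up to finite-difference error, `da` matches `mean` and `d2a` matches `var`, which is the scalar version of $E(T(y)) = \nabla a(\eta)$ and $V(T(y)) = \nabla^{2} a(\eta)$.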
2.1.3. Maximizing Log Likelihood of GLM
Take the derivative of the log likelihood with respect to $\eta$ and set it to 0 (the maximum point of a concave function).
- solving the equation $\large \,\nabla a(\eta) = \frac{1}{N} \sum \limits_{i=1}^{N} T(y^{i})$ gives the natural parameter $\eta$ that maximizes the likelihood
- Hence, you only need to keep the sum of the sufficient statistics (and N) for the learning process, instead of storing the full data.
- as N (size of sample) goes to infinity, the sample average $\large \frac{1}{N} \sum \limits_{i=1}^{N} T(y^{i})$ converges to $\large E(T(y))$, so $\large \nabla a(\eta)$ at the estimated $\eta$ approaches the true $\large E(T(y))$
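A minimal sketch of this recipe for the Poisson distribution, where $T(y) = y$ and $a(\eta) = e^{\eta}$, so the equation $\nabla a(\eta) = \frac{1}{N}\sum_i T(y^{i})$ has the closed-form solution $\eta = \log(\text{sample mean})$. The function names and the toy counts in `ys` are made up for illustration.

```python
import math

def poisson_log_likelihood(eta, ys):
    # sum_i [ log b(y_i) + eta * T(y_i) - a(eta) ]
    # with T(y) = y, a(eta) = e^eta, b(y) = 1 / y!
    return sum(-math.lgamma(y + 1) + eta * y - math.exp(eta) for y in ys)

def fit_eta_from_sufficient_stat(sum_t, n):
    # Solve grad a(eta) = e^eta = (1/N) * sum_i T(y_i) for eta
    return math.log(sum_t / n)

ys = [2, 0, 3, 1, 4, 2, 2, 1]   # toy counts, made up for illustration
sum_t, n = sum(ys), len(ys)     # the only quantities the fit needs
eta_hat = fit_eta_from_sufficient_stat(sum_t, n)

# The log likelihood at eta_hat is at least as large as at nearby values
ll_hat = poisson_log_likelihood(eta_hat, ys)
ll_left = poisson_log_likelihood(eta_hat - 0.01, ys)
ll_right = poisson_log_likelihood(eta_hat + 0.01, ys)
```

Note that `fit_eta_from_sufficient_stat` never sees the individual observations, only their sum and count, which is the point made in the bullets above.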
Design Choices for GLM in Machine Learning
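- the response variable $y$ given $x$ follows an exponential family distribution with natural parameter $\eta$
- the natural parameter is linear in the input: $\large \eta = \theta^{T}x$
- the output is the expected value of the response: $\,\,\large E(y\, |\ \,x;\theta) = h_{\theta}(x)$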
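To make the three design choices concrete, here is a small Python sketch of the resulting hypothesis $h_{\theta}(x)$ for three families. The dictionary `RESPONSE` and the function `glm_predict` are hypothetical names; the Gaussian case assumes unit variance so that $\eta$ equals the mean.

```python
import math

# Response functions mapping eta to E[y; eta] for three families
RESPONSE = {
    "gaussian": lambda eta: eta,                            # linear regression
    "bernoulli": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # logistic regression
    "poisson": lambda eta: math.exp(eta),                   # Poisson regression
}

def glm_predict(theta, x, family):
    # Design choice 2: eta = theta^T x
    eta = sum(t * xi for t, xi in zip(theta, x))
    # Design choice 3: h_theta(x) = E[y | x; theta]
    return RESPONSE[family](eta)

theta = [0.5, -1.0]  # illustrative parameters
x = [2.0, 1.0]       # illustrative input, so eta = 0
preds = {family: glm_predict(theta, x, family) for family in RESPONSE}
```

With $\eta = 0$ the three models predict 0 (Gaussian), 0.5 (Bernoulli) and 1 (Poisson): the same linear predictor, passed through a different response function depending on the assumed distribution of y.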
3. Softmax Regression (Multiclass Classification)
- Also known as Multinomial Logistic Regression; a supervised learning algorithm used for classification problems where the output variable is categorical with more than two possible outcomes
- Estimates the conditional probability distribution of the output variable (class) given the input variables
- Output variables $Y = \{y_{1}, y_{2}, \dots, y_{k}, \dots, y_{N}\} $, where each $y_{k}$ represents the probability that the given input $x$ belongs to the corresponding category k
- $\large \sum\limits_{k=1}^{N}\, y_{k}\, = \,1\,\,$ (N : number of categories)
3.1. Softmax Function ($h_{\theta}(x)$)
Transforms a vector of real numbers (the scores $z$ computed from the input variables) into a probability distribution (output) by exponentiating each value and dividing by the sum of the exponentials:
$\large p(y^{i}_{k}\, |\ x^{i} ; \theta) = \frac{e^{z^{i}_{k}}}{\sum\limits_{j=1}^{N} \, e^{z^{i}_{j}}}$ $(here,\; z^{i}_{k} = \theta_{k}^{T}x^{i})$
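A minimal Python sketch of the softmax function, assuming one parameter vector per class so that each class gets its own score. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the notes; it leaves the result unchanged because the common factor cancels. The names `softmax` and `softmax_regression_probs` are illustrative.

```python
import math

def softmax(z):
    # Subtracting max(z) does not change the result but avoids overflow in exp
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_regression_probs(thetas, x):
    # thetas: one parameter vector per class; score for class k is theta_k^T x
    z = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in thetas]
    return softmax(z)

probs = softmax([1.0, 2.0, 3.0])         # larger score -> larger probability
stable = softmax([1000.0, 1000.0])       # would overflow without the max shift
class_probs = softmax_regression_probs(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # illustrative parameters, 3 classes
    [0.5, -0.5],                            # illustrative input
)
```

The outputs are non-negative and sum to 1, so they can be read as class probabilities.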
3.2. Cost for softmax regression : Cross - Entropy
Essentially the same as the cost function (logistic cost) for binary classification, extended to N categories
$\large CE(\hat{y}, y) = -\sum\limits_{k=1}^{N}y_{k}\log(\hat{y}_{k}) $
- $\hat{y}^{i}_{k}\, $ : predicted probability for category k
- $y^{i}_{k}$ : true label (1 for the correct category and 0 for the others, i.e. one-hot encoded)
- penalizes the model when the predicted probability is low for the correct category
- encourages the model to assign high probabilities to the correct classes and low probabilities to the incorrect classes.
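A short Python sketch of the cross-entropy cost for a single example with a one-hot label. The clipping constant `eps` is an added assumption that guards against `log(0)`; the example probabilities are made up for illustration.

```python
import math

def cross_entropy(y_hat, y, eps=1e-12):
    # y: one-hot true label, y_hat: predicted class probabilities
    # Only the term for the correct class contributes, since y_k = 0 elsewhere
    return -sum(yk * math.log(max(pk, eps)) for yk, pk in zip(y, y_hat))

y = [1, 0, 0]                                      # correct class is the first one
confident = cross_entropy([0.7, 0.2, 0.1], y)      # equals -log(0.7)
unsure = cross_entropy([0.4, 0.3, 0.3], y)         # equals -log(0.4)
wrong = cross_entropy([0.1, 0.8, 0.1], y)          # equals -log(0.1)
```

The cost grows as the probability assigned to the correct class shrinks, so `confident < unsure < wrong`.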