[Stanford CS229 03] Generalized Linear Model (GLM) and Softmax Regression
OUTLINE
- Exponential Family
- Generalized Linear Models
- Softmax Regression (Multiclass Classification)
1. Exponential Family
- A distribution is said to belong to the exponential family if its probability density function (pdf) or probability mass function (pmf) can be expressed in the following form:
$\large f(y \mid x; \theta) = b(y)\,e^{\eta^{T}T(y)\, - \,a(\eta)}$
- y : response variable
- $\eta$ : natural parameter, written as a function of the distribution's original parameter, $\eta = f(\theta)$ (this mapping is the link function)
- $T(y)$ : sufficient statistic (a function of y; in most cases simply T(y) = y)
- $b(y)$ : base measure
- $a(\eta)$ : log partition function (the normalizer that makes the integral over the entire domain equal 1)
- Sufficient Statistic $T(y)$
- a function that carries all the information in the data needed to estimate the parameters of interest in a statistical model
- if a sufficient statistic is available, one can estimate the parameter of interest without using the full data (e.g., for coin flips, the total number of heads $\sum_i y_i$ is sufficient for estimating the bias)
- every exponential-family distribution has one; it appears as $T(y)$ in the form above
- GLMs use only this sufficient statistic for the parameter-optimization process
1.1. Probability Distributions within the Exponential Family
- The exponential family includes a wide range of commonly used probability distributions, such as the normal, Poisson, gamma, and binomial distributions
- Each probability distribution is matched to a distinct data type
- Gaussian : real numbers
- Bernoulli : binary values
- Poisson : discrete counts (natural numbers)
- Gamma or Exponential : positive real numbers
1.1.1. Gaussian Distribution
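Writing the Gaussian pdf in the exponential-family form (taking $\sigma^{2} = 1$, as in the CS229 derivation, since $\sigma^{2}$ does not affect the final choice of $\theta$):

$$ p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(y-\mu)^{2}}{2}\right) = \underbrace{\frac{1}{\sqrt{2\pi}}e^{-y^{2}/2}}_{b(y)}\,\exp\!\left(\underbrace{\mu}_{\eta}\,\underbrace{y}_{T(y)} - \underbrace{\frac{\mu^{2}}{2}}_{a(\eta)}\right) $$

so $\eta = \mu$, $T(y) = y$, $a(\eta) = \eta^{2}/2$, and $b(y) = \frac{1}{\sqrt{2\pi}}e^{-y^{2}/2}$.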
1.1.2. Bernoulli Distribution
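Writing the Bernoulli pmf with mean $\phi$ in the exponential-family form:

$$ p(y;\phi) = \phi^{y}(1-\phi)^{1-y} = \exp\!\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right) $$

so $\eta = \log\frac{\phi}{1-\phi}$ (the logit), $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$, and $b(y) = 1$. Inverting the logit gives $\phi = \frac{1}{1+e^{-\eta}}$, the sigmoid function.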
1.1.3. Poisson Distribution
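Writing the Poisson pmf with rate $\lambda$ in the exponential-family form:

$$ p(y;\lambda) = \frac{\lambda^{y}e^{-\lambda}}{y!} = \frac{1}{y!}\exp\!\left(y\log\lambda - \lambda\right) $$

so $\eta = \log\lambda$, $T(y) = y$, $a(\eta) = e^{\eta}$, and $b(y) = \frac{1}{y!}$.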
2. Generalized Linear Models (GLMs)
- Extends the linear regression model to handle response types that are not normally distributed, such as binary outcomes or discrete count data
- To use a GLM, the response variable (y) is assumed to follow a distribution in the exponential family
- the exponential-family form comes with a link function (or response function) that connects the non-normal response variable y to the linear predictor (x parameterized by $\theta$)
- The GLM can be trained using maximum likelihood estimation or Bayesian methods, and the parameters of the model can be estimated using numerical optimization algorithms; a minimal sketch of such a training loop follows below
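As a minimal sketch of such a training loop (function and variable names here are illustrative, not from the lecture), here is gradient ascent on the log likelihood of a Poisson GLM in NumPy. For any GLM with $T(y) = y$ and $\eta = \theta^{T}x$, the gradient of the log likelihood takes the same familiar form as in linear and logistic regression, $(y - h_{\theta}(x))\,x$:

```python
import numpy as np

def fit_poisson_glm(X, y, lr=0.01, n_iters=5000):
    """Fit a Poisson GLM (eta = theta^T x, E[y|x] = exp(eta)) by gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        eta = X @ theta                          # natural parameter per example
        mu = np.exp(eta)                         # mean response: grad a(eta) = e^eta
        theta += lr * X.T @ (y - mu) / len(y)    # gradient of the mean log likelihood
    return theta

# toy usage: counts drawn from a known theta = [0.5, 1.0]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
print(fit_poisson_glm(X, y))                     # should come out close to [0.5, 1.0]
```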
2.1. Maximum-Likelihood Function of GLMs
2.1.1. Properties of GLM
- Convexity : the log likelihood is concave with respect to $\eta$ (equivalently, the negative log likelihood is convex), which guarantees that gradient-based optimization converges to the global optimum
- $E(T(y)) = \large \frac{\partial a(\eta)}{\partial \eta}$
- $V(T(y)) = \large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}}$ -> positive (semi)definite, since it is a covariance matrix
2.1.2. Mean and Variance of Sufficient Statistics from the Derivatives of $a(\eta)$
- $E(T(y))$
- the distribution is normalized by the log partition function $a(\eta)$ so that its integral equals 1
- take the derivative of this normalization integral with respect to $\eta$
- this gives the relation $\,\,\, \large -\frac{\nabla g(\eta)}{g(\eta)}\, =\, \int T(y)g(\eta)b(y)e^{\eta^{T}T(y)}dy \,\, = E(T(y)) \,$ (here, $\large g(\eta)\, =\, e^{-a(\eta)} $, so the left-hand side equals $\nabla a(\eta)$)
- $V(T(y))$
- take the derivative of $E(T(y)) = \nabla a(\eta)$ with respect to $\eta$ once more to get $\large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}} = V(T(y))$
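As a concrete check, for the Bernoulli distribution (see 1.1.2) $a(\eta) = \log(1+e^{\eta})$, and the two derivatives reproduce the familiar mean and variance:

$$ \frac{\partial a(\eta)}{\partial \eta} = \frac{e^{\eta}}{1+e^{\eta}} = \phi = E(y), \qquad \frac{\partial^{2} a(\eta)}{\partial \eta^{2}} = \frac{e^{\eta}}{(1+e^{\eta})^{2}} = \phi(1-\phi) = V(y) $$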
2.1.3. Maximizing Log Likelihood of GLM
- Take the derivative of the log likelihood with respect to $\eta$ and set it to 0 (the maximum point of a concave function)
- solving the equation $\large \,\nabla a(\eta) = \frac{1}{N} \sum \limits_{i=1}^{N} T(y^{(i)})$ gives the natural parameter $\eta$ that maximizes the likelihood of the GLM
- Hence, you only need to keep the sufficient-statistic term for the learning process, instead of storing the full data (see the sketch after this list)
- as N (the sample size) goes to infinity, the empirical mean $\large \frac{1}{N}\sum_{i} T(y^{(i)})$ converges to $\large E(T(y)) = \nabla a(\eta)$
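A minimal sketch of this moment-matching step for the Bernoulli case (names are illustrative), where $\nabla a(\eta) = \frac{e^{\eta}}{1+e^{\eta}}$ and the equation can be solved in closed form:

```python
import numpy as np

def bernoulli_mle_eta(y):
    """Solve grad a(eta) = mean(T(y)) for Bernoulli data.

    Here grad a(eta) = e^eta / (1 + e^eta) (the sigmoid), so the
    moment-matching equation inverts in closed form: eta = logit(mean(y)).
    Only the sufficient statistic sum(y) and the count N are needed,
    not the individual observations.
    """
    phi = y.sum() / len(y)               # empirical mean of T(y) = y
    return np.log(phi / (1 - phi))       # logit = inverse of the sigmoid

y = np.array([1, 0, 1, 1, 0, 1])
eta = bernoulli_mle_eta(y)
print(eta, 1 / (1 + np.exp(-eta)))       # eta ~ 0.693, recovered phi = 2/3
```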
2.2. Design Choices for GLMs in Machine Learning
- the response variable (y) follows a distribution in the exponential family
- $\large \eta = \theta^{T}x$
- the output (hypothesis) is $\,\,\large h_{\theta}(x) = E(y \mid x; \theta)$
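Combining these choices with a Bernoulli response recovers logistic regression: from 1.1.2, $\phi = \frac{1}{1+e^{-\eta}}$, and substituting $\eta = \theta^{T}x$ gives

$$ h_{\theta}(x) = E(y \mid x; \theta) = \phi = \frac{1}{1+e^{-\theta^{T}x}} $$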
3. Softmax Regression (Multiclass Classification)
- Known as Multinomial Logistic Regression, it is a supervised learning algorithm used for classification problems where the output variable is categorical with more than two possible outcomes
- Estimates the conditional probability distribution of the output variable (class) given the input variables
- Output variables $Y = \{y_{1}, y_{2}, \dots, y_{k}, \dots, y_{N}\}$, where each $y_{k}$ represents the probability that the given input $x$ belongs to the corresponding category k
- $\large \sum\limits_{k=1}^{N}\, y_{k}\, = \,1\,\,$ (N : number of categories)
3.1. Softmax Function ($h_{\theta}(x)$)
- Transforms a vector of real numbers (the class scores $z = \theta^{T}x$) into a probability distribution (the output) by exponentiating and normalizing the values
$\large p(y^{i}_{k} \mid x^{i}; \theta) = \frac{e^{z^{i}_{k}}}{\sum\limits_{j=1}^{N} \, e^{z^{i}_{j}}}$ $(here,\; z^{i} = \theta^{T}x^{i})$
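A minimal NumPy sketch (the max-subtraction is a standard numerical-stability trick; it cancels between numerator and denominator, so the output is unchanged):

```python
import numpy as np

def softmax(z):
    """Map a vector of class scores z = theta^T x to a probability distribution."""
    z = z - np.max(z)          # shift for numerical stability; softmax is shift-invariant
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; largest score -> largest prob
```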
3.2. Cost for Softmax Regression: Cross-Entropy
- essentially the same as the cost function (logistic cost) for binary classification
$\large CE(\hat{y}, y) = -\sum\limits_{k=1}^{N} y_{k}\log(\hat{y}_{k}) $
- $\hat{y}_{k}\,$ : predicted probability for category k
- $y_{k}$ : true label (1 for the correct category, 0 for the others)
- penalizes the model when the probability assigned to the correct category is low
- encourages the model to assign high probabilities to the correct classes and low probabilities to the incorrect classes
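A minimal sketch of the loss and one gradient step (names are illustrative; `Theta` is a parameter matrix with one column of scores per class, and `y_onehot` is the one-hot label). A convenient fact used below: for softmax combined with cross-entropy, the gradient with respect to the score vector $z$ simplifies to $\hat{y} - y$:

```python
import numpy as np

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    """CE(y_hat, y) = -sum_k y_k * log(y_hat_k); eps guards against log(0)."""
    return -np.sum(y_onehot * np.log(y_hat + eps))

def grad_step(Theta, x, y_onehot, lr=0.1):
    """One gradient-descent step on the cross-entropy for a single example."""
    z = Theta.T @ x                              # class scores, shape (N,)
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()                         # softmax probabilities
    Theta -= lr * np.outer(x, y_hat - y_onehot)  # d CE / d Theta = x (y_hat - y)^T
    return Theta, cross_entropy(y_onehot, y_hat)

Theta = np.zeros((3, 4))                         # 3 features, 4 classes
x = np.array([1.0, 0.5, -0.2])
y = np.array([0.0, 1.0, 0.0, 0.0])               # one-hot label: class 1
Theta, loss = grad_step(Theta, x, y)
print(loss)                                      # log(4) ~ 1.386 on the first step
```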