[Stanford CS229 03] Generalized Linear Model (GLM) and Softmax Regression

2023, Mar 05    


OUTLINE

  1. Exponential Family
  2. Generalized Linear Models
  3. Softmax Regression (Multiclass Classification)


1. Exponential Family


  • A distribution is said to belong to the exponential family if its probability density function (pdf) or probability mass function (pmf) can be expressed in the following form

         $\large f(y\,;\eta) = b(y)\,e^{\eta^{T}T(y)\, - \,a(\eta)} $

    • y : response variable
    • $\eta$ : natural parameter (linked to the model parameters $\theta$ through the link function)
    • $T(y)$ : sufficient statistic (a function of y; in most cases simply $T(y) = y$)
    • $b(y)$ : base measure
    • $a(\eta)$ : log partition function (the normalizer that makes the integral over the entire domain equal 1)


  • Sufficient Statistics $T(y)$

    • a function that holds all of the information in the data needed to estimate the parameters of interest in a statistical model
    • if a sufficient statistic is available, the parameter of interest can be estimated without using the full data (see the example below)
    • can be read off directly from the exponential-family form
    • GLMs use only this sufficient statistic in the parameter-optimization process
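    For example, for N Bernoulli observations the single count $\sum_{i=1}^{N} y^{(i)}$ is a sufficient statistic: the maximum-likelihood estimate $\hat{\phi} = \frac{1}{N}\sum_{i=1}^{N} y^{(i)}$ can be computed from that one number, without keeping the individual observations.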


1.1. Probability Distribution within Exponential Family


  • The exponential family includes a wide range of commonly used probability distributions, such as the normal distribution, Poisson distribution, gamma distribution, and binomial distribution

  • Each probability distribution is matched to a distinct type of response data

    • Gaussian : real numbers
    • Bernoulli : binary values (0 or 1)
    • Poisson : discrete, non-negative integers (counts)
    • Gamma or Exponential : positive real numbers


1.1.1. Gaussian Distribution


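  • With the variance fixed to 1 (the same simplification used in the CS229 notes), the Gaussian pdf rearranges into the exponential-family form:

         $\large p(y\,;\mu) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(y-\mu)^{2}}{2}} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^{2}}{2}}\cdot e^{\,\mu y\, -\, \frac{\mu^{2}}{2}}$

    • $b(y) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^{2}}{2}}$, $\;T(y) = y$, $\;\eta = \mu$, $\;a(\eta) = \frac{\eta^{2}}{2}$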


1.1.2. Bernoulli Distribution


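  • With $\phi$ the probability of $y = 1$, the Bernoulli pmf rearranges into the exponential-family form:

         $\large p(y\,;\phi) = \phi^{y}(1-\phi)^{1-y} = e^{\,y\log\frac{\phi}{1-\phi}\, +\, \log(1-\phi)}$

    • $b(y) = 1$, $\;T(y) = y$, $\;\eta = \log\frac{\phi}{1-\phi}$, $\;a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$
    • inverting the link gives $\phi = \frac{1}{1+e^{-\eta}}$, the sigmoid function used in logistic regression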


1.1.3. Poisson Distribution


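  • With rate $\lambda$, the Poisson pmf rearranges into the exponential-family form:

         $\large p(y\,;\lambda) = \frac{\lambda^{y}\,e^{-\lambda}}{y!} = \frac{1}{y!}\,e^{\,y\log\lambda\, -\, \lambda}$

    • $b(y) = \frac{1}{y!}$, $\;T(y) = y$, $\;\eta = \log\lambda$, $\;a(\eta) = e^{\eta} = \lambda$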




2. Generalized Linear Models (GLMs)


  • Extends the linear regression model to handle response types that do not follow a normal distribution, such as binary outcomes or discrete count data
  • To use a GLM, the response variable (y) is assumed to follow a distribution in the exponential family
  • the exponential-family form has a link function (or response function) that links the non-normal response variable y to the linear predictor (x parameterized by $\theta$, i.e., $\theta^{T}x$)
  • The GLM can be trained using maximum likelihood estimation or Bayesian methods, and the parameters of the model can be estimated using numerical optimization algorithms.


2.1. Maximum-Likelihood Function of GLMs


2.1.1. Properties of GLM


  • Convexity : the log likelihood is concave with respect to $\eta$ (equivalently, the negative log likelihood is convex) -> guarantees convergence to the global optimum

  • $E(T(y)) = \large \frac{\partial a(\eta)}{\partial \eta}$
  • $V(T(y)) = \large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}}$ -> positive definite, which is what makes the negative log likelihood convex
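
  As a quick check with the Bernoulli form from 1.1.2, where $a(\eta) = \log(1+e^{\eta})$:

    • $E(y) = \frac{\partial a(\eta)}{\partial \eta} = \frac{e^{\eta}}{1+e^{\eta}} = \phi$
    • $V(y) = \frac{\partial^{2} a(\eta)}{\partial \eta^{2}} = \phi(1-\phi)$, matching the known Bernoulli mean and variance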


2.1.2. Mean and Variance of Sufficient Statistics with Derivatives of $a(\eta)$


  • $E(T(y))$

    1. the density is normalized by the log partition function $a(\eta)$ so that its integral equals 1.
    2. take the derivative of this integral with respect to $\eta$
    3. this gives the relation $\,\,\, \large -\frac{\nabla g(\eta)}{g(\eta)}\, =\, \int T(y)g(\eta)b(y)e^{\eta^{T}T(y)}dy \,\, = E(T(y)) \,$ (here, $\large g(\eta)\, =\, e^{-a(\eta)}$, so $-\frac{\nabla g(\eta)}{g(\eta)} = \nabla a(\eta)$)

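    Written out, the derivation sketched above is:

         $\large \int g(\eta)\,b(y)\,e^{\eta^{T}T(y)}dy = 1 \;\;\Rightarrow\;\; \int b(y)\,e^{\eta^{T}T(y)}dy = \frac{1}{g(\eta)}$

    Differentiating both sides with respect to $\eta$ and then multiplying both sides by $g(\eta)$:

         $\large \int T(y)\,g(\eta)\,b(y)\,e^{\eta^{T}T(y)}dy = -\frac{\nabla g(\eta)}{g(\eta)} = \nabla a(\eta)$

    The left-hand side is exactly $E(T(y))$, hence $E(T(y)) = \nabla a(\eta)$.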

  • $V(T(y))$

    • take the derivative of $E(T(y))$ with respect to $\eta$ to get $\large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}}$

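    Written out, differentiating $E(T(y)) = \int T(y)\,b(y)\,e^{\eta^{T}T(y)-a(\eta)}dy = \nabla a(\eta)$ once more with respect to $\eta$:

         $\large \nabla^{2} a(\eta) = \int T(y)\left(T(y) - \nabla a(\eta)\right)^{T}b(y)\,e^{\eta^{T}T(y)-a(\eta)}dy = E(T(y)\,T(y)^{T}) - E(T(y))\,E(T(y))^{T} = V(T(y))$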


2.1.3. Maximizing Log Likelihood of GLM


  • take the derivative of the log likelihood with respect to $\eta$ and set it to 0 (the maximum point of a concave function)

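    Written out, with $\ell(\eta)$ the log likelihood of N i.i.d. observations:

         $\large \ell(\eta) = \sum\limits_{i=1}^{N}\left[\log b(y^{(i)}) + \eta^{T}T(y^{(i)}) - a(\eta)\right]$

         $\large \nabla_{\eta}\,\ell(\eta) = \sum\limits_{i=1}^{N}T(y^{(i)}) - N\,\nabla a(\eta) = 0$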

    • solving the equation $\large \,\nabla a(\eta) = \frac{1}{N} \sum \limits_{i}^{N} T(y^{(i)})$ gives the natural parameter $\eta$ that maximizes the likelihood of the GLM
    • Hence, only the sufficient-statistics term needs to be kept for the learning process, instead of the full data (see the numerical sketch below).
    • as N (the sample size) goes to infinity, $\large \frac{1}{N}\sum_{i} T(y^{(i)})$ converges to $\large E(T(y))$, so $\large \nabla a(\eta)$ is matched to the true mean of the sufficient statistic
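
A minimal numerical sketch of this idea, assuming a Poisson model so that $T(y) = y$ and $\nabla a(\eta) = e^{\eta}$ (variable names are illustrative):

```python
import numpy as np

# For the Poisson, T(y) = y and a(eta) = e^eta, so the MLE condition
# grad a(eta) = (1/N) * sum_i T(y_i) reduces to e^eta = mean(y).
rng = np.random.default_rng(0)
y = rng.poisson(lam=4.0, size=10_000)

sufficient_stat = y.sum()              # all that learning needs from the data
N = y.size
eta_hat = np.log(sufficient_stat / N)  # solve e^eta = (1/N) * sum_i y_i
lambda_hat = np.exp(eta_hat)           # back to the canonical rate lambda = e^eta

print(lambda_hat)                      # close to the true rate 4.0
```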


2.2. Design Choices for GLM in Machine Learning


  1. the response variable (y) follows a distribution in the exponential family
  2. $\large \eta = \theta^{T}x$
  3. output $\,\,\large E(y\, |\ \,x;\theta) = h_{\theta}(x)$

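With these design choices, picking the Bernoulli distribution recovers logistic regression ($h_{\theta}(x) = E(y\,|\,x;\theta) = \frac{1}{1+e^{-\theta^{T}x}}$), and picking the Poisson gives Poisson regression. Below is a minimal training sketch for the Poisson case, assuming $T(y) = y$ with the canonical link (under which the log-likelihood gradient $\sum_i (y^{(i)} - h_{\theta}(x^{(i)}))\,x^{(i)}$ takes the same form for any GLM); variable names are illustrative:

```python
import numpy as np

# Poisson regression as a GLM: eta = theta^T x, and the hypothesis is
# h_theta(x) = E[y | x; theta] = exp(theta^T x).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
theta_true = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ theta_true))   # synthetic count data

theta = np.zeros(2)
lr = 0.1
for _ in range(1000):
    h = np.exp(X @ theta)                 # E[y | x; theta] for the Poisson
    theta += lr * X.T @ (y - h) / len(y)  # gradient ascent on the log likelihood

print(theta)                              # approaches theta_true
```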




3. Softmax Regression (Multiclass Classification)


  • Also known as Multinomial Logistic Regression, a supervised learning algorithm used for classification problems where the output variable is categorical with more than two possible outcomes
  • Estimates the conditional probability distribution of the output variable (class) given the input variables
  • Output variables $Y = \{y_{1}, y_{2}, …, y_{k}, …, y_{N}\}$, where each $y_{k}$ represents the probability that the given input $x$ belongs to the corresponding category k
    • $\large \sum\limits_{k=1}^{N}\, y_{k}\, = \,1\,\,$ (N : number of categories)


3.1. Softmax Function ($h_{\theta}(x)$)


  • Transforms a vector of real-valued scores ($z = \theta^{T}x$) into a probability distribution (output) by exponentiating and normalizing the values

        $\large p(y^{i}_{k}\, |\ x^{i} ; \theta)$

         = $\large \frac{e^{z^{i}_{k}}}{\sum\limits_{j=1}^{N} \, e^{z^{i}_{j}}}$ $(here,\; z^{i} = \theta^{T}x^{i})$
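
A minimal sketch in NumPy (subtracting the maximum score before exponentiating is a standard overflow-avoidance trick; the shift cancels in the ratio, so the result is unchanged):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map a vector of scores z to a probability distribution."""
    shifted = z - z.max()        # stability trick: cancels in the ratio below
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])    # z = theta^T x, one score per category
probs = softmax(z)
print(probs, probs.sum())        # probabilities that sum to 1
```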


3.2. Cost for Softmax Regression : Cross-Entropy


  • essentially the same as the cost function (logistic cost) for binary classification

        $\large CE(\hat{y}, y) = -\sum\limits_{k=1}^{N}y_{k}\log(\hat{y}_{k}) $

    • $\hat{y}_{k}\, $ : predicted probability for category k
    • $y_{k}$ : true label (1 for the correct category and 0 for the others)
  • penalizes the model when the predicted probability for the correct category is low
  • encourages the model to assign high probabilities to the correct classes and low probabilities to the incorrect classes (see the sketch below).
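
A minimal sketch, assuming a one-hot label vector, so the sum reduces to $-\log$ of the probability assigned to the correct class:

```python
import numpy as np

def cross_entropy(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Cross-entropy cost for one example with a one-hot label y."""
    eps = 1e-12                        # guard against log(0)
    return float(-np.sum(y * np.log(y_hat + eps)))

y_hat = np.array([0.7, 0.2, 0.1])      # predicted probabilities (softmax output)
y = np.array([1.0, 0.0, 0.0])          # one-hot true label
print(cross_entropy(y_hat, y))         # = -log(0.7), about 0.357
```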