[Stanford CS229 03] Generalized Linear Model (GLM) and Softmax Regression
OUTLINE
- Exponential Family
- Generalized Linear Models
- Softmax Regression (Multiclass Classification)
1. Exponential Family
- A distribution is said to belong to the exponential family if its probability density function (pdf) or probability mass function (pmf) can be expressed in the following form:
$\large f(y \mid x; \theta) = b(y)\,e^{\eta^{T}T(y)\, - \,a(\eta)}$
- y : response variable
- $\eta$ : natural parameter, written as a function of the distribution's original parameter, $\eta = f(\theta)$ (this mapping is the link function)
- $T(y)$ : sufficient statistic (a function of y; in most cases simply T(y) = y)
- $b(y)$ : base measure
- $a(\eta)$ : log partition function (the normalizer that makes the integral over the entire domain equal 1)
- Sufficient Statistic $T(y)$
- a function that carries all the information in the data needed to estimate the parameters of interest in a statistical model
- if a sufficient statistic is available, one can estimate the parameter of interest without using the full data (e.g., for coin flips, the total number of heads $\sum_i y_i$ is sufficient for estimating the bias)
- every exponential-family distribution has one; it appears as $T(y)$ in the form above
- GLMs use only this sufficient statistic for the parameter-optimization process
1.1. Probability Distributions within the Exponential Family
- The exponential family includes a wide range of commonly used probability distributions, such as the normal, Poisson, gamma, and binomial distributions
- Each probability distribution is matched to a distinct data type
- Gaussian : real numbers
- Bernoulli : binary values
- Poisson : discrete counts (natural numbers)
- Gamma or Exponential : positive real numbers
1.1.1. Gaussian Distribution
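Writing the Gaussian pdf in the exponential-family form (taking $\sigma^{2} = 1$, as in the CS229 derivation, since $\sigma^{2}$ does not affect the final choice of $\theta$):

$$ p(y;\mu) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(y-\mu)^{2}}{2}\right) = \underbrace{\frac{1}{\sqrt{2\pi}}e^{-y^{2}/2}}_{b(y)}\,\exp\!\left(\underbrace{\mu}_{\eta}\,\underbrace{y}_{T(y)} - \underbrace{\frac{\mu^{2}}{2}}_{a(\eta)}\right) $$

so $\eta = \mu$, $T(y) = y$, $a(\eta) = \eta^{2}/2$, and $b(y) = \frac{1}{\sqrt{2\pi}}e^{-y^{2}/2}$.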
1.1.2. Bernoulli Distribution
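Writing the Bernoulli pmf with mean $\phi$ in the exponential-family form:

$$ p(y;\phi) = \phi^{y}(1-\phi)^{1-y} = \exp\!\left(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\right) $$

so $\eta = \log\frac{\phi}{1-\phi}$ (the logit), $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$, and $b(y) = 1$. Inverting the logit gives $\phi = \frac{1}{1+e^{-\eta}}$, the sigmoid function.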
1.1.3. Poisson Distribution
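Writing the Poisson pmf with rate $\lambda$ in the exponential-family form:

$$ p(y;\lambda) = \frac{\lambda^{y}e^{-\lambda}}{y!} = \frac{1}{y!}\exp\!\left(y\log\lambda - \lambda\right) $$

so $\eta = \log\lambda$, $T(y) = y$, $a(\eta) = e^{\eta}$, and $b(y) = \frac{1}{y!}$.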
2. Generalized Linear Models (GLMs)
- Extends the linear regression model to handle response types that are not normally distributed, such as binary outcomes or discrete count data
- To use a GLM, the response variable (y) is assumed to follow a distribution in the exponential family
- the exponential-family form comes with a link function (or response function) that connects the non-normal response variable y to the linear predictor (x parameterized by $\theta$)
- The GLM can be trained using maximum likelihood estimation or Bayesian methods, and the parameters of the model can be estimated using numerical optimization algorithms; a minimal sketch of such a training loop follows below
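As a minimal sketch of such a training loop (function and variable names here are illustrative, not from the lecture), here is gradient ascent on the log likelihood of a Poisson GLM in NumPy. For any GLM with $T(y) = y$ and $\eta = \theta^{T}x$, the gradient of the log likelihood takes the same familiar form as in linear and logistic regression, $(y - h_{\theta}(x))\,x$:

```python
import numpy as np

def fit_poisson_glm(X, y, lr=0.01, n_iters=5000):
    """Fit a Poisson GLM (eta = theta^T x, E[y|x] = exp(eta)) by gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        eta = X @ theta                          # natural parameter per example
        mu = np.exp(eta)                         # mean response: grad a(eta) = e^eta
        theta += lr * X.T @ (y - mu) / len(y)    # gradient of the mean log likelihood
    return theta

# toy usage: counts drawn from a known theta = [0.5, 1.0]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
print(fit_poisson_glm(X, y))                     # should come out close to [0.5, 1.0]
```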
2.1. Maximum-Likelihood Function of GLMs
2.1.1. Properties of GLM
- Convexity : the log likelihood is concave with respect to $\eta$ (equivalently, the negative log likelihood is convex), which guarantees that gradient-based optimization converges to the global optimum
- $E(T(y)) = \large \frac{\partial a(\eta)}{\partial \eta}$
- $V(T(y)) = \large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}}$ -> positive (semi)definite, since it is a covariance matrix
2.1.2. Mean and Variance of Sufficient Statistics from the Derivatives of $a(\eta)$
- $E(T(y))$
- the distribution is normalized by the log partition function $a(\eta)$ so that its integral equals 1
- take the derivative of this normalization integral with respect to $\eta$
- this gives the relation $\,\,\, \large -\frac{\nabla g(\eta)}{g(\eta)}\, =\, \int T(y)g(\eta)b(y)e^{\eta^{T}T(y)}dy \,\, = E(T(y)) \,$ (here, $\large g(\eta)\, =\, e^{-a(\eta)} $, so the left-hand side equals $\nabla a(\eta)$)
- $V(T(y))$
- take the derivative of $E(T(y)) = \nabla a(\eta)$ with respect to $\eta$ once more to get $\large \frac{\partial^{2} a(\eta)}{\partial \eta^{2}} = V(T(y))$
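As a concrete check, for the Bernoulli distribution (see 1.1.2) $a(\eta) = \log(1+e^{\eta})$, and the two derivatives reproduce the familiar mean and variance:

$$ \frac{\partial a(\eta)}{\partial \eta} = \frac{e^{\eta}}{1+e^{\eta}} = \phi = E(y), \qquad \frac{\partial^{2} a(\eta)}{\partial \eta^{2}} = \frac{e^{\eta}}{(1+e^{\eta})^{2}} = \phi(1-\phi) = V(y) $$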
2.1.3. Maximizing Log Likelihood of GLM
- Take the derivative of the log likelihood with respect to $\eta$ and set it to 0 (the maximum point of a concave function)
- solving the equation $\large \,\nabla a(\eta) = \frac{1}{N} \sum \limits_{i=1}^{N} T(y^{(i)})$ gives the natural parameter $\eta$ that maximizes the likelihood of the GLM
- Hence, you only need to keep the sufficient-statistic term for the learning process, instead of storing the full data (see the sketch after this list)
- as N (the sample size) goes to infinity, the empirical mean $\large \frac{1}{N}\sum_{i} T(y^{(i)})$ converges to $\large E(T(y)) = \nabla a(\eta)$
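A minimal sketch of this moment-matching step for the Bernoulli case (names are illustrative), where $\nabla a(\eta) = \frac{e^{\eta}}{1+e^{\eta}}$ and the equation can be solved in closed form:

```python
import numpy as np

def bernoulli_mle_eta(y):
    """Solve grad a(eta) = mean(T(y)) for Bernoulli data.

    Here grad a(eta) = e^eta / (1 + e^eta) (the sigmoid), so the
    moment-matching equation inverts in closed form: eta = logit(mean(y)).
    Only the sufficient statistic sum(y) and the count N are needed,
    not the individual observations.
    """
    phi = y.sum() / len(y)               # empirical mean of T(y) = y
    return np.log(phi / (1 - phi))       # logit = inverse of the sigmoid

y = np.array([1, 0, 1, 1, 0, 1])
eta = bernoulli_mle_eta(y)
print(eta, 1 / (1 + np.exp(-eta)))       # eta ~ 0.693, recovered phi = 2/3
```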
2.2. Design Choices for GLMs in Machine Learning
- the response variable (y) follows a distribution in the exponential family
- $\large \eta = \theta^{T}x$
- the output (hypothesis) is $\,\,\large h_{\theta}(x) = E(y \mid x; \theta)$
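Combining these choices with a Bernoulli response recovers logistic regression: from 1.1.2, $\phi = \frac{1}{1+e^{-\eta}}$, and substituting $\eta = \theta^{T}x$ gives

$$ h_{\theta}(x) = E(y \mid x; \theta) = \phi = \frac{1}{1+e^{-\theta^{T}x}} $$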
3. Softmax Regression (Multiclass Classification)
- Known as Multinomial Logistic Regression, it is a supervised learning algorithm used for classification problems where the output variable is categorical with more than two possible outcomes
- Estimates the conditional probability distribution of the output variable (class) given the input variables
- Output variables $Y = \{y_{1}, y_{2}, \dots, y_{k}, \dots, y_{N}\}$, where each $y_{k}$ represents the probability that the given input $x$ belongs to the corresponding category k
- $\large \sum\limits_{k=1}^{N}\, y_{k}\, = \,1\,\,$ (N : number of categories)
3.1. Softmax Function ($h_{\theta}(x)$)
- Transforms a vector of real numbers (the class scores $z = \theta^{T}x$) into a probability distribution (the output) by exponentiating and normalizing the values
$\large p(y^{i}_{k} \mid x^{i}; \theta) = \frac{e^{z^{i}_{k}}}{\sum\limits_{j=1}^{N} \, e^{z^{i}_{j}}}$ $(here,\; z^{i} = \theta^{T}x^{i})$
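A minimal NumPy sketch (the max-subtraction is a standard numerical-stability trick; it cancels between numerator and denominator, so the output is unchanged):

```python
import numpy as np

def softmax(z):
    """Map a vector of class scores z = theta^T x to a probability distribution."""
    z = z - np.max(z)          # shift for numerical stability; softmax is shift-invariant
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1; largest score -> largest prob
```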
3.2. Cost for Softmax Regression: Cross-Entropy
- essentially the same as the cost function (logistic cost) for binary classification
$\large CE(\hat{y}, y) = -\sum\limits_{k=1}^{N} y_{k}\log(\hat{y}_{k}) $
- $\hat{y}_{k}\,$ : predicted probability for category k
- $y_{k}$ : true label (1 for the correct category, 0 for the others)
- penalizes the model when the probability assigned to the correct category is low
- encourages the model to assign high probabilities to the correct classes and low probabilities to the incorrect classes
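A minimal sketch of the loss and one gradient step (names are illustrative; `Theta` is a parameter matrix with one column of scores per class, and `y_onehot` is the one-hot label). A convenient fact used below: for softmax combined with cross-entropy, the gradient with respect to the score vector $z$ simplifies to $\hat{y} - y$:

```python
import numpy as np

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    """CE(y_hat, y) = -sum_k y_k * log(y_hat_k); eps guards against log(0)."""
    return -np.sum(y_onehot * np.log(y_hat + eps))

def grad_step(Theta, x, y_onehot, lr=0.1):
    """One gradient-descent step on the cross-entropy for a single example."""
    z = Theta.T @ x                              # class scores, shape (N,)
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()                         # softmax probabilities
    Theta -= lr * np.outer(x, y_hat - y_onehot)  # d CE / d Theta = x (y_hat - y)^T
    return Theta, cross_entropy(y_onehot, y_hat)

Theta = np.zeros((3, 4))                         # 3 features, 4 classes
x = np.array([1.0, 0.5, -0.2])
y = np.array([0.0, 1.0, 0.0, 0.0])               # one-hot label: class 1
Theta, loss = grad_step(Theta, x, y)
print(loss)                                      # log(4) ~ 1.386 on the first step
```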