[Stanford CS229 04] Generative Learning - GDA & Naive Bayes
OUTLINES
- Generative Learning Algorithms
- GDA
- Naive Bayes
1. Generative Learning Algorithms
Generative Learning Algorithm
- model the underlying distribution of the input features separately for each class (label $y$)
- first model $p(y)$ and $p(x \mid y)$, then use Bayes' rule to derive the posterior distribution of $y$ given $x$ (the rule is spelled out after this list)
- match a new example against each class model and pick the class $y$ that maximizes $p(y \mid x)$
- examples include Naive Bayes, Gaussian Discriminant Analysis (GDA), and Hidden Markov Models
Discriminative Learning Algorithm
- directly models the mapping from input features to the output, $p(y \mid x)$
- predicts the output from the input variables weighted by learned parameters
- no need to model the underlying distribution of the input space
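For reference, the Bayes' rule step that turns $p(x \mid y)$ and $p(y)$ into the posterior is
$\normalsize p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)} = \frac{p(x \mid y)\,p(y)}{\sum_{y'} p(x \mid y')\,p(y')}$
and since $p(x)$ does not depend on $y$, prediction reduces to $\arg\max_{y}\,p(x \mid y)\,p(y)$.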
2. Gaussian Discriminant Analysis
- as a generative learning algorithm, GDA assumes that $p(x \mid y)$ follows a multivariate normal distribution
2.1. Multivariate Normal Distribution
- $p(x \mid y)$ is parameterized by a mean vector and a covariance matrix
- Mean vector : $\normalsize \mu \in \mathbb{R}^{n}$
- Covariance matrix : $\normalsize \Sigma \in \mathbb{R}^{n \times n}$, symmetric and positive semi-definite ($\Sigma \succeq 0$)
$\normalsize p(x \mid y) \sim \mathcal{N}(\mu, \Sigma)$
$\normalsize p(x ; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x-\mu)\right)$
$\normalsize E[X] = \mu$
$\normalsize Cov(X) = E[(X - E[X])(X - E[X])^{T}] = E[XX^{T}] - E[X]E[X]^{T}$
- The density of the multivariate Gaussian distribution varies with $\Sigma$ and $\mu$ (see the NumPy sketch after this list)
- Diagonal entries of $\Sigma$ : determine how much the pdf is compressed or stretched along the direction parallel to each axis
- $\Sigma = I$ : standard normal distribution
- example pdfs with $\Sigma$ equal to $I$, $2I$, and $0.4I$, respectively (figures omitted)
- Off-diagonal entries (symmetric) : determine the compression toward the $45^{\circ}$ line between the axes of the two features
- example pdfs with $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$, respectively (figures omitted)
- varying $\mu$ moves the distribution along the axes
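A minimal NumPy/SciPy sketch of this (the covariance matrices below simply mirror the examples above) evaluates the density on a 2-D grid so the compression and stretching can be inspected or plotted:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Grid over a 2-D feature space
xs, ys = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))
grid = np.dstack([xs, ys])  # shape (61, 61, 2)

mu = np.zeros(2)
covariances = {
    "I":       np.eye(2),                          # standard normal
    "2I":      2.0 * np.eye(2),                    # more spread out
    "0.4I":    0.4 * np.eye(2),                    # more compressed
    "rho=0.5": np.array([[1.0, 0.5], [0.5, 1.0]]), # tilted toward the 45-degree line
    "rho=0.8": np.array([[1.0, 0.8], [0.8, 1.0]]), # tilted even more
}

for name, sigma in covariances.items():
    pdf = multivariate_normal(mean=mu, cov=sigma).pdf(grid)
    # Peak height grows as |Sigma| shrinks: 1 / ((2*pi)^{n/2} |Sigma|^{1/2})
    print(f"Sigma = {name:8s}  peak density = {pdf.max():.3f}")
```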
2.2. The Gaussian Discriminant Analysis (GDA) Model
- a classification problem in which the input features $x$ are continuous random variables that are normally distributed and $y \in \{0, 1\}$ follows a Bernoulli distribution
- GDA maximizes the joint log-likelihood, the log of the product of $p(x^{i}, y^{i} ; \phi, \mu_{0}, \mu_{1}, \Sigma)$ over the training set
$\normalsize \ell(\phi, \mu_{0}, \mu_{1}, \Sigma) = \log \prod_{i=1}^{m} p(x^{i}, y^{i} ; \phi, \mu_{0}, \mu_{1}, \Sigma)$
factoring each joint term as $p(x \mid y)\,p(y)$, this can be expressed as
$\normalsize \log \prod_{i=1}^{m} p(x^{i} \mid y^{i} ; \mu_{0}, \mu_{1}, \Sigma)\,p(y^{i} ; \phi)$
- each class-conditional distribution (class $y=0$ and $y=1$) and the class prior are modeled as follows,
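In the standard GDA setup these notes follow, the class prior and the two class-conditional densities (which share one covariance matrix $\Sigma$) are:
$\normalsize y \sim \mathrm{Bernoulli}(\phi), \quad p(y) = \phi^{y}(1-\phi)^{1-y}$
$\normalsize x \mid y=0 \sim \mathcal{N}(\mu_{0}, \Sigma), \quad p(x \mid y=0) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_{0})^{T}\Sigma^{-1}(x-\mu_{0})\right)$
$\normalsize x \mid y=1 \sim \mathcal{N}(\mu_{1}, \Sigma), \quad p(x \mid y=1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_{1})^{T}\Sigma^{-1}(x-\mu_{1})\right)$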
- the result of MLE : maximizing $\ell$ with respect to each parameter gives the following estimates,
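These are the standard closed-form maximum-likelihood estimates:
$\normalsize \phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{i} = 1\}$
$\normalsize \mu_{0} = \frac{\sum_{i=1}^{m} 1\{y^{i} = 0\}\,x^{i}}{\sum_{i=1}^{m} 1\{y^{i} = 0\}}, \quad \mu_{1} = \frac{\sum_{i=1}^{m} 1\{y^{i} = 1\}\,x^{i}}{\sum_{i=1}^{m} 1\{y^{i} = 1\}}$
$\normalsize \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{i} - \mu_{y^{i}})(x^{i} - \mu_{y^{i}})^{T}$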
- Predict : then, we can find the class of each example that maximizes the posterior probability
$\normalsize y^{i} = \arg\max_{y^{i}}\,p(y^{i} \mid x^{i}) = \arg\max_{y^{i}} \frac{p(x^{i} \mid y^{i})\,p(y^{i})}{p(x^{i})}$
$p(x^{i})$ is just a constant common to both classes, so the denominator can be ignored.
Hence, $\normalsize y^{i} = \arg\max_{y^{i}}\,p(x^{i} \mid y^{i})\,p(y^{i})$
- Pictorially, the algorithm fits one Gaussian to each class of the training data (the lecture figure showing the two sets of contours and the resulting decision boundary is omitted here)
- In summary, GDA models the distributions of the input features, $p(x \mid y=0)$ and $p(x \mid y=1)$, and computes $p(y^{i} \mid x^{i})$ as proportional to the product $p(x^{i} \mid y^{i})\,p(y^{i})$ using Bayes' rule (a NumPy sketch follows this list)
- Then it picks the most likely output, i.e., the class maximizing that probability
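A minimal NumPy sketch of the whole GDA procedure, assuming a binary label vector `y` and a feature matrix `X` (the function names here are illustrative, not from the lecture):

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLE for GDA with a shared covariance matrix."""
    m = X.shape[0]
    phi = np.mean(y == 1)                     # p(y=1)
    mu0 = X[y == 0].mean(axis=0)              # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)              # mean of class-1 examples
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / m         # shared covariance
    return phi, mu0, mu1, sigma

def gda_predict(X, phi, mu0, mu1, sigma):
    """Pick argmax_y p(x|y) p(y); p(x) cancels, and logs avoid underflow."""
    sigma_inv = np.linalg.inv(sigma)

    def log_gaussian(X, mu):
        # log N(x; mu, Sigma) up to the shared normalizer, which cancels
        # between the two classes because Sigma is shared
        d = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, sigma_inv, d)

    score0 = log_gaussian(X, mu0) + np.log(1 - phi)
    score1 = log_gaussian(X, mu1) + np.log(phi)
    return (score1 > score0).astype(int)

# Toy usage on synthetic 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = gda_fit(X, y)
print("training accuracy:", np.mean(gda_predict(X, *params) == y))
```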
2.3. GDA vs Logistic Regression
- if we view the quantity $p(y=1 \mid x ; \phi, \mu_{0}, \mu_{1}, \Sigma)$ as a function of $x$, it can actually be expressed in the following form,
$\normalsize p(y=1 \mid x ; \phi, \mu_{0}, \mu_{1}, \Sigma) = \frac{1}{1 + e^{-\theta^{T}x}}$ , where $\theta$ is an appropriate function of $\phi, \mu_{0}, \mu_{1}, \Sigma$
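Concretely, expanding $p(y=1 \mid x)$ with Bayes' rule and simplifying the two Gaussian exponents gives (using the convention $x_{0} = 1$, so the intercept is absorbed into $\theta$):
$\normalsize \theta_{1:n} = \Sigma^{-1}(\mu_{1} - \mu_{0}), \qquad \theta_{0} = \frac{1}{2}\left(\mu_{0}^{T}\Sigma^{-1}\mu_{0} - \mu_{1}^{T}\Sigma^{-1}\mu_{1}\right) + \log\frac{\phi}{1-\phi}$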
- the converse, however, is not true : $p(y \mid x)$ being logistic does not imply that $x \mid y$ is normally distributed
- this means that GDA makes a stronger modeling assumption than logistic regression; as long as the assumption is correct, GDA can make better predictions than logistic regression
- in contrast, logistic regression is less sensitive to incorrect modeling assumptions, so it is not significantly affected by the actual distribution of the data (for example, $x \mid y$ following a Poisson distribution also makes $p(y \mid x)$ logistic)
- to summarize, GDA is more data-efficient and fits the data better when the modeling assumptions are at least approximately correct
- logistic regression makes weaker assumptions and is therefore more robust to deviations from the modeling assumptions
3. Naive Bayes
- a probabilistic classifier based on applying Bayes' theorem with the strong Naive Bayes (NB) independence assumption between the features
- the NB assumption states that the input features are conditionally independent of each other given the class $y$, which is highly unlikely to hold exactly in reality (see the factorization after this list)
- the algorithm still works reasonably well even with this very "naive" assumption and offers a clear advantage in computational efficiency
- but for data where the input features are strongly correlated, the assumption significantly limits its accuracy
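Written out, the chain rule plus the NB assumption factorizes the class-conditional joint distribution into per-feature terms:
$\normalsize p(x_{1}, \dots, x_{n} \mid y) = \prod_{j=1}^{n} p(x_{j} \mid y, x_{1}, \dots, x_{j-1}) = \prod_{j=1}^{n} p(x_{j} \mid y)$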
3.1 Application of NB Algorithm as a Spam Classifier
- build a spam classifier that automatically classifies emails as spam or non-spam using the Naive Bayes algorithm
Training set :
- each email is labeled with 1 for spam ($y^{i} = 1$) and 0 for non-spam ($y^{i} = 0$)
- construct a feature vector whose length equals the number of words in the vocabulary dictionary, where the $j$th feature indicates whether the $j$th vocabulary word is present in the email ($x^{i}_{j} = 1$) or not ($x^{i}_{j} = 0$) - see the sketch below
$i$th email : $x^{i} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 1 \\ 0 \end{bmatrix}$
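A minimal sketch of this featurization, using a tiny illustrative `vocab` (not the lecture's actual dictionary):

```python
import numpy as np

# Illustrative vocabulary; the real one would be built from the email corpus
vocab = ["buy", "cheap", "now", "meeting", "lecture", "notes"]
word_to_idx = {w: j for j, w in enumerate(vocab)}

def email_to_features(email_text):
    """Binary vector: x_j = 1 iff the j-th vocab word appears in the email."""
    words = set(email_text.lower().split())
    x = np.zeros(len(vocab), dtype=int)
    for w in words & word_to_idx.keys():
        x[word_to_idx[w]] = 1
    return x

print(email_to_features("Buy cheap meds now"))   # -> [1 1 1 0 0 0]
```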
- model $\normalsize p(x \mid y)$ :
- use the NB assumption that the features are conditionally independent given the class
- Likelihood function
$\normalsize L(\phi_{y}, \phi_{j \mid y=0}, \phi_{j \mid y=1}) = \prod_{i=1}^{m} p(x^{i}, y^{i})$
$\normalsize p(x^{i}, y^{i}) = \left(\prod_{j=1}^{n} p(x^{i}_{j} \mid y^{i})\right) p(y^{i})$ , where each $p(x^{i}_{j} \mid y)$ and $p(y)$ follows a Bernoulli distribution
- MLE estimates : maximizing the likelihood gives the following parameter estimates,
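These are the standard estimates, i.e., the corresponding empirical frequencies over the training set:
$\normalsize \phi_{j \mid y=1} = \frac{\sum_{i=1}^{m} 1\{x^{i}_{j} = 1 \wedge y^{i} = 1\}}{\sum_{i=1}^{m} 1\{y^{i} = 1\}}$
$\normalsize \phi_{j \mid y=0} = \frac{\sum_{i=1}^{m} 1\{x^{i}_{j} = 1 \wedge y^{i} = 0\}}{\sum_{i=1}^{m} 1\{y^{i} = 0\}}$
$\normalsize \phi_{y} = \frac{\sum_{i=1}^{m} 1\{y^{i} = 1\}}{m}$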
- Prediction
- find $\arg\max_{y} p(y \mid x)$, i.e., compute $p(y=1 \mid x) \propto \left(\prod_{j} p(x_{j} \mid y=1)\right) p(y=1)$
- repeat for $y = 0$, and select the class with the larger probability (see the sketch below)
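A minimal end-to-end NumPy sketch of this classifier over binary bag-of-words features (variable and function names are illustrative):

```python
import numpy as np

def nb_fit(X, y):
    """MLE for Bernoulli Naive Bayes: per-feature frequencies within each class."""
    phi_y = np.mean(y == 1)                  # p(y=1)
    phi_j_y1 = X[y == 1].mean(axis=0)        # p(x_j=1 | y=1)
    phi_j_y0 = X[y == 0].mean(axis=0)        # p(x_j=1 | y=0)
    return phi_y, phi_j_y0, phi_j_y1

def nb_predict(X, phi_y, phi_j_y0, phi_j_y1, eps=1e-12):
    """argmax_y p(x|y) p(y), computed in log space for numerical stability."""
    def log_px_given_y(phi_j):
        p = np.clip(phi_j, eps, 1 - eps)     # guard against log(0)
        return X @ np.log(p) + (1 - X) @ np.log(1 - p)
    score1 = log_px_given_y(phi_j_y1) + np.log(phi_y)
    score0 = log_px_given_y(phi_j_y0) + np.log(1 - phi_y)
    return (score1 > score0).astype(int)

# Toy usage with random binary features
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = (rng.random((200, 6)) < np.where(y[:, None] == 1, 0.7, 0.2)).astype(int)
params = nb_fit(X, y)
print("training accuracy:", np.mean(nb_predict(X, *params) == y))
```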