[Stanford CS229 04] Generative Learning - GDA & Naive Bayes


2023, Mar 07    


OUTLINE

  • Generative Learning Algorithms
  • GDA
  • Naive Bayes


1. Generative Learning Algorithms


  • Generative Learning Algorithm
    • models the underlying distribution of the input features separately for each class (label $y$)
    • first models $p(y)$ and $p(x\, |\ \,y)$, then uses Bayes rule to derive the posterior distribution of $y$ given $x$
    • matches a new example against each class model and picks the class ($y$) that maximizes $p(y\, |\ x)$
    • examples include Naive Bayes, Gaussian Discriminant Analysis (GDA), and Hidden Markov Models


  • Discriminative Learning Algorithm
    • directly models the mapping from input features to the output value, $p(y\, |\ \,x)$
    • directly predicts the output from the input variables weighted by learned parameters
    • no need to model the underlying distribution of the input space




2. Gaussian Discriminant Analysis

  • as one of the generative learning algorithms, this model assumes that $p(x\, |\ \,y)$ follows a multivariate normal distribution


2.1. Multivariate Normal Distribution


  • $p(x |\ y)$ is parameterized by a mean vector and a covariance matrix

    • Mean vector : $\normalsize \mu\,\in\mathbb{R}^{n}$
    • Covariance matrix : $\normalsize \Sigma \in \mathbb{R}^{n \times n}$, where $\Sigma \geq 0$ is symmetric and positive semi-definite

        $\normalsize p(x\, |\ \,y)\, \sim \,N(\mu,\,\Sigma)$

        $\normalsize p(x ; \,\mu, \,\Sigma) = \frac{1}{(2\pi)^{n/2} |\ \Sigma |\ ^{1/2}}\,exp(-\frac{1}{2}\,(x - \mu)^{T}\,\Sigma^{-1}\,(x-\mu))$

        $\normalsize E(x)\, =\, \mu$

        $\normalsize \,Cov(X) = E((X\,-\,E(X))(X\,-\,E(X))^{T})\,=\, E(XX^{T}) - E(X)E(X)^{T}$

  • Density of the multivariate Gaussian distribution varies with $\Sigma$ and $\mu$ (a small numerical sketch of the density follows at the end of this subsection)

    • Diagonal entries of $\Sigma$ : determine how compressed or spread out the pdf is along the direction parallel to each axis
      • $\Sigma = I$ : standard normal distribution
      • the figures below show the pdf with $\Sigma$ equal to $I$, $2I$, and $0.4I$, respectively

      [Figure: contours of the Gaussian pdf with $\Sigma = I$, $2I$, and $0.4I$]

    • Off-diagonal entries (symmetric) : determine how strongly the pdf is compressed toward the $45^{\circ}$ line between the axes of the two features
      • $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$

    • varying $\mu$ shifts the center of the distribution along the axes
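
To make the density formula above concrete, here is a minimal NumPy sketch (my own, not from the lecture; function and variable names are assumptions) that evaluates $p(x;\,\mu,\,\Sigma)$ and shows how scaling the diagonal of $\Sigma$ changes the height of the peak at the mean:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate normal N(mu, Sigma) evaluated at x."""
    n = mu.shape[0]
    diff = x - mu
    # normalization constant (2*pi)^(n/2) * |Sigma|^(1/2)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return np.exp(exponent) / norm_const

# Densities at the mean for Sigma = I, 2I, 0.4I:
# larger diagonal entries spread the pdf out and lower its peak.
mu = np.zeros(2)
for scale in (1.0, 2.0, 0.4):
    print(scale, gaussian_pdf(mu, mu, scale * np.eye(2)))
```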


2.2. The Gaussian Discriminant Analysis (GDA) Model


  • classification problem in which the input features $x$ are continuous, normally distributed random variables and $y \in \{0, 1\}$ follows a Bernoulli distribution

    [Figure: the GDA model, $y \sim \mathrm{Bernoulli}(\phi)$, $x\, |\ \,y=0 \sim N(\mu_{0}, \Sigma)$, $x\, |\ \,y=1 \sim N(\mu_{1}, \Sigma)$]

  • tries to maximize the joint log-likelihood, i.e. the log of the product of $p(x^{i}, y^{i} ; \phi, \mu_{0}, \mu_{1}, \Sigma)$ over the training examples

       $\normalsize \ell(\phi, \mu_{0}, \mu_{1}, \Sigma) = \log\,\prod_{i=1}^{m}\, p(x^{i}, y^{i} ; \phi, \mu_{0}, \mu_{1}, \Sigma)$

       since $p(x, y) = p(x\, |\ \,y)\,p(y)$, this can be expressed as

       $\normalsize \log\,\prod_{i=1}^{m}\, p(x^{i}\, |\ \, y^{i} ; \mu_{0}, \mu_{1}, \Sigma)\,p(y^{i}\,;\,\phi)$

  • The distributions for each class ($y=0$ and $y=1$) are:

    [Figure: the densities $p(y)$, $p(x\, |\ \,y=0)$, and $p(x\, |\ \,y=1)$ written out explicitly]

  • the result of MLE : by maximizing $\ell$ with respect to each parameter, we find the best estimates of the parameters,

    [Figure: the closed-form MLE estimates of $\phi$, $\mu_{0}$, $\mu_{1}$, and $\Sigma$]

  • Predict : then, for a new example, we choose the class that maximizes the posterior probability

       $\normalsize y^{i} = \arg\max_{y^{i}}\,p(y^{i}\, |\ \,x^{i}) = \arg\max_{y^{i}}\,\large \frac{p(x^{i} |\ y^{i})\,p(y^{i})}{p(x^{i})}$

       $p(x^{i})$ is just a constant common to both classes, so we can ignore the denominator.

       Hence, $\normalsize y^{i} = \arg\max_{y^{i}}\,\large p(x^{i} |\ y^{i})\,p(y^{i})$

  • Pictorially, what the algorithm is actually doing can be seen as follows,

    [Figure: contours of the two fitted Gaussians and the resulting decision boundary]

    • In summary, GDA models the class-conditional distributions of the input features, $p(x |\ y=0)$ and $p(x |\ y=1)$, and computes $p(y^{i} |\ x^{i})$ as proportional to the product $p(x^{i} |\ y^{i})\, p(y^{i})$ using Bayes rule.
    • Then it predicts the most likely class by maximizing this probability, as sketched in the code below.
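
To make the MLE estimates and the argmax prediction concrete, here is a minimal NumPy sketch of how they could be implemented; the function names (`fit_gda`, `predict_gda`) and array shapes are my own assumptions, not code from the lecture.

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for the GDA parameters phi, mu0, mu1, and the shared Sigma."""
    m, n = X.shape
    phi = np.mean(y == 1)                      # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)               # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)               # mean of class-1 examples
    # shared covariance: average outer product of residuals from each example's class mean
    mus = np.where((y == 1)[:, None], mu1, mu0)
    diff = X - mus
    Sigma = diff.T @ diff / m
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Pick the class maximizing log p(x | y) + log p(y); p(x) is a common constant and is ignored.
    Because Sigma is shared, the Gaussian normalization constant also cancels between the classes."""
    Sigma_inv = np.linalg.inv(Sigma)
    def log_joint(mu, prior):
        d = X - mu
        quad = np.einsum('ij,jk,ik->i', d, Sigma_inv, d)   # (x - mu)^T Sigma^{-1} (x - mu) per row
        return -0.5 * quad + np.log(prior)
    return (log_joint(mu1, phi) > log_joint(mu0, 1 - phi)).astype(int)
```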


2.3. GDA vs Logistic Regression


  • If we view the quantity $p(y=1 \, |\ \, x \,;\, \phi, \mu_{0}, \mu_{1}, \Sigma)$ as a function of $x$, it can actually be expressed in the following form (a short derivation sketch is given at the end of this subsection),

       $p(y=1 \, |\ \, x \,;\, \phi, \mu_{0}, \mu_{1}, \Sigma)\,=\, \large \frac{1}{1\,+\,e^{-\theta^{T}x}}$   , where $\theta$ is an appropriate function of $\phi, \mu_{0}, \mu_{1}, \Sigma$

  • The converse, however, is not true: logistic regression does not imply that $x$ is normally distributed within each class.
    This means that GDA makes a stronger modeling assumption than logistic regression.
    Hence, as long as the assumption is correct, GDA can make better predictions than logistic regression.

  • In contrast, logistic regression is less sensitive to incorrect modeling assumptions, so it is not significantly affected by the actual distribution of the data (for example, if $x |\ y$ is Poisson, $p(y |\ x)$ is also logistic)

  • To summarize, GDA can be more efficient and fit the data better when the modeling assumptions are at least approximately correct.
    Logistic regression makes weaker assumptions and is thus more robust to deviations from the modeling assumptions.
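
To see where the logistic form comes from, here is a short derivation sketch (my own working under the shared-$\Sigma$ assumption above, not copied from the notes). Applying Bayes rule and dividing numerator and denominator by the numerator,

       $\normalsize p(y=1\, |\ \,x) = \large \frac{p(x |\ y=1)\,\phi}{p(x |\ y=1)\,\phi\,+\,p(x |\ y=0)\,(1-\phi)} = \frac{1}{1\,+\,e^{-(\theta^{T}x\,+\,\theta_{0})}}$

       The quadratic terms $x^{T}\Sigma^{-1}x$ in the two Gaussian exponents cancel because $\Sigma$ is shared, leaving

       $\normalsize \theta = \Sigma^{-1}(\mu_{1} - \mu_{0}), \qquad \theta_{0} = \frac{1}{2}\mu_{0}^{T}\Sigma^{-1}\mu_{0} - \frac{1}{2}\mu_{1}^{T}\Sigma^{-1}\mu_{1} + \log\frac{\phi}{1-\phi}$

       Adding an intercept feature $x_{0} = 1$ absorbs $\theta_{0}$ into $\theta$ and recovers the form $\frac{1}{1+e^{-\theta^{T}x}}$ above.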




3. Naive Bayes


  • A probabilistic classifier based on applying Bayes’ theorem with a strong (naive) independence assumption between the features
  • The NB assumption is that each input feature is conditionally independent of the others given $y$ (the class), which is highly unlikely in reality.
  • Still, the algorithm often works reasonably well even with this very “naive” assumption and provides a clear advantage in terms of computational efficiency.
  • But for data whose input features are strongly correlated, the assumption significantly limits its accuracy.


3.1. Application of the NB Algorithm as a Spam Classifier


  • build a spam classifier that automatically classifies an email as spam or non-spam using the Naive Bayes algorithm
  • Training set :
    • each email is labeled with 1 for spam ($y^{i} = 1$) and 0 for non-spam ($y^{i} = 0$)
    • construct a feature vector whose length equals the number of words in the vocabulary dictionary, where the $j$th feature indicates whether the $j$th vocabulary word is present in the email ($x^{i}_{j} = 1$) or not ($x^{i}_{j} = 0$), as in the small sketch after this example

       $i$th email : $x^{i} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 1 \\ 0 \end{bmatrix}$
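
To make the feature construction concrete, here is a tiny sketch with a made-up five-word vocabulary (my own toy example, not from the lecture):

```python
# toy vocabulary and email; both are illustrative assumptions
vocab = ["buy", "cheap", "hello", "meeting", "viagra"]

def email_to_features(email_text, vocab):
    """Binary feature vector: x_j = 1 if the j-th vocabulary word appears in the email."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(email_to_features("Buy cheap meds now", vocab))  # [1, 1, 0, 0, 0]
```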

  1. model $\normalsize p(x |\ y)$ :
    • use the NB assumption that the features are conditionally independent of each other given the class

    [Figure: the NB factorization $p(x\, |\ \,y) = \prod_{j=1}^{n} p(x_{j}\, |\ \,y)$]

  2. Likelihood function (take the log before maximizing)

       $\normalsize L(\phi_{y}, \phi_{(j |\ y=0)}, \phi_{(j |\ y=1)}) = \prod_{i=1}^{m} \,p(x^{i}, y^{i})$

       $\normalsize p(x^{i}, y^{i}) = \Big(\prod_{j=1}^{n}\,p(x^{i}_{j}\, |\ \,y^{i})\Big)\,p(y^{i})$   , where each $p(x_{j}\, |\ \,y)$ and $p(y)$ follows a Bernoulli distribution

  3. MLE estimates

    [Figure: the MLE estimates of $\phi_{y}$, $\phi_{(j |\ y=1)}$, and $\phi_{(j |\ y=0)}$]

  4. Prediction
    • find $\arg\max_{y}\,p(y |\ x)$

    [Figure: computing $p(y=1\, |\ \,x)$ from the NB factorization via Bayes rule]

    • compute the same quantity for $y = 0$ and select the class with the higher probability, as in the sketch below
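
As a companion to steps 3 and 4 above, here is a minimal NumPy sketch of the Bernoulli Naive Bayes fit and prediction; the function names are my own assumptions, not code from the lecture.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """MLE estimates phi_y, phi_{j|y=1}, phi_{j|y=0} from binary feature matrix X and labels y."""
    phi_y = np.mean(y == 1)               # fraction of spam emails
    phi_j_y1 = X[y == 1].mean(axis=0)     # fraction of spam emails containing word j
    phi_j_y0 = X[y == 0].mean(axis=0)     # fraction of non-spam emails containing word j
    return phi_y, phi_j_y1, phi_j_y0

def predict_naive_bayes(x, phi_y, phi_j_y1, phi_j_y0):
    """Compare p(x | y=1) p(y=1) against p(x | y=0) p(y=0) in log space for numerical stability."""
    x = np.asarray(x)
    log_p1 = np.sum(np.log(np.where(x == 1, phi_j_y1, 1 - phi_j_y1))) + np.log(phi_y)
    log_p0 = np.sum(np.log(np.where(x == 1, phi_j_y0, 1 - phi_j_y0))) + np.log(1 - phi_y)
    return 1 if log_p1 > log_p0 else 0
```

Note that if some word never appears in one of the classes, the plain MLE above gives a zero probability and hence $\log 0$; in practice the estimates are smoothed to avoid this.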