Multinomial Logistic, LDA, QDA
            Cornell College 
 STA 362 Spring 2024 Block 8
          
We had a logistic regression refresher
Now…
So far we have discussed logistic regression with two classes.
It is easily generalized to more than two classes.
Recall our Default data with variables default, student, and balance
What is going on here?
\[\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+\dots+\beta_pX_p\] \[p(X) = \frac{e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}\]
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | -10.8690452 | 0.4922555 | -22.080088 | 0.0000000 | 
| balance | 0.0057365 | 0.0002319 | 24.737563 | 0.0000000 | 
| income | 0.0000030 | 0.0000082 | 0.369815 | 0.7115203 | 
| studentYes | -0.6467758 | 0.2362525 | -2.737646 | 0.0061881 | 
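As a sanity check on the fitted model, we can plug the coefficients from the table above into the inverse-logit formula. A Python sketch (the course code is in R; the balance, income, and student values here are made up for illustration):

```python
import math

# Fitted coefficients from the table above (balance, income, student model).
b0, b_bal, b_inc, b_stu = -10.8690452, 0.0057365, 0.0000030, -0.6467758

def p_default(balance, income, student):
    """Predicted P(default = Yes) via p(X) = e^eta / (1 + e^eta)."""
    eta = b0 + b_bal * balance + b_inc * income + b_stu * student
    return math.exp(eta) / (1 + math.exp(eta))

# A student (student = 1) carrying a $1,500 balance with $40,000 income:
print(round(p_default(1500, 40000, 1), 4))  # → about 0.058
```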
Why is the student coefficient negative now, when it was positive before?

To generalize logistic regression to \(K\) classes, we model

\[P(Y=k|X) = \frac{e^{\beta_{0k}+\beta_{1k}X_1+\dots+\beta_{pk}X_p}}{\sum_{l=1}^K e^{\beta_{0l}+\beta_{1l}X_1+\dots+\beta_{pl}X_p}}\]
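A sketch of this softmax formula in Python, with hypothetical coefficients for \(K=3\) classes and one predictor (all numbers invented for illustration):

```python
import math

# Hypothetical per-class coefficients (beta_0k, beta_1k) for K = 3 classes.
betas = {"A": (0.0, 0.0), "B": (1.0, -0.5), "C": (-1.0, 0.3)}

def class_probs(x):
    """P(Y = k | X = x): exponentiate each class score, then normalize."""
    scores = {k: math.exp(b0 + b1 * x) for k, (b0, b1) in betas.items()}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

probs = class_probs(-2.0)
print(probs)                       # the K probabilities sum to 1
print(max(probs, key=probs.get))   # predicted class = largest probability
```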
To give us a general overview, we are going to watch the StatQuest video on the topic: https://www.youtube.com/watch?v=azXCzI57Yfc
Here the approach is to model the distribution of X in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(P(Y|X)\).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem:
\[P(Y=k|X=x) = \frac{P(X=x|Y=k)\cdot P(Y=k)}{P(X=x)}\]
\[P(Y=k|X=x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)} \text{, where}\]
\(f_k(x)=P(X=x|Y=k)\) is the density for \(X\) in class \(k\). Here we use normal densities, but they could be other distributions (such as \(\chi^2\))
\(\pi_k = P(Y=k)\) is the marginal or prior probability for class \(k\).
\[\pi_1=.5, \pi_2=.5\]
We classify a new point according to which density is highest.
When the priors are different, we take them into account as well, and compare \(\pi_kf_k(x)\).
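A numeric sketch of the two-class case in Python, assuming normal densities with a shared \(\sigma = 1\), made-up means \(\mu_1=-1\) and \(\mu_2=1.5\), and equal priors:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density f_k(x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Illustrative (assumed) class means, shared sigma, and equal priors.
mus, sigma = (-1.0, 1.5), 1.0
priors = (0.5, 0.5)

def posterior(x):
    """P(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    num = [p * normal_pdf(x, m, sigma) for p, m in zip(priors, mus)]
    return [n / sum(num) for n in num]

# With equal priors, the point midway between the means is the boundary:
print(posterior((-1.0 + 1.5) / 2))  # both posteriors equal 0.5
```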
On the right, we favor the pink class - the decision boundary has shifted to the left.
The Gaussian (normal) density has the form
\[f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k}e^{-\frac{1}{2}(\frac{x-\mu_k}{\sigma_k})^2}\]
\(\mu_k\) is the mean, \(\sigma_k^2\) the variance (in class \(k\))
For now, we assume \(\sigma_k=\sigma\) for all groups (we will need to check this with real data)
We plug this \(f_k(x)\) into Bayes formula and after some simplifying we get:
\[p_k(x) = \frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2}}{\sum_{l=1}^K\pi_l\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}\left(\frac{x-\mu_l}{\sigma}\right)^2}}\]
To classify at the value X = x, we need to see which of the \(p_k(x)\) is largest. Taking logs, and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning x to the class with the largest discriminant score:
\[\delta_k(x) = x\cdot \frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)\]
Importantly, \(\delta_k(x)\) is a linear function of \(x\).
If there are \(K=2\) classes and \(\pi_1=\pi_2=.5\), then the decision boundary is at
\[x=\frac{\mu_1+\mu_2}{2}\]
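We can verify this numerically. A Python sketch with assumed values \(\mu_1=-1\), \(\mu_2=1.5\), \(\sigma^2=1\), and equal priors:

```python
import math

def delta(x, mu, sigma2, pi):
    """Linear discriminant score delta_k(x) with shared variance sigma2."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + math.log(pi)

mu1, mu2, sigma2 = -1.0, 1.5, 1.0
boundary = (mu1 + mu2) / 2  # midpoint of the means when priors are equal

# At the boundary the two discriminant scores coincide:
print(delta(boundary, mu1, sigma2, 0.5), delta(boundary, mu2, sigma2, 0.5))
```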
What should we estimate \(\pi_k\), \(\mu_k\), and \(\sigma^2\) with?
\[\hat{\pi}_k = \frac{n_k}{n}\]
\[\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i\]
\[\hat{\sigma}^2 = \frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2 = \sum_{k=1}^K\frac{n_k-1}{n-K}\cdot \hat{\sigma}_k^2\]
Where \[\hat{\sigma}_k^2 = \frac{1}{n_k-1}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2\]
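These plug-in estimates are simple to compute directly. A Python sketch on a small made-up sample (two classes, one predictor):

```python
# Plug-in estimates for the LDA parameters on a tiny invented sample.
x = [1.0, 1.2, 0.8, 3.0, 3.4, 2.6, 3.2]
y = ["a", "a", "a", "b", "b", "b", "b"]
classes = sorted(set(y))
n, K = len(x), len(classes)

# pi_hat_k = n_k / n
pi_hat = {k: sum(yi == k for yi in y) / n for k in classes}
# mu_hat_k = class-specific sample mean
mu_hat = {k: sum(xi for xi, yi in zip(x, y) if yi == k) /
             sum(yi == k for yi in y) for k in classes}
# Pooled variance estimate with n - K degrees of freedom:
sigma2_hat = sum((xi - mu_hat[yi]) ** 2 for xi, yi in zip(x, y)) / (n - K)

print(pi_hat, mu_hat, round(sigma2_hat, 4))
```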
Can we determine the sex of a blue jay by measuring the distance from the tip of the bill to the back of the head (Head)?
What assumptions should we check?
Normality of our predictor
Constant variance between groups.
# A tibble: 2 × 2
  KnownSex Head_sd
  <fct>      <dbl>
1 F           1.27
2 M           1.25

Application Exercise

We are going to use the penguins data from the palmerpenguins package.
Conduct basic EDA to see if penguin bill length is different by sex and if LDA is an appropriate model choice.
Use LDA to predict penguin sex based on their bill length using training and testing data. Get the accuracy on the testing set.
Fit a logistic regression model and get its accuracy to see which did better.
When we have 2 or more predictors, the distribution becomes multivariate.
If the covariance structure of the predictors is the same within each class of the response, LDA is still appropriate.
The discriminant score is still linear in the predictors: \[\delta_k(x) = c_{k0} + c_{k1}x_1+\dots+c_{kp}x_p\]
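Concretely (a standard result for LDA with a shared covariance matrix \(\Sigma\) across classes), the coefficients come from:

\[\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)\]

so \(c_{k0} = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)\) and \((c_{k1},\dots,c_{kp})^T = \Sigma^{-1}\mu_k\).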
There is no limit on the number of levels of the categorical response for LDA
Suppose \(\pi_1=\pi_2=\pi_3=1/3\)
- The dashed lines are known as the Bayes decision boundaries
Once the estimates for the \(\hat{\delta}_k(x)\) have been found, we can plug them in and get:
\[\hat{P}(Y=k|X=x)=\frac{e^{\hat{\delta}_k(x)}}{\sum_{l=1}^Ke^{\hat{\delta}_l(x)}}\]
Classifying to the largest \(\hat{\delta}_k(x)\) amounts to classifying to the class for which \(\hat{P}(Y = k|X = x)\) is largest.
When \(K = 2\), we classify to class 2 if \(\hat{P}(Y = 2|X = x)\geq 0.5\), else to class 1.
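A Python sketch of this conversion from scores to probabilities (the max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import math

def posteriors(deltas):
    """Convert discriminant scores into class probabilities (softmax)."""
    m = max(deltas)                       # subtract max for numerical stability
    exps = [math.exp(d - m) for d in deltas]
    return [e / sum(exps) for e in exps]

# Two classes: we classify to class 2 exactly when its posterior is >= 0.5,
# which happens exactly when delta_2 >= delta_1 (toy scores below).
print(posteriors([-0.3, 0.9]))
```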
          Truth
Prediction   No  Yes
       No  2889   83
       Yes    2   26

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.972

Accuracy of 97.2% on a testing set!
What about the different types of errors?
Of the true Yes, we made errors at a rate of 83/(83+26) = 76%
Of the true No, we made errors at a rate of 2/(2889+2) = 0.069%
Recall:
False positive rate: The fraction of negative examples that are classified as positive.
False negative rate: The fraction of positive examples that are classified as negative.
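From the confusion matrix above, both error rates can be computed directly (a Python sketch; Yes is treated as the positive class):

```python
# Cell counts from the confusion matrix above (Yes = positive class).
tn, fn = 2889, 83   # predicted No:  2889 true No, 83 true Yes
fp, tp = 2, 26      # predicted Yes: 2 true No, 26 true Yes

false_negative_rate = fn / (fn + tp)   # positives classified as negative
false_positive_rate = fp / (fp + tn)   # negatives classified as positive
print(round(false_negative_rate, 3), round(false_positive_rate, 5))
```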
Remember the model gave probabilities, the final yes or no is decided by
\[\hat{P}(Default=Yes|Balance,Student) \geq .5\]
          Truth
Prediction   No  Yes
       No  2889   83
       Yes    2   26

In order to determine the best threshold, we want to maximize the specificity and sensitivity.
OR - maximize the sensitivity and minimize 1-specificity.
In order to do this we use an ROC curve, which stands for receiver operating characteristic curve.
We want to maximize the area under this curve, called ROC AUC
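A Python sketch of how the curve and its area come from sweeping the threshold over predicted probabilities (a toy set of scores, not the Default data):

```python
# Toy predicted probabilities and true labels (1 = positive), invented.
probs = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
truth = [1,   1,   0,   1,    1,   0,   0,   0]

points = []
for t in sorted(set(probs)) + [1.1]:   # one threshold per distinct score
    pred = [p >= t for p in probs]
    tp = sum(p and y for p, y in zip(pred, truth))
    fp = sum(p and not y for p, y in zip(pred, truth))
    points.append((fp / truth.count(0), tp / truth.count(1)))  # (FPR, TPR)

points.sort()
# Trapezoidal area under the (FPR, TPR) curve = ROC AUC:
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))  # → 0.875
```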
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.957

If we want to tune our model better, we can optimize the ROC AUC by changing the threshold.
I did a train/test approach here. What should I do different if I want to tune for threshold?
QDA arises when \(p>1\) and the covariance structure of the predictors differs between the levels of the response.
This introduces a squared term into the maximization problem of the \(\delta_k(x)\), thus the name.
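Concretely (a standard result), with a class-specific covariance matrix \(\Sigma_k\) the discriminant becomes quadratic in \(x\):

\[\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log(\pi_k)\]

Because \(\Sigma_k\) no longer cancels across classes, the \((x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\) term survives, and the decision boundaries are quadratic curves.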
qda_fit <- discrim_quad() |>
  set_mode("classification") |>
  set_engine("MASS") |>
  fit(default ~ balance + student,
      data = training(def_splits))

qda_fit |>
  augment(new_data = testing(def_splits)) |>
  conf_mat(truth = default, estimate = .pred_class)

          Truth
Prediction   No  Yes
       No  2888   80
       Yes    3   29

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.957