Logistic Regression
Cornell College
STA 362 Spring 2024 Block 8
Setup:
What are some examples of classification problems?
- eye color \(\in\) {blue, brown, green}
- email \(\in\) {spam, not spam}

set.seed(1)
Default |>
  sample_frac(size = 0.25) |>
  ggplot(aes(balance, income, color = default)) +
  geom_point(pch = 4) +
  scale_color_manual(values = c("cornflower blue", "red")) +
  theme_classic() +
  theme(legend.position = "top") -> p1
p2 <- ggplot(Default, aes(x = default, y = balance, fill = default)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cornflower blue", "red")) +
  theme_classic() +
  theme(legend.position = "none")
p3 <- ggplot(Default, aes(x = default, y = income, fill = default)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cornflower blue", "red")) +
  theme_classic() +
  theme(legend.position = "none")
grid.arrange(p1, p2, p3, ncol = 3, widths = c(2, 1, 1))

We can code Default as
\[Y = \begin{cases} 0 & \textrm{if }\texttt{No}\\ 1&\textrm{if }\texttt{Yes}\end{cases}\] Can we fit a linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat{Y}> 0.5\)?
What might do a better job?
Default <- Default |>
  mutate(
    p = glm(default ~ balance, data = Default, family = "binomial") |>
      predict(type = "response"),
    p2 = lm(I(default == "Yes") ~ balance, data = Default) |> predict(),
    def = ifelse(default == "Yes", 1, 0)
  )
Default |>
  sample_frac(0.25) |>
  ggplot(aes(balance, p2)) +
  geom_hline(yintercept = c(0, 1), lty = 2, size = 0.2) +
  geom_line(color = "cornflower blue") +
  geom_point(aes(balance, def), shape = "|", color = "orange") +
  theme_classic() +
  labs(y = "probability of default") -> p1
Default |>
  sample_frac(0.25) |>
  ggplot(aes(balance, p)) +
  geom_hline(yintercept = c(0, 1), lty = 2, size = 0.2) +
  geom_line(color = "cornflower blue") +
  geom_point(aes(balance, def), shape = "|", color = "orange") +
  theme_classic() +
  labs(y = "probability of default") -> p2
grid.arrange(p1, p2, ncol = 2)

Which does a better job at predicting the probability of default?
What if we have \(>2\) possible outcomes? For example, someone comes to the emergency room and we need to classify them according to their symptoms
\[ \begin{align} Y = \begin{cases} 1& \textrm{if }\texttt{stroke}\\2&\textrm{if }\texttt{drug overdose}\\3&\textrm{if }\texttt{epileptic seizure}\end{cases} \end{align} \]
What could go wrong here?
This coding implies an ordering of the outcomes and equal spacing between them (it says the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure), which generally won’t reflect reality.
For a binary outcome, logistic regression instead models \(p(X)\) with the form:
\[ p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}} \]
We can rearrange this into the following form:
\[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \]
What is this transformation called?
Logistic regression ensures that our estimates for \(p(X)\) are between 0 and 1 🎉
Refresher: How did we estimate \(\hat\beta\) in linear regression?
In logistic regression, we use maximum likelihood to estimate the parameters:
\[\ell(\beta_0,\beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i:y_i=0}(1-p(x_i))\]
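To see what maximizing this likelihood means, here is a minimal sketch that maximizes the log of the likelihood above directly with optim() (assumptions: the Default data from ISLR is loaded as elsewhere in these slides, and balance is rescaled to thousands of dollars so the optimization is well-conditioned):

# Hand-rolled maximum likelihood for default ~ balance
y <- ifelse(Default$default == "Yes", 1, 0)
x <- Default$balance / 1000   # balance in $1000s

neg_log_lik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))  # p(x_i) for each observation
  -sum(y * log(p) + (1 - y) * log(1 - p))       # minus the log-likelihood
}

# Minimizing the negative log-likelihood maximizes the likelihood;
# the result should land near glm()'s estimates, roughly (-10.65, 5.50)
optim(c(0, 0), neg_log_lik)$par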
Rather than doing this by hand, we let R do the heavy lifting, using the logistic_reg() function in R with the glm engine:

# A tibble: 2 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) -10.7      0.361        -29.5 3.62e-191
2 balance       0.00550  0.000220      25.0 1.98e-137

What is our estimated probability of default for someone with a balance of $1000?
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | -10.6513306 | 0.3611574 | -29.49221 | 0 | 
| balance | 0.0054989 | 0.0002204 | 24.95309 | 0 | 
\[ \hat{p}(X) = \frac{e^{\hat{\beta}_0+\hat{\beta}_1X}}{1+e^{\hat{\beta}_0+\hat{\beta}_1X}}=\frac{e^{-10.65+0.0055\times 1000}}{1+e^{-10.65+0.0055\times 1000}}=0.006 \]
What is our estimated probability of default for someone with a balance of $2000?
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | -10.6513306 | 0.3611574 | -29.49221 | 0 | 
| balance | 0.0054989 | 0.0002204 | 24.95309 | 0 | 
\[ \hat{p}(X) = \frac{e^{\hat{\beta}_0+\hat{\beta}_1X}}{1+e^{\hat{\beta}_0+\hat{\beta}_1X}}=\frac{e^{-10.65+0.0055\times 2000}}{1+e^{-10.65+0.0055\times 2000}}=0.586 \]
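Both plug-in calculations are easy to check in R with plogis(), which computes the logistic function \(e^z/(1+e^z)\):

plogis(-10.65 + 0.0055 * c(1000, 2000))
# about 0.006 and 0.586 (up to rounding of the coefficients)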
Let’s refit the model to predict the probability of default given the customer is a student
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | -3.5041278 | 0.0707130 | -49.554219 | 0.0000000 | 
| studentYes | 0.4048871 | 0.1150188 | 3.520181 | 0.0004313 | 
\[P(\texttt{default = Yes}|\texttt{student = Yes}) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}}=0.0431\]
How will this change if student = No?
\[P(\texttt{default = Yes}|\texttt{student = No}) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}}=0.0292\]
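The same two probabilities can be read off the fitted model with predict(); a sketch, where student_fit is just our name for the refit model:

student_fit <- glm(default ~ student, data = Default, family = "binomial")
predict(student_fit,
        newdata = data.frame(student = c("Yes", "No")),
        type = "response")
# about 0.0431 and 0.0292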
What is going on here?
How would you get the odds from the log(odds)?
| Form | Model | 
|---|---|
| Logit form | \(\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1x\) | 
| Probability form | \(\Large\pi = \frac{e^{\beta_0 + \beta_1x}}{1+e^{\beta_0 + \beta_1x}}\) | 
| probability | odds | log(odds) | 
|---|---|---|
| \(\pi\) | \(\frac{\pi}{1-\pi}\) | \(\log\left(\frac{\pi}{1-\pi}\right)=l\) | 
And to go back the other direction:
| log(odds) | odds | probability | 
|---|---|---|
| \(l\) | \(e^l\) | \(\frac{e^l}{1+e^l} = \pi\) | 
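A quick sketch of these conversions in R, with \(\pi = 0.2\) as a made-up example:

p <- 0.2                 # probability
odds <- p / (1 - p)      # odds: 0.25
l <- log(odds)           # log(odds): about -1.39
exp(l)                   # back to the odds: 0.25
exp(l) / (1 + exp(l))    # back to the probability: 0.2 (same as plogis(l))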
A study investigated whether a handheld device that sends a magnetic pulse into a person’s head might be an effective treatment for migraine headaches.

- What is the explanatory variable? What type of variable is this?
- What is the outcome variable? What type of variable is this?
| Treatment | TMS | Placebo | Total | 
|---|---|---|---|
| Pain-free two hours later | 39 | 22 | 61 | 
| Not pain-free two hours later | 61 | 78 | 139 | 
| Total | 100 | 100 | 200 | 
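From the table, the odds of being pain-free two hours later are \(39/61\) for the TMS group and \(22/78\) for the placebo group. With pain-free as the referent outcome:

\(\Large OR = \frac{39/61}{22/78} = \frac{0.639}{0.282} = 2.27\)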
What if we wanted to calculate this in terms of Not pain-free (rather than pain-free) as the referent?
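With Not pain-free as the referent outcome, the odds flip to \(61/39\) and \(78/22\):

\(\Large OR = \frac{61/39}{78/22} = \frac{1.564}{3.545} = 0.44\)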
What changed here?
In general, it’s more natural to interpret odds ratios > 1; you can flip the referent to do so.
\(\Large OR = \frac{78/22}{61/39} = \frac{3.545}{1.564} = 2.27\)
The odds of still being in pain for the placebo group are 2.27 times the odds of being in pain for the TMS group.
Let’s look at some Titanic data. We are interested in whether the passenger’s reported sex (female or male) is related to whether they survived.
|  | Female | Male | Total |
|---|---|---|---|
| Survived | 308 | 142 | 450 | 
| Died | 154 | 709 | 863 | 
| Total | 462 | 851 | 1313 | 
What are the odds of surviving for females versus males?
\[\Large OR = \frac{308/154}{142/709} = \frac{2}{0.2003} = 9.99\]
How do you interpret this?
The odds of surviving for the female passengers were 9.99 times the odds of surviving for the male passengers.
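A one-line arithmetic check in R:

(308 / 154) / (142 / 709)
# about 9.99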
What if we wanted to fit a model? What would the equation be?
\[\Large \log(\textrm{odds of survival}) = \beta_0 + \beta_1 \textrm{Female}\]
How do you interpret this result?
logistic_reg() |>
  set_engine("glm") |>
  fit(Survived ~ Sex, data = Titanic) |>
  tidy(exponentiate = TRUE)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.200    0.0919     -17.5 1.70e-68
2 Sexfemale      9.99     0.135       17.1 2.91e-65

The odds of surviving for the female passengers were 9.99 times the odds of surviving for the male passengers.
What if the explanatory variable is continuous?
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -19.2       5.63     -3.41 0.000644
2 GPA             5.45      1.58      3.45 0.000553

A one-unit increase in GPA yields a 5.45 increase in the log odds of acceptance.
logistic_reg() |>
  set_engine("glm") |>
  fit(Acceptance ~ GPA, data = MedGPA) |>
  tidy(exponentiate = TRUE)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  4.56e-9      5.63     -3.41 0.000644
2 GPA          2.34e+2      1.58      3.45 0.000553

A one-unit increase in GPA yields a 234-fold increase in the odds of acceptance.
How could we get the odds associated with increasing GPA by 0.1?
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -19.2       5.63     -3.41 0.000644
2 GPA             5.45      1.58      3.45 0.000553

Since the log odds increase by 5.45 per unit of GPA, a 0.1 increase in GPA multiplies the odds by \(e^{5.45\times 0.1}=e^{0.545}\approx 1.73\): a one-tenth unit increase in GPA yields a 1.73-fold increase in the odds of acceptance.
Alternatively, we can rescale the predictor so that one unit of the new variable corresponds to a 0.1 increase in GPA:
MedGPA <- MedGPA |>
  mutate(GPA_10 = GPA * 10)
logistic_reg() |>
  set_engine("glm") |>
  fit(Acceptance ~ GPA_10, data = MedGPA) |>
  tidy(exponentiate = TRUE)

# A tibble: 2 × 5
  term             estimate std.error statistic  p.value
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 0.00000000456     5.63      -3.41 0.000644
2 GPA_10      1.73              0.158      3.45 0.000553

A one-tenth unit increase in GPA yields a 1.73-fold increase in the odds of acceptance.
Application Exercise

Using the Default data from the ISLR package, fit two logistic regression models that predict whether a customer defaults:

1. Using student as a predictor. Interpret the coefficient of studentYes.
2. Using balance as a predictor. Interpret the coefficient of balance.

Here is some code to get you started:
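A minimal sketch of starter code, assuming the ISLR and tidymodels packages are installed (the same logistic_reg() pattern used throughout these slides):

library(ISLR)        # Default data
library(tidymodels)  # logistic_reg(), set_engine(), tidy()

# Model 1: student as the predictor
logistic_reg() |>
  set_engine("glm") |>
  fit(default ~ student, data = Default) |>
  tidy()

# Model 2: balance as the predictor
logistic_reg() |>
  set_engine("glm") |>
  fit(default ~ balance, data = Default) |>
  tidy()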