Chapters 1 and 2
Intro and Chapter 1
👋
Tyler George
MWTh 3:05pm-4:05pm and by appt.
Course Website
Intros
- Name
- Major
- Fun OR boring fact
Statistical Learning Problems
- Identify risk factors for breast cancer
Statistical Learning Problems
- Customize an email spam detection system
- Data: 4601 labeled emails sent to George, who works at HP Labs
- Input features: frequencies of words and punctuation
Average percentage of words/characters in each email class:

| | george | you | hp | free | ! | edu | remove |
|---|---|---|---|---|---|---|---|
| spam | 0.00 | 2.26 | 0.02 | 0.52 | 0.51 | 0.01 | 0.28 |
| email | 2.27 | 1.27 | 0.90 | 0.07 | 0.11 | 0.29 | 0.01 |
Statistical Learning Problems
- Identify numbers in handwritten zip code
Statistical Learning Problems
- Establish the relationship between variables in population survey data
- Data: income survey data for males from the central Atlantic region of the US, 2009
Statistical Learning Problems
- Classify pixels of an image
Usage \(\in\) {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil}
✌️ types of statistical learning
Supervised Learning
Unsupervised Learning
Supervised Learning
- outcome variable: \(Y\), (dependent variable, response, target)
- predictors: vector of \(p\) predictors, \(X\), (inputs, regressors, covariates, features, independent variables)
- In the regression problem, \(Y\) is quantitative (e.g., price, blood pressure)
- In the classification problem, \(Y\) takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample)
- We have training data \((x_1, y_1), \dots, (x_N, y_N)\). These are observations (examples, instances) of these measurements
Supervised Learning
What do you think are some objectives here?
Objectives
- Accurately predict unseen test cases
- Understand which inputs affect the outcome, and how
- Assess the quality of our predictions and inferences
Unsupervised Learning
- No outcome variable, just a set of predictors (features) measured on a set of samples
- objective is more fuzzy – find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation
- difficult to know how well you are doing
- different from supervised learning, but can be useful as a pre-processing step for supervised learning
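As a concrete sketch of these objectives in R (the built-in `iris` data and the choice of \(k = 3\) clusters are just illustrative assumptions, not part of the slides):

```r
# Unsupervised learning sketch: no outcome variable, just features.
x <- scale(iris[, 1:4])           # standardize the four numeric features

# Find groups of samples that behave similarly (k = 3 is an assumption)
km <- kmeans(x, centers = 3, nstart = 20)
table(km$cluster)                 # cluster sizes

# Find linear combinations of features with the most variation
pc <- prcomp(x)
summary(pc)                       # proportion of variance per component
```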
Let’s take a tour - class website
- Concepts introduced:
- How to find slides
- How to find assignments
- How to find RStudio
- How to get help
- How to find policies
Chapter 2 - Statistical Learning
Regression and Classification
- Regression: quantitative response
- Classification: qualitative (categorical) response
Regression and Classification
What would be an example of a regression problem?
- Regression: quantitative response
- Classification: qualitative (categorical) response
Regression and Classification
What would be an example of a classification problem?
- Regression: quantitative response
- Classification: qualitative (categorical) response
Regression
Auto data
Above are `mpg` vs `horsepower`, `weight`, and `acceleration`, with a blue linear-regression line fit separately to each. Can we predict `mpg` using these three?
. . .
Maybe we can do better using a model:
\[\texttt{mpg} \approx f(\texttt{horsepower}, \texttt{weight}, \texttt{acceleration})\]
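A minimal sketch of how panels like those above could be produced in R, assuming the `Auto` data from the `ISLR2` package (the course materials may load the data differently):

```r
library(ISLR2)   # provides the Auto data (an assumption)

# One scatterplot of mpg per predictor, each with its own linear fit
par(mfrow = c(1, 3))
for (v in c("horsepower", "weight", "acceleration")) {
  plot(Auto[[v]], Auto$mpg, xlab = v, ylab = "mpg")
  abline(lm(Auto$mpg ~ Auto[[v]]), col = "blue")
}
```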
Notation
- `mpg` is the response variable (the outcome variable); we refer to this as \(Y\)
- `horsepower` is a feature (input, predictor); we refer to this as \(X_1\)
- `weight` is \(X_2\)
- `acceleration` is \(X_3\)
- Our input vector is:
- \(X = \begin{bmatrix} X_1 \\X_2 \\X_3\end{bmatrix}\)
- Our model is
- \(Y = f(X) + \varepsilon\)
- \(\varepsilon\) is our error
Why do we care about \(f(X)\)?
- We can use \(f(X)\) to make predictions of \(Y\) for new values of \(X = x\)
- We can gain a better understanding of which components of \(X = (X_1, X_2, \dots, X_p)\) are important for explaining \(Y\)
- Depending on how complex \(f\) is, maybe we can understand how each component ( \(X_j\) ) of \(X\) affects \(Y\)
How do we choose \(f(X)\)?
What is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 100\)? There can be many \(Y\) values at \(X = 100\).
How do we choose \(f(X)\)?
What is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 100\)? There can be many \(Y\) values at \(X = 100\).
How do we choose \(f(X)\)?
What is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 100\)? There can be many \(Y\) values at \(X = 100\).
- There are 17 points here; what value should I choose for \(f(100)\)? What do you think the blue dot represents?
How do we choose \(f(X)\)?
A good value is
\[f(100) = E(Y|X = 100)\]
. . .
\(E(Y|X = 100)\) means expected value (average) of \(Y\) given \(X = 100\)
. . .
This ideal \(f(x) = E(Y | X = x)\) is called the regression function
Regression function, \(f(X)\)
- Also works for a vector, \(X\); for example,
\[f(x) = f(x_1, x_2, x_3) = E[Y | X_1 = x_1, X_2 = x_2, X_3 = x_3]\]
- This is the optimal predictor of \(Y\) in terms of mean-squared prediction error
Regression function, \(f(X)\)
\(f(x) = E(Y|X = x)\) is the function that minimizes \(E[(Y - g(X))^2 |X = x]\) over all functions \(g\) at all points \(X = x\)
- \(\varepsilon = Y - f(x)\) is the irreducible error
- even if we knew \(f(x)\), we would still make errors in prediction, since at each \(X = x\) there is typically a distribution of possible \(Y\) values
Regression function, \(f(X)\)
Regression function, \(f(X)\)
Using these points, how would I calculate the regression function?
. . .
- Take the average! \(f(100) = E[\texttt{mpg}|\texttt{horsepower} = 100] = 19.6\)
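In R, "take the average" is just a conditional mean over the matching rows; a minimal sketch, assuming the `Auto` data as above:

```r
# Estimate f(100) = E[mpg | horsepower = 100] by averaging the
# observed mpg values at exactly horsepower == 100
mean(Auto$mpg[Auto$horsepower == 100])
```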
Regression function, \(f(X)\)
This point has a \(Y\) value of 32.9. What is \(\hat\varepsilon\)?
- \(\hat\varepsilon = Y - \hat{f}(X) = 32.9 - 19.6 = \color{red}{13.3}\)
The error
For any estimate, \(\hat{f}(x)\), of \(f(x)\), we have
\[E[(Y - \hat{f}(x))^2 | X = x] = \underbrace{[f(x) - \hat{f}(x)]^2}_{\textrm{reducible error}} + \underbrace{Var(\varepsilon)}_{\textrm{irreducible error}}\]
- Assume for a moment that both \(\hat{f}\) and \(X\) are fixed, so the only variability comes from \(\varepsilon\)
- \(E[(Y - \hat{Y})^2]\) represents the average, or expected value, of the squared difference between the predicted and actual value of \(Y\), and \(\textrm{Var}(\varepsilon)\) represents the variance associated with the error term
- The focus of this class is on techniques for estimating f with the aim of minimizing the reducible error.
- the irreducible error will always provide an upper bound on the accuracy of our prediction for Y
- This bound is almost always unknown in practice
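To see why the decomposition holds: write \(Y = f(x) + \varepsilon\) with \(E[\varepsilon] = 0\), and treat \(\hat{f}\) and \(x\) as fixed, so the cross term drops out:

\[\begin{aligned}
E[(Y - \hat{f}(x))^2 | X = x] &= E[(f(x) + \varepsilon - \hat{f}(x))^2]\\
&= [f(x) - \hat{f}(x)]^2 + 2[f(x) - \hat{f}(x)]\underbrace{E[\varepsilon]}_{=0} + E[\varepsilon^2]\\
&= \underbrace{[f(x) - \hat{f}(x)]^2}_{\textrm{reducible error}} + \underbrace{\textrm{Var}(\varepsilon)}_{\textrm{irreducible error}}
\end{aligned}\]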
Estimating \(f\)
- Typically we have very few (if any!) data points at \(X=x\) exactly, so we cannot compute \(E[Y|X=x]\)
- For example, what if we were interested in estimating miles per gallon when horsepower is 104?
. . .
💡 We can relax the definition and let
\[\hat{f}(x) = E[Y | X\in \mathcal{N}(x)]\]
- Where \(\mathcal{N}(x)\) is some neighborhood of \(x\)
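For the horsepower = 104 example, a minimal R sketch of this relaxed estimate (the window half-width of 5 is an arbitrary choice, not something the definition pins down):

```r
# f-hat(104) = average mpg over a neighborhood N(104)
x0 <- 104
nbhd <- abs(Auto$horsepower - x0) <= 5   # N(x): within +/- 5 horsepower
mean(Auto$mpg[nbhd])
```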
Notation pause!
\[\hat{f}(x) = \underbrace{E}_{\textrm{The expectation}}[\underbrace{Y}_{\textrm{of Y}} \underbrace{|}_{\textrm{given}} \underbrace{X\in \mathcal{N}(x)}_{\textrm{X is in the neighborhood of x}}]\]
. . .
🚨 If you need a notation pause at any point during this class, please let me know!
Estimating \(f\)
💡 We can relax the definition and let
\[\hat{f}(x) = E[Y | X\in \mathcal{N}(x)]\]
- Nearest neighbor averaging does pretty well with small \(p\) ( \(p\leq 4\) ) and large \(n\)
- Nearest neighbor is not great when \(p\) is large because of the curse of dimensionality (because nearest neighbors tend to be far away in high dimensions)
. . .
What do I mean by \(p\)? What do I mean by \(n\)?
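To see the curse of dimensionality concretely, here's a small simulation sketch (the uniform data and the particular dimensions are illustrative assumptions): as \(p\) grows, even the *nearest* neighbor of a point drifts far away.

```r
set.seed(1)
n <- 500
for (p in c(1, 4, 20, 100)) {
  X <- matrix(runif(n * p), n, p)   # n points in the unit cube [0,1]^p
  d <- as.matrix(dist(X))           # pairwise Euclidean distances
  diag(d) <- Inf                    # ignore each point's distance to itself
  cat("p =", p, " avg nearest-neighbor distance:",
      round(mean(apply(d, 1, min)), 2), "\n")
}
```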
Parametric models
A common parametric model is a linear model
\[f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p\]
- A linear model has \(p + 1\) parameters ( \(\beta_0,\dots,\beta_p\) )
- We estimate these parameters by fitting a model to training data
- Although this model is almost never correct, it can often be a good, interpretable approximation to the unknown true function, \(f(X)\)
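Fitting such a model in R is one call to `lm()`; a sketch using the `Auto` example from earlier (the new-car values passed to `predict()` are made up for illustration):

```r
# Estimate beta_0, ..., beta_3 by least squares on the training data
fit <- lm(mpg ~ horsepower + weight + acceleration, data = Auto)
coef(fit)    # the p + 1 = 4 estimated parameters

# Predict mpg for a hypothetical new car
predict(fit, newdata = data.frame(horsepower = 104,
                                  weight = 2500,
                                  acceleration = 16))
```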
Let’s look at a simulated example
- The red points are simulated values for `income` from the model:

\[\texttt{income} = f(\texttt{education}, \texttt{seniority}) + \varepsilon\]
- \(f\) is the blue surface
Linear regression model fit to the simulated data
\[\hat{f}_L(\texttt{education}, \texttt{seniority}) = \hat{\beta}_0 + \hat{\beta}_1\texttt{education}+\hat{\beta}_2\texttt{seniority}\]
- More flexible regression model \(\hat{f}_S(\texttt{education, seniority})\) fit to the simulated data.
- Here we use a technique called a thin-plate spline to fit a flexible surface
And even MORE flexible 😱 model \(\hat{f}(\texttt{education, seniority})\).
- Here we’ve basically drawn the surface to hit every point, minimizing the error, but completely overfitting
🤹 Finding balance
- Prediction accuracy versus interpretability
- Linear models are easy to interpret, thin-plate splines are not
- Good fit versus overfit or underfit
- How do we know when the fit is just right?
- Parsimony versus black-box
- We often prefer a simpler model involving fewer variables over a black-box predictor involving them all
Accuracy
We’ve fit a model \(\hat{f}(x)\) to some training data. We can measure accuracy as the average squared prediction error over that training data:

\[MSE_{\texttt{train}} = \textrm{Ave}_{i\in\texttt{train}}[y_i-\hat{f}(x_i)]^2\]
. . .
What can go wrong here?
- This may be biased towards overfit models
Accuracy
I have some training data, plotted above. What \(\hat{f}(x)\) would minimize the \(MSE_{\texttt{train}}\)?

\[MSE_{\texttt{train}} = \textrm{Ave}_{i\in\texttt{train}}[y_i-\hat{f}(x_i)]^2\]
Accuracy
I have some training data, plotted above. What \(\hat{f}(x)\) would minimize the \(MSE_{\texttt{train}}\)?

\[MSE_{\texttt{train}} = \textrm{Ave}_{i\in\texttt{train}}[y_i-\hat{f}(x_i)]^2\]
Accuracy
What is wrong with this?
. . .
It’s overfit!
Accuracy
If we get a new sample, that overfit model is probably going to be terrible!
Accuracy
- We’ve fit a model \(\hat{f}(x)\) to some training data.
- Instead of measuring accuracy as the average squared prediction error over that training data, we can compute it using fresh test data:

\[MSE_{\texttt{test}} = \textrm{Ave}_{i\in\texttt{test}}[y_i-\hat{f}(x_i)]^2\]
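A simulated sketch of the train/test gap (the data-generating function, noise level, and polynomial degrees are assumptions, not the book's figure): training MSE typically keeps falling as flexibility grows, while test MSE eventually turns back up.

```r
set.seed(42)
f <- function(x) sin(2 * x)                       # made-up "truth"
x_tr <- runif(100, 0, 3); y_tr <- f(x_tr) + rnorm(100, sd = 0.3)
x_te <- runif(100, 0, 3); y_te <- f(x_te) + rnorm(100, sd = 0.3)

for (deg in c(1, 3, 10, 20)) {                    # increasing flexibility
  fit <- lm(y_tr ~ poly(x_tr, deg))
  mse_train <- mean((y_tr - fitted(fit))^2)
  mse_test  <- mean((y_te - predict(fit, data.frame(x_tr = x_te)))^2)
  cat("degree", deg, " train MSE:", round(mse_train, 3),
      " test MSE:", round(mse_test, 3), "\n")
}
```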
Black curve is the “truth” on the left. Red curve on right is \(MSE_{\texttt{test}}\), grey curve is \(MSE_{\texttt{train}}\). Orange, blue and green curves/squares correspond to fits of different flexibility.
Here the truth is smoother, so the smoother fit and linear model do really well
Here the truth is wiggly and the noise is low, so the more flexible fits do the best
Bias-variance trade-off
- We’ve fit a model, \(\hat{f}(x)\), to some training data
- Let’s pull a test observation from this population ( \(x_0, y_0\) )
- The true model is \(Y = f(x) + \varepsilon\)
- \(f(x) = E[Y|X=x]\)
. . .
\[E(y_0 - \hat{f}(x_0))^2 = \textrm{Var}(\hat{f}(x_0)) + [\textrm{Bias}(\hat{f}(x_0))]^2 + \textrm{Var}(\varepsilon)\]
. . .
The expectation averages over the variability of \(y_0\) as well as the variability of the training data. \(\textrm{Bias}(\hat{f}(x_0)) =E[\hat{f}(x_0)]-f(x_0)\)
- As flexibility of \(\hat{f}\) \(\uparrow\), its variance \(\uparrow\) and its bias \(\downarrow\)
- choosing the flexibility based on average test error amounts to a bias-variance trade-off
- That U-shape we see for the test MSE curves is due to this bias-variance trade-off
- The expected test MSE for a given \(x_0\) can be decomposed into three components: the variance of \(\hat{f}(x_0)\), the squared bias of \(\hat{f}(x_0)\), and the variance of the error term \(\varepsilon\)
- Here the notation \(E[y_0 − \hat{f}(x_0)]^2\) defines the expected test MSE, and refers to the average test MSE that we would obtain if we repeatedly estimated \(f\) using a large number of training sets, and tested each at \(x_0\)
- The overall expected test MSE can be computed by averaging \(E[y_0 − \hat{f}(x_0)]^2\) over all possible values of \(x_0\) in the test set.
- We want to minimize the expected test error, so we need to pick a statistical learning method that simultaneously achieves low bias and low variance.
- Since both of these quantities are non-negative, the expected test MSE can never fall below Var( \(\varepsilon\) )
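A simulation sketch of the decomposition at a single point \(x_0\) (the true function, noise level, sample size, and degree-3 polynomial are all assumptions): refit the model on many fresh training sets, then check that variance + bias\(^2\) + \(\textrm{Var}(\varepsilon)\) matches a direct estimate of the expected test error.

```r
set.seed(1)
f <- function(x) sin(2 * x)            # made-up truth
x0 <- 1.5; sigma <- 0.3; B <- 2000     # test point, noise SD, # training sets

# f-hat(x0) across B independently drawn training sets
preds <- replicate(B, {
  x <- runif(50, 0, 3)
  y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ poly(x, 3))            # one fixed level of flexibility
  predict(fit, data.frame(x = x0))
})

variance <- var(preds)                 # Var(f-hat(x0))
bias2    <- (mean(preds) - f(x0))^2    # [Bias(f-hat(x0))]^2
c(decomposed = variance + bias2 + sigma^2,
  direct     = mean((f(x0) + rnorm(B, sd = sigma) - preds)^2))
```

The two numbers should agree up to simulation noise, illustrating \(E(y_0 - \hat{f}(x_0))^2 = \textrm{Var}(\hat{f}(x_0)) + [\textrm{Bias}(\hat{f}(x_0))]^2 + \textrm{Var}(\varepsilon)\).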
Bias-variance trade-off
Conceptual Idea
Watch StatQuest video: Machine Learning Fundamentals: Bias and Variance
Classification
Notation
- \(Y\) is the response variable. It is qualitative
- \(\mathcal{C}(X)\) is the classifier that assigns a class \(\mathcal{C}\) to some future unlabeled observation, \(X\)
- Examples:
- Email can be classified as \(\mathcal{C}=\{\texttt{spam}, \texttt{not spam}\}\)
- Written number is one of \(\mathcal{C}=\{0, 1, 2, \dots, 9\}\)
Classification Problem
What is the goal?
- Build a classifier \(\mathcal{C}(X)\) that assigns a class label from \(\mathcal{C}\) to a future unlabeled observation \(X\)
- Assess the uncertainty in each classification
- Understand the roles of the different predictors among \(X = (X_1, X_2, \dots, X_p)\)
Suppose there are \(K\) elements in \(\mathcal{C}\), numbered \(1, 2, \dots, K\)
\[p_k(x) = P(Y = k|X=x), k = 1, 2, \dots, K\] These are conditional class probabilities at \(x\)
. . .
How do you think we could calculate this?
. . .
- In the plot, you could examine the mini-barplot at \(x = 5\)
Suppose there are \(K\) elements in \(\mathcal{C}\), numbered \(1, 2, \dots, K\)
\[p_k(x) = P(Y = k|X=x), k = 1, 2, \dots, K\] These are conditional class probabilities at \(x\)
- The Bayes optimal classifier at \(x\) is
\[\mathcal{C}(x) = j \textrm{ if } p_j(x) = \textrm{max}\{p_1(x), p_2(x), \dots, p_K(x)\}\]
- Notice that this probability is a conditional probability
- It is the probability that \(Y\) equals \(k\) given the observed predictor vector, \(x\)
- Let’s say we were using a Bayes classifier for a two-class problem, where \(Y\) is 1 or 2. We would predict class 1 if \(P(Y=1|X=x_0)>0.5\) and class 2 otherwise
What if this was our data and there were no points at exactly \(x = 5\)? Then how could we calculate this?
- Nearest neighbor like before!
- This does break down as the dimension grows, but the impact on \(\mathcal{\hat{C}}(x)\) is less than on \(\hat{p}_k(x), k = 1,2,\dots,K\)
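A nearest-neighbor classification sketch in R, using `knn()` from the `class` package (the simulated two-class data and the choice \(k = 15\) are assumptions): the vote fraction among the \(k\) neighbors plays the role of \(\hat{p}_k(x)\), and the majority class is \(\mathcal{\hat{C}}(x)\).

```r
library(class)                          # provides knn()
set.seed(1)
x <- matrix(rnorm(400), ncol = 2)       # 200 training points in 2-D
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(200) > 0, 1, 2))
x0 <- matrix(c(0.5, 0.5), ncol = 2)     # a new unlabeled observation

pred <- knn(train = x, test = x0, cl = y, k = 15, prob = TRUE)
pred                  # estimated class C-hat(x0)
attr(pred, "prob")    # vote fraction for the winning class ~ p-hat_k(x0)
```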
Accuracy
- Misclassification error rate
\[Err_{\texttt{test}} = \frac{\#\textrm{ incorrect predictions}}{\#\textrm{ total predictions}} = \textrm{Ave}_{i\in\texttt{test}}I[y_i\neq \mathcal{\hat{C}}(x_i)]\]

- \(I(\cdot)\) is an indicator function; it is either 0 or 1.
- The Bayes Classifier using the true \(p_k(x)\) has the smallest error
- Some of the methods we (may) learn build structured models for \(\mathcal{C}(x)\) (support vector machines, for example)
- Some build structured models for \(p_k(x)\) (logistic regression, for example)
- The test error rate \(\textrm{Ave}_{i\in\texttt{test}}I[y_i\neq \mathcal{\hat{C}}(x_i)]\) is minimized, on average, by a very simple classifier: the one that assigns each observation to its most likely class, given its predictor values (that’s the Bayes classifier)
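Continuing the simulated `knn()` sketch from earlier (still hypothetical data), the test error rate is just the average of that indicator over a held-out test set:

```r
# Misclassification error rate on fresh test data
x_te <- matrix(rnorm(400), ncol = 2)
y_te <- factor(ifelse(x_te[, 1] + x_te[, 2] + rnorm(200) > 0, 1, 2))
pred_te <- knn(train = x, test = x_te, cl = y, k = 15)
mean(pred_te != y_te)    # Ave over test of I[y_i != C-hat(x_i)]
```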