Chapter 12 - Unsupervised Learning

Clustering

Tyler George

Cornell College
STA 362 Spring 2024 Block 8

Setup

library(tidymodels)
library(tidyverse)
library(palmerpenguins)
data(penguins)
library(tidyclust)
#install.packages('tidyclust')
#install.packages('factoextra')

Topics

Unsupervised learning
k-means clustering
k-medoids clustering
distance metrics
hierarchical clustering

Unsupervised learning

Grouping or categorizing observational units (objects) without any pre-assigned labels or scores (no outcome information!)

Some examples:

Latent Dirichlet Allocation: Topic Modeling of TSL Articles
Network Analysis: Political Books
Network & Clustering: Characters in ‘Love Actually’

\(k\)-means Clustering

\(k\)-means clustering is an unsupervised partitioning algorithm designed to find a partition of the observations such that the following objective function is minimized (find the smallest within cluster sum of squares):

\[\text{arg}\,\min\limits_{C_1, \ldots, C_k} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \overline{x}_{kj})^2 \Bigg\}\]

Monsters clustering

Monsters as cluster centers moving around throughout the k-means algorithm.

Artwork by @allison_horst.

A fun applet!!

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Algorithm: \(k\)-Means Clustering

Randomly assign a number, from 1 to \(k\), to each of the observations. These serve as initial cluster assignments for the observations.

Iterate until the cluster assignments stop changing:
1. For each of the \(k\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of the \(p\) feature means for the observations in the \(k^{th}\) cluster.

(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).

Ties? Do something consistent: for example, leave in the current cluster.

Convergence? Yes! (local…)

If a point is “closer” to a different center, moving it will lower the objective function.
Averages minimize squared differences, so taking the new average will result in a lower objective function.
If a point is equidistant from two clusters, the point won’t move.
The algorithm must converge in finite number of steps because there are finitely many points.

norm_clust %>%
  kmeans(centers = 2) %>%
  augment(norm_clust) %>%
  ggplot() + 
  geom_point(aes(x = x1, 
                 y = x2, 
                 color = .cluster)) +
  ggtitle("k-means (k=2) on raw data")

norm_clust %>%
  mutate(across(everything(), 
                scale)) %>%
  kmeans(centers = 2) %>%
  augment(norm_clust) %>%
  ggplot() + 
  geom_point(aes(x = x1, 
                 y = x2, 
                 color = .cluster)) +
  ggtitle("k-means (k=2) on raw data")

K-means tidymodels

penguins_recipe <- recipe(~.,data = penguins) |>
  update_role(all_predictors(), new_role = "predictor") |>
  step_select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)|>
  step_naomit(all_predictors()) |>
  step_scale(all_predictors()) 

kmeans_spec <- k_means(num_clusters =3)

# Create a workflow
wf <- workflow() |>
  add_recipe(penguins_recipe) |>
  add_model(kmeans_spec) #no engine since base R

# Train the model
km_fit <- wf |>
  fit(data = penguins)

k-means outputs

extract_cluster_assignment(km_fit) |>
  head()

# A tibble: 6 × 1
  .cluster 
  <fct>    
1 Cluster_1
2 Cluster_1
3 Cluster_1
4 Cluster_1
5 Cluster_1
6 Cluster_1

extract_centroids(km_fit)

# A tibble: 3 × 5
  .cluster  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Cluster_1           7.00          9.17              13.4        4.47
2 Cluster_2           8.70          9.50              14.0        4.87
3 Cluster_3           8.70          7.59              15.4        6.33

k-means outputs

km_summary<- extract_fit_summary(km_fit)
km_summary

$cluster_names
[1] Cluster_1 Cluster_2 Cluster_3
Levels: Cluster_1 Cluster_2 Cluster_3

$centroids
# A tibble: 3 × 4
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
           <dbl>         <dbl>             <dbl>       <dbl>
1           7.00          9.17              13.4        4.47
2           8.70          9.50              14.0        4.87
3           8.70          7.59              15.4        6.33

$n_members
[1] 132  87 123

$sse_within_total_total
[1] 122.1477 112.9852 143.1502

$sse_total
[1] 1364

$orig_labels
  [1] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [43] 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 2 1 2 1 1 1
 [85] 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1
[127] 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[169] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[211] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[253] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[295] 1 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
[337] 2 2 2 2 2 2

$cluster_assignments
  [1] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
  [9] Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
 [17] Cluster_2 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
 [25] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
 [33] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
 [41] Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
 [49] Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1
 [57] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1
 [65] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1
 [73] Cluster_2 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_1
 [81] Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
 [89] Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_1
 [97] Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1
[105] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_2 Cluster_1
[113] Cluster_2 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
[121] Cluster_1 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
[129] Cluster_2 Cluster_1 Cluster_2 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
[137] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1
[145] Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_1 Cluster_2 Cluster_3
[153] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[161] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[169] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[177] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[185] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[193] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[201] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[209] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[217] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[225] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[233] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[241] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[249] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[257] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[265] Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3 Cluster_3
[273] Cluster_3 Cluster_3 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[281] Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[289] Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_1 Cluster_2
[297] Cluster_1 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[305] Cluster_1 Cluster_2 Cluster_1 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[313] Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[321] Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[329] Cluster_1 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
[337] Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2 Cluster_2
Levels: Cluster_1 Cluster_2 Cluster_3

Measuring k-means fit

Sum of squared error

One simple metric is the within cluster sum-of-squared error (WSS), which measures the sum of all distances from observations to their cluster center.
This is sometimes scaled with the total sum-of-squared error (TSS), the distance from all observations to the global centroid; in particular, the ratio WSS/TSS is often computed.
Small values of WSS or of the WSS/TSS ratio suggest that the observations within clusters are closer (more similar) to each other than they are to the other clusters.
The WSS and TSS come “for free” with the model fit summary, or they can be accessed directly from the model fit.

WSS

Note we use the fit object, not summary.

km_fit |> sse_within_total()

# A tibble: 1 × 3
  .metric          .estimator .estimate
  <chr>            <chr>          <dbl>
1 sse_within_total standard        378.

km_fit |> sse_total()

# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 sse_total standard        1364

km_fit |> sse_ratio()

# A tibble: 1 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 sse_ratio standard       0.277

km_fit |> sse_within()

# A tibble: 3 × 3
  .cluster    wss n_members
  <fct>     <dbl>     <int>
1 Cluster_1  122.       132
2 Cluster_2  113.        87
3 Cluster_3  143.       123

Evaluating clustering (which \(k\)?)

Silhouette Width (use \(k\) with smallest silhouette width)
Elbow plot (use \(k\) at elbow on plot of \(k\) vs. within cluster sum of squares)
The silhouette of a single observation is proportional to the average distance from that observation to within-cluster observations minus the average distance to outside-cluster observations; normalized by the greater of these two average.
A large silhouette (close to 1) suggests that an observation is more similar to those within its cluster than those outside its cluster.
We can average all silhouettes to get a metric for the full clustering fit.

Silhouette width

Consider observation \(i \in\) cluster \(C_1\). Let

\[d(i, C_k) = \mbox{average dissimilarity of } i \mbox{ to all objects in cluster } C_k\] \[a(i) = \mbox{average dissimilarity of } i \mbox{ to all objects in } C_1.\] \[b(i) = \min_{C_k \ne C_1} d(i,C_k) = \mbox{distance to the next closest neighbor cluster}\] \[s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}\] \[\mbox{average}_{i \in C_1} s(i) = \mbox{average silhouette width for cluster } C_1\]

Note that if \(a(i) < b(i)\) then \(i\) is well classified with a maximum \(s(i) = 1\). If \(a(i) > b(i)\) then \(i\) is not well classified with a maximum \(s(i) = -1\).

Silhouette width - tidymodels

The computation of the silhouette depends on the original observation values, a dataset must also be supplied to the function.

# only needed because I didn't split and handle NAs before
penguins_na_drop <- penguins |> drop_na(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) 

    
km_fit %>%
  silhouette_avg(penguins_na_drop )

# A tibble: 1 × 3
  .metric        .estimator .estimate
  <chr>          <chr>          <dbl>
1 silhouette_avg standard       0.447

K Means Clustering, find K

pen_recipe2 <- recipe(~.,data = penguins_na_drop) |>
  update_role(all_predictors(), new_role = "predictor") |>
  step_select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)|>
  step_scale(all_predictors()) 

pen_cv <- vfold_cv(penguins_na_drop,5)

km_spec2 <- k_means(num_clusters = tune())

clust_num_grid <- grid_regular(num_clusters(),levels = 10)

wf_tune <- workflow()|>
  add_recipe(pen_recipe2)|>
  add_model(km_spec2)
  
res <- tune_cluster(
  wf_tune,
  resamples = pen_cv,
  grid = clust_num_grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio, silhouette_avg))

K Means Clustering, find K

res_metrics <- res %>% collect_metrics()
res_metrics

# A tibble: 40 × 7
   num_clusters .metric          .estimator     mean     n  std_err .config              
          <int> <chr>            <chr>         <dbl> <int>    <dbl> <chr>                
 1            1 silhouette_avg   standard    NaN         0 NA       Preprocessor1_Model01
 2            1 sse_ratio        standard      1         5  0       Preprocessor1_Model01
 3            1 sse_total        standard   1090.        5  0.980   Preprocessor1_Model01
 4            1 sse_within_total standard   1090.        5  0.980   Preprocessor1_Model01
 5            2 silhouette_avg   standard      0.531     5  0.00325 Preprocessor1_Model02
 6            2 sse_ratio        standard      0.413     5  0.00431 Preprocessor1_Model02
 7            2 sse_total        standard   1090.        5  0.980   Preprocessor1_Model02
 8            2 sse_within_total standard    451.        5  4.46    Preprocessor1_Model02
 9            3 silhouette_avg   standard      0.440     5  0.00431 Preprocessor1_Model03
10            3 sse_ratio        standard      0.341     5  0.0177  Preprocessor1_Model03
# ℹ 30 more rows

K Means Clustering, find K

res_metrics %>%
  filter(.metric == "silhouette_avg") %>%
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("silhouette_avg") +
  xlab("Number of clusters") +
  scale_x_continuous(breaks = 1:10)

Distance metric (mathematically)

\(d({\bf x}, {\bf y}) \geq 0\)
\(d({\bf x}, {\bf y}) = d({\bf y}, {\bf x})\)
\(d({\bf x}, {\bf y}) = 0\) iff \({\bf x} = {\bf y}\)
\(d({\bf x}, {\bf y}) \leq d({\bf x}, {\bf z}) + d({\bf z}, {\bf y})\) for all other vectors \({\bf z}\).

Distance measures (clustering)

Euclidean Distance

\[d_E({\bf x}, {\bf y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}\]

Pearson Correlation Distance

\[d_P({\bf x}, {\bf y}) = 1 - r_P ({\bf x}, {\bf y})\] \[\mbox{ or } d_P({\bf x}, {\bf y}) = 1 - |r_P ({\bf x}, {\bf y})|\] \[\mbox{ or } d_P({\bf x}, {\bf y}) = 1 - (r_P ({\bf x}, {\bf y}))^2\]

Correlation distance isn’t a distance metric!

x1 <- c(1,2,3)
x2 <- c(1, 4, 10)
x3 <- c(9, 2, 2)

# d(1,2)
1 - cor(x1, x2)

[1] 0.01801949

# d(1,3)
1 - cor(x1, x3)

[1] 1.866025

# d(2,3)
1 - cor(x2, x3)

[1] 1.755929

# d(1,3) > d(1,2) + d(2,3)
1 - cor(x1, x2) + 1 - cor(x2, x3)

[1] 1.773948

Correlation distance isn’t a distance metric!

Using absolute distance doesn’t fix things.

# d(1,2)
1 - abs(cor(x1, x2))

[1] 0.01801949

# d(1,3)
1 - abs(cor(x1, x3))

[1] 0.1339746

# d(2,3)
1 - abs(cor(x2, x3))

[1] 0.2440711

# d(2,3) > d(1,2) + d(1,3)
1 - abs(cor(x1, x2)) + 1 - abs(cor(x1, x3))

[1] 0.1519941

Cosine Distance (for clustering)

\[d_C({\bf x}, {\bf y}) = \frac{{\bf x} \cdot {\bf y}}{|| {\bf x} || ||{\bf y}||}\] \[= \frac{\sum_{i=1}^p x_i y_i}{\sqrt{\sum_{i=1}^p x_i^2 \sum_{i=1}^p y_i^2}}\] \[= 1 - r_P ({\bf x}, {\bf y}) \ \ \ \ \mbox{if } \overline{\bf x} = \overline{\bf y} = 0\] * Hamming Distance

\[\begin{align} d_H({\bf x}, {\bf y}) = \sum_{i=1}^p I(x_i \ne y_i) \end{align}\]

The Hamming distance across the two DNA strands is 7.

`dist` function in R

The function dist in R calculates the distances given above.

String distances

https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html

Comparison of string distance metrics from https://www.kdnuggets.com/2019/01/comparison-text-distance-metrics.html.

k-Medoids

Shortcomings of k-means

Center is calculated as average (establishes Euclidean distance)
Because center changes, distances must be re-calculated
Really, only Euclidean distance makes sense

Partitioning Around Medoids (PAM)

Find the observations (data values!) \(m_k\) that solve:

\[\text{arg}\,\min\limits_{C_1, \ldots, C_k} \Bigg\{ \sum_{k=1}^K \sum_{i \in C_k}d(x_i, m_k) \Bigg\}\]

Not well implemented in R

Hierarchical Clustering

is a set of nested clusters that are organized as a tree. Note that objects that belong to a child cluster also belong to the parent cluster.

Stat Quest: https://www.youtube.com/watch?v=7xHsRkOdVwo

Algorithm: Agglomerative Hierarchical Clustering Algorithm

Begin with \(n\) observations and a measure (such as Euclidean distance) of all the \({n \choose 2} = n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.

For \(i = n, n - 1, \ldots , 2\):
1. Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.

Compute the new pairwise inter-cluster dissimilarities among the \(i - 1\) remaining clusters.

Definitions

Agglomerative methods start with each object (e.g., gene, penguin, etc.) in its own group. Groups are merged until all objects are together in one group.

Divisive methods start with all objects in one group and break up the groups sequentially until all objects are individuals.

Single Linkage algorithm defines the distance between groups as that of the closest pair of individuals.

Complete Linkage algorithm defines the distance between groups as that of the farthest pair of individuals.

Average Linkage algorithm defines the distance between groups as the average of the distances between all pairs of individuals across the groups.

Hierarchical clustering in R - tidymodels

# Define a hierarchical clustering model
hc_spec <- hier_clust(linkage_method = "complete",
                      num_clusters = 3)

# Create a workflow
workflow <- workflow() |>
  add_recipe(pen_recipe2) |>
  add_model(hc_spec)

# fit
hc_fit <- workflow |>
  fit(data = penguins_na_drop)

Output

hc_summary <- hc_fit %>% extract_fit_summary()
hc_summary %>% str()

List of 7
 $ cluster_names         : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 2 3
 $ centroids             : tibble [3 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ bill_length_mm   : num [1:3] 7.2 8.7 9.15
  ..$ bill_depth_mm    : num [1:3] 9.27 7.59 9.4
  ..$ flipper_length_mm: num [1:3] 13.5 15.4 14
  ..$ body_mass_g      : num [1:3] 4.6 6.33 4.71
 $ n_members             : int [1:3] 165 123 54
 $ sse_within_total_total: num [1:3] 169.5 120.9 49.9
 $ sse_total             : num 656
 $ orig_labels           : NULL
 $ cluster_assignments   : Factor w/ 3 levels "Cluster_1","Cluster_2",..: 1 1 1 1 1 1 1 1 1 1 ...

# This depends on our linkage chosen
hc_preds <- hc_fit %>% predict(penguins_na_drop)
hc_preds

# A tibble: 342 × 1
   .pred_cluster
   <fct>        
 1 Cluster_1    
 2 Cluster_1    
 3 Cluster_1    
 4 Cluster_1    
 5 Cluster_1    
 6 Cluster_1    
 7 Cluster_1    
 8 Cluster_1    
 9 Cluster_3    
10 Cluster_1    
# ℹ 332 more rows

hc_fit  |> extract_cluster_assignment(num_clusters = 3)

# A tibble: 342 × 1
   .cluster 
   <fct>    
 1 Cluster_1
 2 Cluster_1
 3 Cluster_1
 4 Cluster_1
 5 Cluster_1
 6 Cluster_1
 7 Cluster_1
 8 Cluster_1
 9 Cluster_1
10 Cluster_1
# ℹ 332 more rows

hc_preds <- hc_fit |> augment(penguins_na_drop)
head(hc_preds)

# A tibble: 6 × 9
  .pred_cluster species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>         <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
1 Cluster_1     Adelie  Torger…           39.1          18.7               181        3750
2 Cluster_1     Adelie  Torger…           39.5          17.4               186        3800
3 Cluster_1     Adelie  Torger…           40.3          18                 195        3250
4 Cluster_1     Adelie  Torger…           36.7          19.3               193        3450
5 Cluster_1     Adelie  Torger…           39.3          20.6               190        3650
6 Cluster_1     Adelie  Torger…           38.9          17.8               181        3625
# ℹ 2 more variables: sex <fct>, year <int>

Graph

library(factoextra)

hc_fit  %>%
  extract_fit_engine() %>%
  fviz_dend(main = "complete", k = 3)

`Clustering`

Perform Kmeans and Hierarchical Clustering on the iris dataset (use data("iris"))

Chapter 12 - Unsupervised Learning

Setup

Topics

Unsupervised learning

Some examples:

\(k\)-means Clustering

Monsters clustering

A fun applet!!

Algorithm: \(k\)-Means Clustering

Convergence? Yes! (local…)

Scaling

K-means tidymodels

k-means outputs

k-means outputs

Measuring k-means fit

WSS

Evaluating clustering (which \(k\)?)

Silhouette width

Silhouette width - tidymodels

K Means Clustering, find K

K Means Clustering, find K

K Means Clustering, find K

Distance metric (mathematically)

Distance measures (clustering)

Correlation distance isn’t a distance metric!

Correlation distance isn’t a distance metric!

Cosine Distance (for clustering)

dist function in R

String distances

k-Medoids

Partitioning Around Medoids (PAM)

Hierarchical Clustering

Algorithm: Agglomerative Hierarchical Clustering Algorithm

Definitions

Hierarchical clustering in R - tidymodels

Output

Graph

Clustering

`dist` function in R

`Clustering`