Single linkage: distance between two clusters is defined as the minimum distance between any single member of one cluster and any single member of the other cluster.
Complete linkage: distance between two clusters is defined as the maximum distance between any single member of one cluster and any single member of the other cluster.
Average linkage: distance between two clusters is defined as the average distance between all pairs of members from the two clusters.
Ward’s linkage: distance between two clusters is defined based on the increase in the total within-cluster variance that would result from merging the two clusters.
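The four definitions above can be checked by hand on a tiny example. The sketch below uses two made-up clusters, A and B, which are illustrative only and are not taken from the penguins data; the helper wss() is a hypothetical name introduced here for the within-cluster sum of squares.

```r
# Two small, made-up clusters in 2D (illustrative data, not the penguins)
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(4, 0), c(6, 0))

# All pairwise Euclidean distances between members of A and members of B
pair_dist <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

single_d   <- min(pair_dist)   # closest pair of members
complete_d <- max(pair_dist)   # farthest pair of members
average_d  <- mean(pair_dist)  # mean over all pairs

# Ward: increase in the total within-cluster sum of squares caused by the merge
# (wss is a hypothetical helper defined only for this sketch)
wss <- function(X) sum(sweep(X, 2, colMeans(X))^2)
ward_increase <- wss(rbind(A, B)) - (wss(A) + wss(B))

c(single = single_d, complete = complete_d,
  average = average_d, ward = ward_increase)
```

For these points the pairwise distances are 4, 6, 3 and 5, so the single, complete and average linkage distances are 3, 6 and 4.5, and merging the two clusters raises the within-cluster sum of squares by 20.25.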
Selection:
Average, complete, and single linkage are the most popular among statisticians.
Average and complete linkage are generally preferred over single linkage, as they tend to yield more balanced dendrograms.
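The tendency of single linkage to produce unbalanced trees can be seen on a small example. The sketch below uses made-up one-dimensional data with slowly growing gaps (an assumption chosen to make the effect visible, not the penguins data) and compares the cluster sizes obtained with k = 2 for each method.

```r
# Made-up 1D data with slowly growing gaps (illustrative, not the penguins)
x <- c(0, 1, 2.1, 3.3, 4.6, 6.0, 7.5, 9.1)
d <- dist(x)

# Cluster sizes at k = 2 for each linkage method
methods <- c("single", "complete", "average")
sizes <- sapply(methods, function(m) {
  as.vector(table(cutree(hclust(d, method = m), k = 2)))
})
sizes
```

On this data single linkage chains neighbouring points together and splits off only the last point (sizes 7 and 1), while complete and average linkage both give two clusters of 4 points each.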
Different linkage for penguins data
Full penguins data
hclust_df <- penguins_clean[, 3:6] |> scale()
# compute the distance
hclust_dist <- hclust_df |> dist()
# compute the hierarchical clustering
hclust_results <- hclust(hclust_dist, method = "average")
hclust_results
Call:
hclust(d = hclust_dist, method = "average")
Cluster method : average
Distance : euclidean
Number of objects: 333
The base R function cutree() can be used to cut the dendrogram into a specified number of clusters.
# cut the dendrogram to get 2 clusters
hclust_clusters <- cutree(hclust_results, k = 2)
dt <- penguins_clean |> mutate(cluster = as.factor(hclust_clusters))
dt |> count(cluster, species)
# A tibble: 3 × 3
  cluster species       n
  <fct>   <fct>     <int>
1 1       Adelie      146
2 1       Chinstrap    68
3 2       Gentoo      119
dt |>
  ggplot(aes(x = bill_dep, y = flipper_len, color = cluster)) +
  geom_point()
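Besides a number of clusters k, cutree() also accepts a cut height h. The sketch below shows both options side by side. It uses the built-in iris measurements as a stand-in for the penguins data; that substitution is an assumption made only so the example runs without extra packages or data, and the object names are illustrative.

```r
# Built-in iris measurements stand in for the penguins here (an assumption,
# chosen only so that the example runs without extra packages)
X <- scale(iris[, 1:4])
hc <- hclust(dist(X), method = "average")

# Cut by number of clusters
by_k <- cutree(hc, k = 3)

# Cut by height: a height between the last two merges leaves 2 clusters
n_merges <- length(hc$height)
h_cut <- mean(hc$height[(n_merges - 1):n_merges])
by_h <- cutree(hc, h = h_cut)

# Cross-tabulate the clusters against the known species labels
table(cluster = by_k, species = iris$Species)
```

Cutting by height is convenient when the dendrogram suggests a natural threshold, whereas cutting by k is simpler when the number of groups is fixed in advance.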