Hierarchical clustering is an effective method for exploratory data analysis: it builds a hierarchy of clusters based on the similarity of the samples in a dataset.
The idea behind hierarchical clustering is intuitive. Assume we are analyzing a dataset containing n samples. We start by assigning each sample to its own single-element cluster (so we have n starting clusters). In the first step of the clustering procedure, the two most similar clusters are merged, leaving n-1 clusters (more precisely, n-2 singleton clusters and 1 merged cluster). The method then proceeds as before: find the two closest clusters, merge them, and repeat. Starting with n samples in the dataset, there will always be n-1 merges.
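The n-1 merges can be verified directly on a small, invented dataset:

```r
# toy data, for illustration only: 6 samples, 5 variables
set.seed(42)
m <- matrix(rnorm(30), nrow = 6)
hc <- hclust(dist(m))   # agglomerative hierarchical clustering
nrow(hc$merge)          # one row per merge: 6 samples -> 5 merges
```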
In R, a hierarchical clustering analysis can be easily performed by calling the hclust function.
# in this example, my_eset is an ExpressionSet-class object
hc <- hclust(dist(t(exprs(my_eset))))
This returns an hclust-class object, hc. This object is a list containing all the information required to build the hierarchical clustering tree; its components include:
- hc$merge is a 2-column matrix describing the merges. Each row corresponds to a step of the clustering process: for example, idx <- hc$merge[x,] gives the indices of the two elements merged at step x. If idx[i] is negative, it points to the sample with index (-1)*idx[i]. If idx[i] is positive, it points to the cluster created in a previous step, namely step idx[i] (described by row hc$merge[idx[i],]).
- hc$height is a numeric vector containing the heights of the merges: hc$height[j] is the height of merge j, which links together the elements indicated in hc$merge[j,].
- hc$labels contains the labels of the samples.
- hc$order is a vector giving an ordering of the samples such that the branches of the hierarchical clustering tree will not cross when it is drawn.
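These components can be inspected directly; the toy data below are invented for illustration:

```r
set.seed(42)
m <- matrix(rnorm(30), nrow = 6,
            dimnames = list(paste0("sample", 1:6), NULL))
hc <- hclust(dist(m))

hc$merge   # 5 x 2 matrix; negative entries are samples, positive entries earlier merges
hc$height  # the 5 merge heights, in merge order
hc$labels  # "sample1" ... "sample6", taken from the row names
hc$order   # left-to-right positions of the leaves in the plotted tree
```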
We can plot a basic hierarchical clustering tree using the plot command. Alternatively, we can use the colorhcplot package, available on CRAN. The colorhcplot function requires two arguments: an hclust-class object (hc) and a factor (fac) describing the sample groups within the dataset.
# standard Cluster Dendrogram
plot(hc)
# color Cluster Dendrogram
install.packages("colorhcplot")
library(colorhcplot)
colorhcplot(hc, fac)
Features of colorhcplot include:
- sample labels are colored according to their group;
- a legend helps identify the groups;
- homogeneous clusters (i.e., clusters whose samples all belong to the same group) are connected by colored lines
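The fac argument can be built in any way that matches the column order of the expression matrix; one common pattern (sample names invented here for illustration) is to derive it from the sample labels:

```r
# hypothetical sample names: three controls and three treated samples
labels <- c("ctrl_1", "ctrl_2", "ctrl_3", "treat_1", "treat_2", "treat_3")
fac <- factor(sub("_[0-9]+$", "", labels))  # strip the replicate number
levels(fac)                                 # "ctrl" "treat"
```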
To draw a color hierarchical clustering tree, colorhcplot performs the following operations. It starts by re-ordering all samples and merges according to the hc$order vector.
new_labels <- hc$labels[hc$order]
new_fac <- fac[hc$order]
new_merge <- (-1)*hc$merge
for (n in 1:nrow(new_merge)) {
  for (k in 1:ncol(new_merge)) {
    if (new_merge[n,k] > 0) {
      # after the sign flip, positive values are sample indexes:
      # replace each with the sample's position in hc$order
      new_merge[n,k] <- which(hc$order == new_merge[n,k])
    }
  }
}
new_merge <- (-1)*new_merge
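On toy data we can check what this re-indexing achieves: after the loop, every negative entry of new_merge refers to a leaf's position along the x-axis, and hc$order maps it back to the original sample index (data and checks are illustrative only):

```r
set.seed(42)
hc <- hclust(dist(matrix(rnorm(30), nrow = 6)))

new_merge <- (-1) * hc$merge
for (n in 1:nrow(new_merge)) {
  for (k in 1:ncol(new_merge)) {
    if (new_merge[n, k] > 0) {
      new_merge[n, k] <- which(hc$order == new_merge[n, k])
    }
  }
}
new_merge <- (-1) * new_merge

# map each leaf position back to the sample it displays
leaf_positions <- (-1) * new_merge[new_merge < 0]
hc$order[leaf_positions]   # the original sample indices, one per leaf
```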
The next step is to merge clusters in a bottom-up fashion. The function walks through the new_merge matrix and draws the tree, keeping track of the middle point of each cluster-merging segment as it goes. As stated before, negative values in the new_merge matrix point to individual samples (leaves), while positive values point to merges completed earlier.
# assuming a suitable empty plot has already been created
middle_point <- c()
for (i in 1:nrow(new_merge)) {
  # define x0 and x1 of each merge/segment in a 2-element vector
  x01 <- c()
  for (j in c(1, 2)) {
    if (new_merge[i,j] < 0) {
      # leaf: the x-coordinate is the sample's position
      x01[j] <- (-1)*new_merge[i,j]
    } else {
      # earlier merge: the x-coordinate is its middle point
      x01[j] <- middle_point[new_merge[i,j]]
    }
  }
  # keep track of the middle point of this merge
  middle_point <- c(middle_point, (x01[1]+x01[2])/2)
  # draw the horizontal segment
  segments(x01[1], hc$height[i], x01[2], hc$height[i], lwd=3)
}
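The bookkeeping can be checked without an open graphics device by computing the endpoints only (toy data as before; the segments() calls are left out):

```r
set.seed(42)
hc <- hclust(dist(matrix(rnorm(30), nrow = 6)))

# same re-indexing as above: negative entries become leaf positions
new_merge <- (-1) * hc$merge
for (n in 1:nrow(new_merge))
  for (k in 1:ncol(new_merge))
    if (new_merge[n, k] > 0)
      new_merge[n, k] <- which(hc$order == new_merge[n, k])
new_merge <- (-1) * new_merge

middle_point <- c()
for (i in 1:nrow(new_merge)) {
  x01 <- c()
  for (j in c(1, 2)) {
    if (new_merge[i, j] < 0) {
      x01[j] <- (-1) * new_merge[i, j]          # leaf: x is its position
    } else {
      x01[j] <- middle_point[new_merge[i, j]]   # earlier merge: its midpoint
    }
  }
  middle_point <- c(middle_point, (x01[1] + x01[2]) / 2)
}
middle_point   # one x-coordinate per merge, all within [1, 6]
```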
Now we can easily draw the vertical lines.
# the default hang in an hclust plot is 0.1, i.e., 10% of the total
# height of the plot; hang can easily be adjusted
hang <- 0.1 * (max(hc$height) - min(hc$height))
for (i in 1:nrow(new_merge)) {
  for (j in c(1, 2)) {
    if ((x <- new_merge[i,j]) < 0) {
      # leaf: draw the hanging segment below the merge height
      segments((-1)*x, hc$height[i], (-1)*x, hc$height[i] - hang, lwd=3)
    } else {
      # earlier merge: connect its height to the current merge height
      segments(middle_point[x], hc$height[i], middle_point[x], hc$height[x], lwd=3)
    }
  }
}
The hierarchical clustering tree is complete. We can modify the code above to introduce colors.
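As a sketch of that modification (toy data, invented group assignments, and an invented palette): track the group of each merged cluster, then pass a group color to the segments() calls only when a merge is homogeneous, which is essentially what colorhcplot does for the colored lines:

```r
set.seed(42)
hc <- hclust(dist(matrix(rnorm(30), nrow = 6)))
fac <- factor(c("A", "A", "A", "B", "B", "B"))  # invented groups
group_cols <- c(A = "red2", B = "dodgerblue")   # invented palette

# group of the cluster created at step i: kept only when both
# children belong to the same group, otherwise NA (heterogeneous)
cluster_grp <- rep(NA_character_, nrow(hc$merge))
grp_of <- function(idx) {
  if (idx < 0) as.character(fac[(-1) * idx]) else cluster_grp[idx]
}
for (i in 1:nrow(hc$merge)) {
  g <- c(grp_of(hc$merge[i, 1]), grp_of(hc$merge[i, 2]))
  if (!any(is.na(g)) && g[1] == g[2]) cluster_grp[i] <- g[1]
}

# seg_cols[i] would then be passed as col = seg_cols[i] to the
# segments() calls for merge i in the drawing loops above
seg_cols <- ifelse(is.na(cluster_grp), "black", group_cols[cluster_grp])
```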