Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

please prove that if these two methods of computing cluster distance are equival

ID: 3738822 • Letter: P

Question

please prove that if these two methods of computing cluster distance are equivalent:

i. Calculate the cluster centroids (mean point of all points in a cluster) and merge the two clusters with the closest centroids

ii. Calculate the average pairwise distance between all points across pairs of clusters, and merge those clusters with the smallest average pairwise distance.

If the two are equivalent, prove it mathematically. If the two are different, supply a (simple) case where the two measures will select different clusters to merge, give the average pairwise distances and cluster centroid distances, and show a plot of your constructed dataset.

(you can choose your own set of clusters for proving)

(The answer should be 2 methods are different, I just want to know how we get that )

Explanation / Answer

Single-Link, Complete-Link & Average-Link Clustering

Hierarchical clustering treats each data point as a singleton cluster, and then successively merges clusters until all points have been merged into a single remaining cluster. A hierarchical clustering is often represented as a dendrogram (from Manning et al. 1999).

In complete-link (or complete linkage) hierarchical clustering, we merge in each step the two clusters whose merger has the smallest diameter (or: the two clusters with the smallest maximum pairwise distance).

In single-link (or single linkage) hierarchical clustering, we merge in each step the two clusters whose two closest members have the smallest distance (or: the two clusters with the smallest minimum pairwise distance).

Complete-link clustering can also be described using the concept of clique. Let dn be the diameter of the cluster created in step n of complete-link clustering. Define graph G(n) as the graph that links all data points with a distance of at most dn. Then the clusters after step n are the cliques of G(n). This motivates the term complete-link clustering.

Single-link clustering can also be described in graph theoretical terms. If dn is the distance of the two clusters merged in step n, and G(n) is the graph that links all data points with a distance of at most dn, then the clusters after step n are the connected components of G(n). A single-link clustering also closely corresponds to a weighted graph's minimum spanning tree.