The pre-clustering uses a method inspired by k-means clustering. Standard k-means clustering aims to find the cluster assignments that minimize the sum of distances between the clustered objects (nodes) and the cluster centers. The WGCNA pre-clustering, named projective k-means, instead maximizes the correlations between nodes and cluster centroids. The cluster centroids are defined as the “eigengenes” (or eigen-nodes) of each cluster, that is, the first singular vectors obtained from a Singular Value Decomposition (SVD) of the standardized data matrix of the nodes within each cluster. We chose to maximize correlations rather than minimize distances because the networks used in WGCNA are based on correlations rather than distances. Defining centroids via SVD ensures that the centroid is the profile with maximal mean square correlation with the cluster members. This is analogous to standard k-means, where the geometrical cluster center is the profile that minimizes the average squared distance to all cluster members. The projective k-means method accommodates both signed and unsigned networks: for unsigned networks, it maximizes the absolute value of correlations rather than the correlations themselves.
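WGCNA itself is implemented in R; as an illustration of the idea above, here is a minimal Python/NumPy sketch of one projective k-means loop. This is not the actual WGCNA implementation — the function names, the random initialization, and the handling of empty clusters are my own simplifications — but it shows the two alternating steps: centroids as eigen-nodes (first singular vectors) and reassignment by maximal (absolute) correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(X):
    # column-standardize: each node (column) gets mean 0, sd 1
    return (X - X.mean(axis=0)) / X.std(axis=0)

def eigennode(X):
    # cluster centroid ("eigen-node"): first left singular vector of the
    # standardized samples-by-nodes matrix of the cluster members
    U, _, _ = np.linalg.svd(standardize(X), full_matrices=False)
    return U[:, 0]

def projective_kmeans(X, k, n_iter=20, signed=False):
    n_samples, n_nodes = X.shape
    labels = rng.integers(0, k, size=n_nodes)   # random initial assignment
    Xs = standardize(X)
    for _ in range(n_iter):
        # recompute centroids as eigen-nodes of the current clusters;
        # re-seed any empty cluster with a random profile
        centroids = np.column_stack([
            eigennode(X[:, labels == j]) if np.any(labels == j)
            else rng.standard_normal(n_samples)
            for j in range(k)
        ])
        # Pearson correlation of every node with every centroid
        C = standardize(centroids).T @ Xs / n_samples   # k x n_nodes
        if not signed:
            C = np.abs(C)       # unsigned networks: maximize |correlation|
        new_labels = C.argmax(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

On data with clear correlation structure (e.g. two groups of nodes driven by two independent latent profiles), this sketch recovers the groups in a handful of iterations.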

Why is (projective) k-means feasible for large data sets? The memory requirement for k-means with \(n\) nodes and \(N_{clusters}\) clusters scales as \(n\times N_{clusters}\). While \(n\) can be large, the number of clusters is rarely very high. One could set the number of clusters as low as the number of blocks necessary to accommodate the \(n\) nodes in blocks no larger than a given maximum size, but I found that the method works better if the initial number of clusters is larger (say a few hundred) and the initial clusters are iteratively merged into larger blocks such that no block exceeds the required maximum size. This pre-clustering is better in the sense that nodes with strong connections are more likely to end up in the same block. Thus, the WGCNA pre-clustering is really a combination of projective k-means (initial step) and hierarchical clustering (final step).
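The final merging step can be sketched as a greedy agglomeration of the initial clusters under a block-size cap. The sketch below is illustrative only — the function name, the use of average linkage between centroids, and the greedy scheme are my own simplifications, not the actual WGCNA merging code — but it conveys the idea: keep merging the most similar blocks as long as the result stays under the maximum block size.

```python
import numpy as np

def merge_into_blocks(sizes, C, max_size):
    # sizes: number of nodes in each initial cluster
    # C: symmetric matrix of centroid correlations (abs values for unsigned)
    # max_size: maximum number of nodes allowed in a block
    blocks = [[i] for i in range(len(sizes))]
    while True:
        best, best_sim = None, -1.0
        for a in range(len(blocks)):
            for b in range(a + 1, len(blocks)):
                # skip merges that would exceed the block-size cap
                if sum(sizes[i] for i in blocks[a] + blocks[b]) > max_size:
                    continue
                # average-linkage similarity between the two blocks
                sim = np.mean([C[i, j] for i in blocks[a] for j in blocks[b]])
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        if best is None:        # no admissible merge remains
            return blocks
        a, b = best
        blocks[a] += blocks.pop(b)
```

For example, with four initial clusters of 30 nodes each, a block cap of 60, and centroid correlations that pair clusters 0–1 and 2–3, the sketch returns the two blocks `[0, 1]` and `[2, 3]`.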

For consensus analysis of several data sets, WGCNA also implements a consensus version of the projective k-means algorithm in which the similarity of a node to a cluster is the average of its similarities in the individual data sets. This is admittedly rather crude: the similarities are not calibrated across data sets, and using a percentile rather than the mean can prevent the method from converging. In practice, however, it seems to work fairly well.
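The consensus similarity can be sketched in a few lines: compute node–centroid correlations separately in each data set, then average them. Again a hypothetical Python sketch with names of my own choosing, not the WGCNA code itself:

```python
import numpy as np

def consensus_similarity(data_sets, centroids, signed=False):
    # data_sets: list of (samples_s x nodes) arrays, one per data set
    # centroids: list of matching (samples_s x k) centroid arrays
    sims = []
    for X, E in zip(data_sets, centroids):
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        Es = (E - E.mean(axis=0)) / E.std(axis=0)
        C = Es.T @ Xs / X.shape[0]        # k x nodes correlation matrix
        sims.append(C if signed else np.abs(C))
    # consensus: plain (uncalibrated) average across data sets
    return np.mean(sims, axis=0)
```

Note how a node that correlates +1 with a centroid in one data set and −1 in another gets a consensus similarity of 1 in an unsigned analysis but 0 in a signed one.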

The name “projective k-means” is inspired by the mathematical notion of projective spaces. A projective space is built from a regular \(k\)-dimensional space by taking all lines running through the origin and treating each line as a single point in the projective space (each line is “projected” to a single point). Absolute correlation is a measure of similarity of such lines: no matter which two points one picks along the two lines (other than the origin), their absolute correlation will be the same. The analogy doesn’t quite work for signed networks, though, since (signed) correlations roughly correspond to distances along a (unit) sphere, not within a projective space.
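The invariance claim is easy to verify numerically: rescaling either profile by any nonzero constant (i.e. sliding along its line through the origin) leaves the absolute correlation unchanged, since a negative scale factor only flips the sign. A two-line check:

```python
import numpy as np

def cor(a, b):
    # Pearson correlation of two 1-D profiles
    return np.corrcoef(a, b)[0, 1]

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 6.0])

# other points along the same two lines through the origin:
# |cor| is unchanged even when one scale factor is negative
assert np.isclose(abs(cor(x, y)), abs(cor(-3.0 * x, 0.5 * y)))
```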