“Blockwise” network analysis of large data

A straightforward weighted correlation network analysis of large data (tens of thousands of nodes or more) is quite memory hungry. Because the analysis uses a correlation or similarity matrix of all nodes, for n network nodes the memory requirement scales as n². In R, one has to multiply that by 8 bytes for each (double-precision) number, and then again by a factor of 2-3 to allow for the copying operations the R interpreter needs to do while executing operations on the matrices. When all is said and done, analyzing a set of 20,000 variables or nodes (for example, genome-wide data summarized to genes) requires between 8 and 16 GB of memory; 40,000 nodes (for example, expression data summarized to transcripts) would increase the requirement to 32-64 GB. My personal experience is that 40k transcripts pushes a 32 GB memory space very hard. A full network analysis of an Illumina 450k methylation data set with its nearly 500,000 probes would theoretically require some 7 TB of memory.
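
The arithmetic above can be reproduced with a few lines of R. This is just a back-of-the-envelope sketch (the helper name is made up, not a WGCNA function), and it counts a single n × n matrix, so the figures come out somewhat below the ranges quoted above, presumably because a real analysis holds more than one matrix of this size (for example, adjacency and TOM) at some points.

```r
# Rough memory estimate for one dense n x n matrix of doubles in R,
# with the 2-3x allowance for copies described above.
# 'estimateNetworkMemoryGB' is a made-up helper name, not a WGCNA function.
estimateNetworkMemoryGB <- function(nNodes, overhead = c(2, 3)) {
  bytes <- as.numeric(nNodes)^2 * 8    # 8 bytes per double-precision number
  round(bytes * overhead / 2^30, 1)    # convert to gigabytes
}

estimateNetworkMemoryGB(20000)    # about    6 -    9 GB for one matrix
estimateNetworkMemoryGB(40000)    # about   24 -   36 GB
estimateNetworkMemoryGB(500000)   # about 3700 - 5600 GB, i.e. several TB
```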

Are large data and small RAM a no-go for WGCNA? Apart from persuading the account manager to purchase a larger computer or access to one, there are at least two options for tackling large data with WGCNA. The first option is to reduce the number of nodes (variables) in the data, by filtering out uninformative variables or combining (“collapsing”) multiple variables into one. This approach is often effective for gene expression data. But sometimes one simply cannot reduce the number of variables sufficiently without losing too much information. What then? Despair not.
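
If the first option applies, a minimal sketch of variance-based filtering might look like the code below; datExpr and nKeep are placeholder names, and keeping the most variable genes is just one common heuristic for dropping uninformative variables. For the “collapsing” route, WGCNA’s collapseRows function, which maps multiple probes to a single gene-level profile, is one option.

```r
# 'datExpr' is assumed to be a numeric matrix or data frame with
# samples in rows and variables (genes, probes, ...) in columns.
nKeep <- 20000                                # target number of variables to keep
vars  <- apply(datExpr, 2, var, na.rm = TRUE) # per-variable variance
keep  <- rank(-vars, ties.method = "first") <= nKeep
datExprFiltered <- datExpr[, keep]
```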

The second option is a trick implemented in WGCNA that allows an approximate analysis of large data sets on computers with modest memory. The trick consists of splitting (“pre-clustering”) the network nodes into blocks such that nodes in different blocks are only weakly correlated, weakly enough that the between-block correlations can be neglected. Since the between-block correlations are assumed to be negligible, the network analysis is carried out in each block separately. After modules are identified in all blocks, their eigengenes are calculated and modules with highly correlated eigengenes are merged (possibly across blocks).
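
The procedure just described can be sketched using WGCNA’s lower-level building blocks. This is only a rough illustration, not the actual implementation inside WGCNA’s blockwise functions; the input datExpr, the soft-thresholding power of 6, the parameter values, and the assumption that projectiveKMeans returns the block labels in its clusters component are placeholders and assumptions for the sake of the example.

```r
library(WGCNA)

# 'datExpr': samples x genes numeric matrix (placeholder).
# Step 1: pre-cluster genes into blocks of manageable size.
preClust    <- projectiveKMeans(datExpr, preferredSize = 5000)
blockLabels <- preClust$clusters     # assumed to hold one block label per gene

moduleLabels <- rep(0, ncol(datExpr))
for (b in unique(blockLabels)) {
  inBlock <- which(blockLabels == b)
  # Step 2: network construction and module detection within the block only.
  TOM      <- TOMsimilarityFromExpr(datExpr[, inBlock], power = 6)
  dissTOM  <- 1 - TOM
  geneTree <- hclust(as.dist(dissTOM), method = "average")
  blockMods <- cutreeDynamic(dendro = geneTree, distM = dissTOM,
                             deepSplit = 2, minClusterSize = 30)
  # Offset module labels so they do not collide across blocks (0 = unassigned).
  moduleLabels[inBlock] <- ifelse(blockMods == 0, 0, blockMods + max(moduleLabels))
}

# Step 3: calculate eigengenes and merge modules with highly correlated
# eigengenes, possibly across blocks.
moduleColors <- labels2colors(moduleLabels)
merged       <- mergeCloseModules(datExpr, moduleColors, cutHeight = 0.25)
mergedColors <- merged$colors
```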

This block-by-block (“blockwise”) analysis limits memory requirements to the square of the size of the largest block. When the largest block is say 1/10th of the number of nodes, the memory requirement goes down by a factor of 100 – that outlandish-looking network analysis of an Illumina 450k data set suddenly becomes doable on a reasonably modern server with say 96 GB of RAM. A beneficial “side effect” is that network construction, particularly TOM calculation, also runs much faster. TOM calculation in a block of size n_b takes on the order of n_b³ operations; if the block sizes are about 1/10th of the number of all nodes, a blockwise TOM calculation will require 10 × (n/10)³ = n³/100 operations, or only 1/100 of the time the TOM calculation for data in one large block would take. On the flip side, the pre-clustering also takes some time, but it is usually faster than a full network analysis would be if it were feasible.
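
To attach concrete numbers to the 450k example (the same back-of-the-envelope formulas as before, not measurements), assuming the largest block holds roughly 46,000 probes:

```r
n         <- 500000                   # probes in the full data set
blockSize <- 46000                    # assumed size of the largest block
nBlocks   <- ceiling(n / blockSize)   # about 11 blocks

# Memory for the largest block's matrix, with the 2-3x copying allowance:
blockSize^2 * 8 * c(2, 3) / 2^30      # roughly 32-47 GB, fits into 96 GB of RAM

# Rough TOM operation counts, single block vs. blockwise:
n^3                                   # about 1.3e17
nBlocks * blockSize^3                 # about 1.1e15, i.e. ~100x fewer
```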

Tips for practical use

The pre-clustering for individual data sets is implemented in function projectiveKMeans. Function consensusProjectiveKMeans handles consensus pre-clustering of multiple data sets. These two functions return a vector of block labels for each variable (node) in the input data; nodes with the same label should be put into one block in subsequent network analyses. Many users will want to use the “one-stop shop” blockwise network analysis functions blockwiseModules and blockwiseConsensusModules (for consensus network analysis). The WGCNA tutorials show the most common use of these two functions.
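
For orientation, a typical call along the lines of the WGCNA tutorials might look like the sketch below; the data set name, the soft-thresholding power of 6, and the other parameter values are placeholders that need to be chosen for the data at hand (maxBlockSize in particular is discussed in the next paragraph).

```r
library(WGCNA)

# 'datExpr': samples x genes numeric matrix (placeholder name).
net <- blockwiseModules(datExpr,
                        power           = 6,        # soft-thresholding power
                        maxBlockSize    = 20000,    # as large as memory allows
                        TOMType         = "unsigned",
                        minModuleSize   = 30,
                        mergeCutHeight  = 0.25,
                        numericLabels   = TRUE,
                        saveTOMs        = TRUE,
                        saveTOMFileBase = "TOM-blockwise",
                        verbose         = 3)

moduleLabels <- net$colors                  # module label for every gene
moduleColors <- labels2colors(net$colors)   # numeric labels converted to colors
table(net$blocks)                           # how the genes were split into blocks
```

The returned list also contains the per-block gene dendrograms (net$dendrograms) and the indices of the genes in each block (net$blockGenes), which the per-block plotting example further below relies on.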

I emphasize that the blockwise analysis creates an approximation to the network that would result from a single-block analysis. The approximation is often very good, but the modules are not quite the same. If possible, I recommend running the analysis in a single block; if not, use the largest blocks your computer can handle. Block size in the “blockwise” family of functions is controlled by the argument maxBlockSize; for historical reasons (backward compatibility and reproducibility), the default is fixed at 5000, a relatively low value by today’s standards. The value should be raised as high as possible: 16 GB of memory should be able to handle up to about 24,000 nodes, 32 GB should be enough (perhaps barely so) for 40,000, and so on.

Since most analysis steps after module identification use only the module labels but not the full TOM matrix, it does not matter whether the modules were identified in a single block or in multiple blocks. There are a few exceptions, though. In a standard WGCNA analysis, it is fairly common to plot a gene clustering tree (dendrogram) with module colors and possibly other gene-level information below the dendrogram. Users also sometimes want to plot a heatmap of the TOM matrix. These plots can only be created separately for each block; there is no meaningful way to create a single gene dendrogram or TOM heatmap that combines information from multiple blocks.
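
As an illustration of the per-block plotting, the dendrogram-plus-colors plot can be drawn for each block returned by blockwiseModules, in the spirit of the WGCNA tutorials; net and moduleColors are assumed to come from a call like the one sketched earlier.

```r
# Plot the gene dendrogram and module colors for each block separately.
for (block in seq_along(net$dendrograms)) {
  plotDendroAndColors(net$dendrograms[[block]],
                      moduleColors[net$blockGenes[[block]]],
                      groupLabels = "Module colors",
                      dendroLabels = FALSE, hang = 0.03,
                      addGuide = TRUE, guideHang = 0.05,
                      main = paste("Gene dendrogram and module colors, block", block))
}
```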