Weighted correlation analysis in general and WGCNA in particular can be applied to many problems and data sets, but certainly not to all. To set the terminology straight, recall that, in a correlation network, each node represents a variable (feature), and links represent (possibly transformed) correlations among the variables. Although one could construct the network just for the sake of the network, more commonly networks are used to study which variables (and which groups or “modules” of variables) are likely to be important for a property of the system that the network represents.
Correlation network analysis makes a few important assumptions about the system and properties one wants to study:
- Collective behaviour of multiple nodes in the network should be relevant for the property one wants to study. This is in some ways the most important assumption. Network analysis is about studying the collective behaviour and interplay of nodes; if these are irrelevant for the property one is interested in, network analysis will be of little help.
- Correlations should reflect functional relationships, at least to some degree. Correlation networks are based on correlations of the network nodes (variables); it is assumed that these correlations at least partially reflect functional relationships. For example, in gene co-expression and co-methylation networks, functionally related genes are often but not always correlated. However, correlation can also be caused by technical artifacts such as batch effects or poor normalization. For certain data types, correlations reflect inherent relationships but those relationships are not interesting — for example, genotype SNP markers are usually strongly correlated with other nearby SNPs because of linkage disequilibrium, not because of functional relationships.
- Functional relationships reflected in the correlations should be relevant for the property one wants to study. This may seem a bit redundant, but bear with me. Since correlation networks are (usually) constructed from data without regard to a particular property (technically, they are built in an unsupervised manner), the correlations will reflect the largest drivers of variability in the network node variables. If these drivers are unrelated to the property, the network analysis may find beautiful, functionally coherent and meaningful network modules that are nevertheless entirely useless for the studied property. A somewhat contrived example would be a gene expression study of a disease across several different tissues (say liver, muscle and fat tissue). Were one to combine the data from different tissues into a single data set and run a network analysis on it, the modules would mainly correspond to different tissues or perhaps major cell types. This makes biological sense and may even allow one to classify previously unstudied genes in terms of their expression across tissues, but it will likely provide no information about the disease.
- Calculating correlation on the data should make sense. (I know, sounds obvious.) Pearson correlation and other related measures (e.g., robust modifications) work well on data that can at least approximately be thought of as continuous with a symmetric distribution that is not too heavy-tailed compared to the normal. Examples of data on which correlation does not work well are binary variables, low counts (when most counts are at most 3 or so), and especially sparse counts (when most counts are 0).
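The multi-tissue example from the third assumption can be illustrated with a small simulation. This is only a sketch on made-up numbers: the tissue names come from the example above, but the baseline values, the sample size and the two "genes" are hypothetical, and plain NumPy Pearson correlation stands in for a full network analysis. Within each tissue the two simulated genes are independent, yet pooling the tissues makes them look strongly correlated because both track the tissue baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_tissue = 200  # hypothetical number of samples per tissue

# Hypothetical tissue-specific baseline expression for two genes (gene A, gene B).
baselines = {"liver": (2.0, 2.0), "muscle": (6.0, 6.0), "fat": (10.0, 10.0)}

gene_a, gene_b, within = [], [], {}
for tissue, (mean_a, mean_b) in baselines.items():
    # Within a tissue, the two genes are simulated as independent noise
    # around their tissue-specific baselines.
    a = mean_a + rng.normal(size=n_per_tissue)
    b = mean_b + rng.normal(size=n_per_tissue)
    within[tissue] = np.corrcoef(a, b)[0, 1]
    gene_a.append(a)
    gene_b.append(b)

# Pearson correlation after pooling all tissues into one data set.
pooled = np.corrcoef(np.concatenate(gene_a), np.concatenate(gene_b))[0, 1]

for tissue, r in within.items():
    print(f"within {tissue}: r = {r:+.2f}")
print(f"pooled across tissues: r = {pooled:+.2f}")
```

In this toy setup the pooled correlation is driven entirely by the differences between tissues, which is the sense in which modules built on pooled data would mostly reflect tissue identity rather than disease.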
In addition, as with most other data analysis methods, one needs a reasonable amount of reasonably clean data. Network analysis and WGCNA are no magic wands; if the data contain a lot of technical artifacts or noise, the results will not be useful*. Because network analysis is unsupervised, it is important that known and inferred sources of unwanted variation be adjusted for. Although effects of outliers can be minimized through the use of robust correlation calculations, it is better to remove samples that are clear outliers.
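To make the point about adjusting for unwanted variation concrete, here is a minimal sketch on simulated data. It assumes a single known, binary batch label and removes it by ordinary least-squares regression with NumPy; the helper `residualize`, the batch shift and the sample size are all hypothetical choices for illustration, and real data often call for dedicated batch-correction methods rather than this bare-bones approach.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 200  # hypothetical sample count

# Hypothetical known batch label (0 or 1) for each sample.
batch = np.repeat([0, 1], n_samples // 2)

# Two simulated variables that share nothing except a batch shift.
x = 3.0 * batch + rng.normal(size=n_samples)
y = 3.0 * batch + rng.normal(size=n_samples)

def residualize(values, covariate):
    """Return residuals after a least-squares fit on an intercept plus the covariate."""
    design = np.column_stack([np.ones(len(covariate)), covariate])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

r_raw = np.corrcoef(x, y)[0, 1]
r_adj = np.corrcoef(residualize(x, batch), residualize(y, batch))[0, 1]

print(f"correlation before adjustment: {r_raw:+.2f}")
print(f"correlation after regressing out batch: {r_adj:+.2f}")
```

Before adjustment, the two variables appear correlated purely because of the batch shift; after regressing the batch out, what remains is the (here absent) relationship between them.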
How many samples are a “reasonable amount”? If possible, at least 30 samples, ideally at least 50, assuming the relevant signal in the data is strong enough that this number of samples can be expected to yield significant findings. At the low end, I would not spend time analyzing a data set with fewer than 10 samples in WGCNA. A data set with fewer than 15 samples is also not likely to yield deep insights, although, depending on the experimental design, WGCNA may yield more robust and interpretable results than a differential expression analysis of individual genes.
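One way to get a feel for these sample sizes is to look at how uncertain a single correlation estimate is. The sketch below uses the standard Fisher z-transform approximation for a confidence interval around a Pearson correlation; it assumes roughly bivariate normal data, the observed correlation of 0.5 is an arbitrary illustrative value, and `fisher_ci` is a hypothetical helper rather than anything from WGCNA.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r from n samples."""
    z = math.atanh(r)                         # Fisher z-transform of r
    half_width = z_crit / math.sqrt(n - 3)    # standard error of z is 1/sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Sample sizes taken from the rough guidelines in the text.
for n in (10, 15, 30, 50):
    low, high = fisher_ci(0.5, n)
    print(f"n={n:2d}: observed r=0.5, approx. 95% CI = ({low:+.2f}, {high:+.2f})")
```

Under these assumptions, the interval for an observed correlation of 0.5 still includes zero at 10 samples, while at 50 samples it no longer does.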
*The GIGO principle applies: Garbage In, Garbage Out.