Predicting gene expression using DNA methylation in two human populations
Background. DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially its negative regulation of expression in the promoter region. However, its correlation with gene expression at the population level has not been well studied. In particular, it is unclear whether an individual's genome-wide DNA methylation profile can predict his or her gene expression profile. Previous studies were mostly limited to association analyses between methylation at single CpG sites and gene expression. It is not known whether the DNA methylation of a gene has enough predictive power to serve as a surrogate for gene expression in existing human study cohorts that have DNA samples but no RNA samples. Results. We studied two human population datasets, adipose tissue from the Multiple Tissue Human Expression Resource (MuTHER) project and peripheral blood mononuclear cells (PBMCs) from asthmatic and healthy individuals, to predict gene expression from the methylation of all CpG sites in the gene region. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. LASSO regression performed best among these methods. However, even with LASSO regression, the prediction R² was very small for the majority of genes, and only about one thousand genes had a prediction R² greater than 0.1. GO term and pathway analyses of these more predictable genes showed that they are enriched for immune and defense genes. Conclusion. In human populations, DNA methylation of CpG sites in the gene region has weak predictive power for gene expression. The relatively more predictable genes tend to be defense and immune genes.
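To make the per-gene prediction setup concrete, the following is a minimal sketch of LASSO regression for one gene, with held-out R² as the performance measure. It is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic (one informative CpG site with a negative effect, 19 uninformative ones), the function names `fit_lasso`, `predict`, and `r_squared` are hypothetical, the penalty `alpha=0.03` is an arbitrary choice, and the solver is a plain coordinate-descent implementation written with the Python standard library only.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_lasso(X, y, alpha, n_sweeps=200):
    """Coordinate descent for (1/(2n)) * ||y - b0 - X b||^2 + alpha * ||b||_1."""
    n, p = len(X), len(X[0])
    # Center predictors and response so the intercept is not penalized.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residuals with all coefficients at 0
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_bj
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(intercept, beta, X):
    return [intercept + sum(b * x for b, x in zip(beta, row)) for row in X]


def r_squared(y_true, y_pred):
    """Prediction R^2: 1 - SS_res / SS_tot, computed on held-out samples."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): 120 individuals, 20 CpG sites for one gene.
# Methylation values lie in [0, 1]; only CpG 0 affects expression, negatively.
rng = random.Random(0)
n_people, n_cpg = 120, 20
X = [[rng.random() for _ in range(n_cpg)] for _ in range(n_people)]
y = [5.0 - 3.0 * row[0] + rng.gauss(0.0, 0.3) for row in X]

# Fit on the first 80 individuals, evaluate on the remaining 40.
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]
intercept, beta = fit_lasso(X_train, y_train, alpha=0.03)
r2_test = r_squared(y_test, predict(intercept, beta, X_test))
n_selected = sum(1 for b in beta if b != 0.0)

print("held-out R^2:", round(r2_test, 3))
print("CpG sites with non-zero coefficients:", n_selected, "of", n_cpg)
```

The synthetic gene here is deliberately easy to predict; the abstract reports that most real genes are not. In practice the penalty strength would normally be tuned, for example by cross-validation, rather than fixed as in this sketch.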