Analyzing whole genome bisulfite sequencing data from highly divergent genotypes
Abstract
In the study of DNA methylation, genetic variation between species, strains, or individuals can result in CpG sites that are exclusive to a subset of samples, and insertions and deletions can rearrange the spatial distribution of CpGs. How to account for this variation in an analysis of the interplay between sequence variation and DNA methylation is not well understood, especially when the number of CpG differences between samples is large. Here we use whole-genome bisulfite sequencing data on two highly divergent inbred mouse strains to study this problem. We find that while the large number of strain-specific CpGs necessitates considerations regarding the reference genomes used during alignment, properties such as CpG density are surprisingly conserved across the genome. We introduce a method for including strain-specific CpGs in differential analysis, and show that accounting for strain-specific CpGs increases the power to find differentially methylated regions between the strains. Our method uses smoothing to impute methylation levels at strain-specific sites, thereby allowing strain-specific CpGs to contribute to the analysis, and also allowing us to account for differences in the spatial occurrences of CpGs. Our results have implications for analysis of genetic variation and DNA methylation using bisulfite-converted DNA.
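To make the idea of imputing methylation at strain-specific sites by smoothing concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple coverage-weighted tricube kernel smoother as a stand-in for whatever smoother the method actually uses. The function names, the `bandwidth` parameter, and the toy coordinates and read counts are all hypothetical and chosen only for illustration.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u (zero when |u| >= 1)."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def smooth_methylation(positions, meth_counts, coverage, eval_positions, bandwidth=1000):
    """Estimate methylation at eval_positions from one sample's observed CpGs.

    Each observed CpG contributes its methylated and total read counts,
    weighted by a tricube kernel of its distance to the evaluation position.
    Returns None where no covered CpG lies within the bandwidth.
    """
    estimates = []
    for x in eval_positions:
        num = 0.0  # kernel-weighted methylated reads
        den = 0.0  # kernel-weighted total reads
        for pos, m, n in zip(positions, meth_counts, coverage):
            w = tricube((pos - x) / bandwidth)
            num += w * m
            den += w * n
        estimates.append(num / den if den > 0 else None)
    return estimates


# Hypothetical toy data: strain A has CpGs at 100, 300 and 500.
# Position 400 is a CpG only in strain B, so strain A has no reads there.
a_pos = [100, 300, 500]
a_meth = [8, 5, 2]     # methylated read counts (made up)
a_cov = [10, 10, 10]   # total read counts (made up)

# Evaluate strain A's smoothed profile on the union of CpG sites,
# which imputes a value at the strain-B-specific position 400.
union_sites = [100, 300, 400, 500]
a_smoothed = smooth_methylation(a_pos, a_meth, a_cov, union_sites, bandwidth=250)
```

In this toy case the imputed value at position 400 depends only on the CpGs at 300 and 500, which are equidistant and therefore equally weighted, giving (5 + 2) / (10 + 10) = 0.35. Evaluating every sample on the same union of sites is what would let strain-specific CpGs enter a differential comparison between the strains.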