Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
Abstract Background Next Generation Sequencing (NGS) is the fundament of various studies providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq and (iii) Complete Genomics. Consequently, we systematically analyzed the heterogeneity between the sequencing cohorts with respect to genomic annotation and common filter criteria like minimum allele frequency (MAF). Results The number of detected variants/variant classes per individual was highly dependent on the sequencing technology. We observed a statistically significant overrepresentation of variants uniquely called by a single platform which indicates potential systematic biases. These variants were enriched in low complexity genomic regions and simple repeats. Furthermore, estimates of allele frequency were highly discrepant for a subset of variants in pairwise comparisons between different sequencing platforms. Applying common filters – such as MAF 5% and HWE- greatly reduced the heterogeneity between cohorts but still left discrepancies of several thousand variants after filtering.Conclusion We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Our results highlight the potential benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.