Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

Abstract Background Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics. Subsequently, we systematically analyzed the heterogeneity between the sequencing cohorts with respect to genomic annotation and common filter criteria such as minor allele frequency (MAF). Results The number of detected variants/variant classes per individual was highly dependent on the sequencing technology. We observed a statistically significant overrepresentation of variants uniquely called by a single platform, which indicates potential systematic biases. These variants were enriched in low-complexity genomic regions and simple repeats. Furthermore, estimates of allele frequency were highly discrepant for a subset of variants in pairwise comparisons between different sequencing platforms. Applying common filters, such as MAF 5% and Hardy-Weinberg equilibrium (HWE), greatly reduced the heterogeneity between cohorts but still left discrepancies of several thousand variants after filtering. Conclusion We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Our results highlight the potential benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
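The MAF and HWE filters mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the genotype-count input, the chi-square form of the HWE test, and the HWE p-value cutoff of 1e-6 are all assumptions made for illustration. Only the 5% MAF threshold comes from the abstract.

```python
from math import erfc, sqrt


def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Frequency of the rarer allele, from genotype counts at one biallelic site."""
    n_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    alt_freq = (n_het + 2 * n_alt_hom) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)


def hwe_chi_square_p(n_ref_hom, n_het, n_alt_hom):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    if min(expected) == 0:
        return 1.0  # monomorphic site: the test is not informative
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2.0))


def passes_filters(counts, maf_min=0.05, hwe_p_min=1e-6):
    """Keep a site if MAF >= maf_min and it does not strongly violate HWE.

    hwe_p_min is a hypothetical cutoff; the abstract does not state one.
    """
    return (minor_allele_frequency(*counts) >= maf_min
            and hwe_chi_square_p(*counts) >= hwe_p_min)


if __name__ == "__main__":
    # Genotype counts (ref/ref, ref/alt, alt/alt) for a cohort of 99 subjects.
    for counts in [(25, 50, 24), (97, 2, 0), (50, 0, 49)]:
        print(counts, passes_filters(counts))
```

An exact test is often preferred over the chi-square approximation for small cohorts or rare alleles; the approximation is used here only to keep the sketch dependency-free.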

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

10.21203/rs.3.rs-50691/v2 ◽

2020 ◽

Author(s):

Stephan Weißbach ◽

Stanislav Jur'evic Sys ◽

Charlotte Hewel ◽

Hristo Todorov ◽

Susann Schweiger ◽

...

Keyword(s):

Next Generation Sequencing ◽

Gc Content ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Illumina Hiseq ◽

Cross Sectional ◽

Single Nucleotide ◽

Alu Elements ◽

Generation Sequencing

Abstract Background Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. Results The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (InDels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in InDel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

BMC Genomics ◽

10.1186/s12864-020-07362-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Stephan Weißbach ◽

Stanislav Sys ◽

Charlotte Hewel ◽

Hristo Todorov ◽

Susann Schweiger ◽

...

Keyword(s):

Next Generation Sequencing ◽

Gc Content ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Illumina Hiseq ◽

Cross Sectional ◽

Single Nucleotide ◽

Alu Elements ◽

Generation Sequencing

Abstract Background Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform’s impact. Results The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
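As a rough illustration of what concordance between two setups and variants uniquely called by a single setup can mean in practice, the sketch below compares call sets represented as `(chrom, pos, ref, alt)` tuples. The Jaccard index as the concordance measure, the function names, and the toy call sets are assumptions for illustration; the abstract does not state which measure the authors used.

```python
def is_snp(variant):
    """True for single-base substitutions; everything else is treated as an indel."""
    _, _, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def jaccard(a, b):
    """Shared calls divided by all calls made by either setup."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def concordance_by_class(calls_a, calls_b):
    """Jaccard concordance computed separately for SNPs and indels."""
    result = {}
    for name, keep in (("snp", is_snp), ("indel", lambda v: not is_snp(v))):
        subset_a = {v for v in calls_a if keep(v)}
        subset_b = {v for v in calls_b if keep(v)}
        result[name] = jaccard(subset_a, subset_b)
    return result


def unique_to_each(call_sets):
    """Map each setup name to the variants that no other setup called."""
    unique = {}
    for name, calls in call_sets.items():
        others = set().union(*(c for n, c in call_sets.items() if n != name))
        unique[name] = calls - others
    return unique


if __name__ == "__main__":
    # Hypothetical toy call sets for one individual.
    hiseq_x = {("1", 100, "A", "G"), ("1", 200, "AT", "A"), ("2", 50, "C", "T")}
    complete_genomics = {("1", 100, "A", "G"), ("1", 201, "T", "TA"), ("2", 50, "C", "T")}
    print(concordance_by_class(hiseq_x, complete_genomics))
    print(unique_to_each({"hiseq_x": hiseq_x, "cg": complete_genomics}))
```

Exact tuple matching understates indel agreement when two pipelines write the same event at different positions, so real comparisons usually left-align and normalize variants first.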

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

10.21203/rs.3.rs-50691/v3 ◽

2021 ◽

Author(s):

Stephan Weißbach ◽

Stanislav Jur'evic Sys ◽

Charlotte Hewel ◽

Hristo Todorov ◽

Susann Schweiger ◽

...

Keyword(s):

Next Generation Sequencing ◽

Gc Content ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Illumina Hiseq ◽

Cross Sectional ◽

Single Nucleotide ◽

Alu Elements ◽

Generation Sequencing

Abstract Background Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. Results The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.

Next-Generation Sequencing for Congenital Nephrotic Syndrome: A Multi-Center Cross-Sectional Study from India

Indian Pediatrics ◽

10.1007/s13312-021-2215-5 ◽

2021 ◽

Vol 58 (5) ◽

pp. 445-451

Author(s):

Aditi Joshi ◽

Aditi Sinha ◽

Aakanksha Sharma ◽

Uzma Shamim ◽

...

Keyword(s):

Nephrotic Syndrome ◽

Next Generation Sequencing ◽

Congenital Nephrotic Syndrome ◽

Cross Sectional Study ◽

Next Generation ◽

Sectional Study ◽

Cross Sectional ◽

Generation Sequencing

The Accuracy, Feasibility and Challenges of Sequencing Short Tandem Repeats Using Next-Generation Sequencing Platforms

PLoS ONE ◽

10.1371/journal.pone.0113862 ◽

2014 ◽

Vol 9 (12) ◽

pp. e113862 ◽

Cited By ~ 15

Author(s):

Monika Zavodna ◽

Andrew Bagshaw ◽

Rudiger Brauning ◽

Neil J. Gemmell

Keyword(s):

Next Generation Sequencing ◽

Short Tandem Repeats ◽

Tandem Repeats ◽

Next Generation ◽

Sequencing Platforms ◽

Generation Sequencing ◽

Short Tandem

Estimating Allele Frequency from Next-Generation Sequencing of Pooled Mitochondrial DNA Samples

Frontiers in Genetics ◽

10.3389/fgene.2011.00051 ◽

2011 ◽

Vol 2 ◽

Cited By ~ 5

Author(s):

Tao Wang ◽

Kith Pradhan ◽

Kenny Ye ◽

Lee-Jun Wong ◽

Thomas E. Rohan

Keyword(s):

Mitochondrial Dna ◽

Next Generation Sequencing ◽

Allele Frequency ◽

Next Generation ◽

Generation Sequencing

Next-generation sequencing of adrenocortical carcinoma reveals new routes to targeted therapies

Journal of Clinical Pathology ◽

10.1136/jclinpath-2014-202514 ◽

2014 ◽

Vol 67 (11) ◽

pp. 968-973 ◽

Cited By ~ 33

Author(s):

J S Ross ◽

K Wang ◽

J V Rand ◽

L Gay ◽

M J Presta ◽

...

Keyword(s):

Next Generation Sequencing ◽

Adrenocortical Carcinoma ◽

Single Case ◽

Next Generation ◽

Illumina Hiseq ◽

Modest Improvement ◽

Aggressive Form ◽

Comprehensive Genomic Profiling ◽

Cytotoxic Therapies ◽

Generation Sequencing

Aims Adrenocortical carcinoma (ACC) carries a poor prognosis and current systemic cytotoxic therapies result in only modest improvement in overall survival. In this retrospective study, we performed a comprehensive genomic profiling of 29 consecutive ACC samples to identify potential targets of therapy not currently searched for in routine clinical practice. Methods DNA from 29 ACC was sequenced to high, uniform coverage (Illumina HiSeq) and analysed for genomic alterations (GAs). Results At least one GA was found in 22 (76%) ACC (mean 2.6 alterations per ACC). The most frequent GAs were in TP53 (34%), NF1 (14%), CDKN2A (14%), MEN1 (14%), CTNNB1 (10%) and ATM (10%). APC, CCND2, CDK4, DAXX, DNMT3A, KDM5C, LRP1B, MSH2 and RB1 were each altered in two cases (7%) and EGFR, ERBB4, KRAS, MDM2, NRAS, PDGFRB, PIK3CA, PTEN and PTCH1 were each altered in a single case (3%). In 17 (59%) of ACC, at least one GA was associated with an available therapeutic or a mechanism-based clinical trial. Conclusions Next-generation sequencing can discover targets of therapy for relapsed and metastatic ACC and shows promise to improve outcomes for this aggressive form of cancer.

High throughput crop genome genotyping by a combination of pool next generation sequencing and haplotype-based data processing

10.21203/rs.3.rs-415602/v1 ◽

2021 ◽

Author(s):

Michael Schneider ◽

Asis Shrestha ◽

Agim Ballvora ◽

Jens Leon

Keyword(s):

Next Generation Sequencing ◽

Allele Frequency ◽

Frequency Estimation ◽

Whole Genome ◽

Next Generation ◽

Conservation Genomics ◽

High Coverage ◽

Allele Frequency Estimation ◽

Low Coverage ◽

Generation Sequencing

Abstract Background The identification of environmentally specific alleles and the observation of evolutionary processes is a goal of conservation genomics. By tracking generational changes of allele frequencies in populations, questions regarding effective population size, gene flow, drift, and selection can be addressed. Observing such effects is often a trade-off between cost and resolution when a sufficiently large sample of genotypes has to be genotyped at many loci. Pool genotyping approaches can achieve high resolution and precision in allele frequency estimation when high-coverage sequencing is used. Still, high-coverage pool sequencing of big genomes comes with high costs. Results Here we present a reliable method to estimate a barley population's allele frequency from low-coverage sequencing. Three hundred genotypes were sampled from a barley backcross population to estimate the entire population's allele frequency. The allele frequency estimation accuracy and yield were compared for three next generation sequencing methods. To obtain accurate allele frequency estimates at low sequencing coverage, a haplotyping approach was performed. Low-coverage allele frequencies of positionally connected single polymorphisms were aggregated to a single haplotype allele frequency, resulting in two to 271 times higher depth and increased precision. We compared different haplotyping tactics, showing that gene and chip marker-based haplotypes perform on par with or better than simple contig haplotype windows. The comparison of multiple pool samples and the referencing against an individual sequencing approach revealed that whole genome pool resequencing had the highest correlation to individual genotyping (up to 0.97), while transcriptomics and genotyping by sequencing indicated higher error rates and lower correlations. Conclusion The proposed method allows the allele frequency of populations to be identified with high accuracy at low cost. This is particularly interesting for conservation genomics in species with big genomes, like barley or wheat. Whole genome low-coverage resequencing at 10x coverage can deliver a highly accurate estimation of the allele frequency when a loci-based haplotyping approach is applied. Using annotated haplotypes allows capitalizing on biological background and statistical robustness.
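The idea of aggregating low-coverage allele frequencies of linked polymorphisms into one haplotype frequency can be sketched as follows. This is a simplified illustration, not the authors' method: it assumes that every SNP in a block has been polarized so that the counted allele marks the same parental (here called "donor") haplotype, and it simply pools read counts across the block. The function name and the example counts are hypothetical.

```python
def pooled_haplotype_frequency(snp_counts):
    """Pool read counts over all SNPs of one haplotype block.

    snp_counts: iterable of (donor_reads, total_reads) per SNP, where
    donor_reads counts reads carrying the allele of the same parental haplotype.
    Returns (frequency, pooled_depth); frequency is None when no reads cover the block.
    """
    counts = list(snp_counts)
    donor = sum(d for d, _ in counts)
    depth = sum(t for _, t in counts)
    if depth == 0:
        return None, 0
    return donor / depth, depth


if __name__ == "__main__":
    # Three linked SNPs, each covered by only 8 to 12 reads in the pool.
    block = [(1, 8), (3, 12), (2, 10)]
    per_snp = [d / t for d, t in block]
    freq, depth = pooled_haplotype_frequency(block)
    print("per-SNP estimates:", per_snp)
    print("pooled estimate:", freq, "at depth", depth)
```

Pooling trades positional resolution for depth: the block estimate rests on the summed coverage of all its SNPs, which is how a haplotype can reach a multiple of the single-site depth.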

Diverse spectrum of rare deafness genes underlies early-childhood hearing loss in Japanese patients: a cross-sectional, multi-center next-generation sequencing study

Orphanet Journal of Rare Diseases ◽

10.1186/1750-1172-8-172 ◽

2013 ◽

Vol 8 (1) ◽

pp. 172 ◽

Cited By ~ 57

Author(s):

Hideki Mutai ◽

Naohiro Suzuki ◽

Atsushi Shimizu ◽

Chiharu Torii ◽

Kazunori Namba ◽

...

Keyword(s):

Early Childhood ◽

Hearing Loss ◽

Next Generation Sequencing ◽

Japanese Patients ◽

Next Generation ◽

Cross Sectional ◽

Generation Sequencing ◽

Childhood Hearing Loss

Hi-C chromosome conformation capture sequencing of avian genomes using the BGISEQ-500 platform

GigaScience ◽

10.1093/gigascience/giaa087 ◽

2020 ◽

Vol 9 (8) ◽

Author(s):

Marcela Sandoval-Velasco ◽

Juan Antonio Rodríguez ◽

Cynthia Perez Estrada ◽

Guojie Zhang ◽

Erez Lieberman Aiden ◽

...

Keyword(s):

Next Generation Sequencing ◽

High Throughput Sequencing ◽

Data Generation ◽

Next Generation ◽

Sequencing Data ◽

Yield Data ◽

Chromosome Conformation ◽

Sequencing Platform ◽

Sequencing Platforms ◽

Generation Sequencing

Abstract Background Hi-C experiments couple DNA-DNA proximity with next-generation sequencing to yield an unbiased description of genome-wide interactions. Previous methods describing Hi-C experiments have focused on the industry-standard Illumina sequencing. With new next-generation sequencing platforms such as BGISEQ-500 becoming more widely available, protocol adaptations to fit platform-specific requirements are useful to give increased choice to researchers who routinely generate sequencing data. Results We describe an in situ Hi-C protocol adapted to be compatible with the BGISEQ-500 high-throughput sequencing platform. Using zebra finch (Taeniopygia guttata) as a biological sample, we demonstrate how Hi-C libraries can be constructed to generate informative data using the BGISEQ-500 platform, following circularization and DNA nanoball generation. Our protocol is a modification of an Illumina-compatible method, based around blunt-end ligations in library construction, using un-barcoded, distally overhanging double-stranded adapters, followed by amplification using indexed primers. The resulting libraries are ready for circularization and subsequent sequencing on the BGISEQ series of platforms and yield data similar to what can be expected using Illumina-compatible approaches. Conclusions Our straightforward modification to an Illumina-compatible in situ Hi-C protocol enables data generation on the BGISEQ series of platforms, thus expanding the options available for researchers who wish to utilize the powerful Hi-C techniques in their research.
