BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Bioinformatics ◽ 
2019 ◽ 
Vol 35 (22) ◽  
pp. 4806-4808 ◽  
Author(s):  
Hein Chun ◽  
Sangwoo Kim

Abstract
Summary: Mislabeling during next-generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, so a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches in a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on a genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy on real WES, RNA-Seq and targeted sequencing cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events.
Availability and implementation: BAMixChecker is available at https://github.com/heinc1010/BAMixChecker
Supplementary information: Supplementary data are available at Bioinformatics online.
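
As a concrete illustration of the matching scheme the abstract describes, the sketch below pairs samples by genotype concordance over a common SNP set and reports samples without a partner as orphans. It is a minimal sketch of the idea only, not BAMixChecker's implementation; the function names, the 0/1/2 genotype encoding and the 0.8 threshold are illustrative assumptions.

```python
# Minimal sketch of genotype-concordance sample matching (illustrative, not
# BAMixChecker's actual code or API).
from itertools import combinations

def concordance(gt_a, gt_b):
    """Fraction of informative SNP sites where two samples share a genotype."""
    shared = [(a, b) for a, b in zip(gt_a, gt_b) if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)

def pair_samples(genotypes, threshold=0.8):
    """Pair samples whose concordance exceeds `threshold`; the rest are orphans.

    `genotypes` maps sample name -> list of genotype calls (0/1/2 allele
    dosages, None for uncovered sites) over a common SNP set.
    """
    matched, pairs = set(), []
    for s1, s2 in combinations(genotypes, 2):
        score = concordance(genotypes[s1], genotypes[s2])
        if score >= threshold:
            pairs.append((s1, s2, score))
            matched.update((s1, s2))
    orphans = [s for s in genotypes if s not in matched]
    return pairs, orphans

# Toy cohort: T1/N1 share genotypes (matched pair); T2 and N_x match nothing
# above threshold, so both are flagged as orphans (a potential swap).
cohort = {
    "T1": [0, 1, 2, 1, 0, 2], "N1": [0, 1, 2, 1, 0, 2],
    "T2": [2, 2, 0, 1, 1, 0], "N_x": [0, 0, 1, 2, 2, 1],
}
print(pair_samples(cohort))
```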

VEF: a variant filtering tool based on ensemble methods

Bioinformatics ◽ 
2019 ◽ 
Vol 36 (8) ◽  
pp. 2328-2336
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract
Motivation: Variant call sets produced by current genomic analysis pipelines contain many incorrectly called variants. These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are highly user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision-tree ensemble methods that overcomes the main drawbacks of VQSR and HF. In contrast to these methods, we treat filtering as a supervised learning problem, using variant call data with known ‘true’ variants, i.e. a gold standard, for training. Once trained, VEF can be applied directly to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they share feature characteristics).
Results: For the analysis, we used whole-genome sequencing (WGS) human datasets for which gold standards are available. We show on these data that VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since training needs to be performed only once, VEF offers a significant saving in running time compared with VQSR (approximately 4 versus 50 min for filtering the single-nucleotide polymorphisms of a WGS human sample).
Availability and implementation: Code and scripts are available at github.com/ChuanyiZ/vef.
Supplementary information: Supplementary data are available at Bioinformatics online.
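
The supervised-filtering idea lends itself to a short sketch: train a decision-tree ensemble on variants whose true/false status is known from a gold standard, then score the variants of a new VCF produced by the same pipeline. The sketch below uses a scikit-learn random forest as a stand-in, assuming a few typical GATK INFO fields as features with median imputation for missing values; it illustrates the approach, not VEF's actual feature set or model (see github.com/ChuanyiZ/vef for the real code).

```python
# Illustrative sketch of supervised variant filtering with a tree ensemble.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["QD", "FS", "MQ", "MQRankSum", "ReadPosRankSum"]  # assumed INFO fields

def to_matrix(records, fill):
    """Build a feature matrix from per-variant INFO dicts, filling missing
    values with `fill` (one value per feature, e.g. training-set medians)."""
    X = np.array([[r.get(f, np.nan) for f in FEATURES] for r in records], float)
    idx = np.where(np.isnan(X))
    X[idx] = np.take(fill, idx[1])
    return X

# Toy training variants from a sample with a gold standard; label 1 = true call.
train_records = [
    {"QD": 25.0, "FS": 1.2, "MQ": 60.0, "MQRankSum": 0.5, "ReadPosRankSum": 0.3},
    {"QD": 2.1, "FS": 40.0, "MQ": 31.0, "MQRankSum": -8.2, "ReadPosRankSum": -4.0},
]
y_train = np.array([1, 0])
raw = np.array([[r.get(f, np.nan) for f in FEATURES] for r in train_records], float)
medians = np.nanmedian(raw, axis=0)  # imputation values learned from training data

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(to_matrix(train_records, medians), y_train)

# Score variants from a new VCF made by the same pipeline; one field is missing
# and gets imputed, mirroring the robustness to missing features noted above.
test_records = [{"QD": 20.0, "FS": 3.0, "MQ": 58.0, "MQRankSum": 0.1}]
print(clf.predict_proba(to_matrix(test_records, medians))[:, 1])  # P(true variant)
```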


Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

BMC Genomics ◽ 
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Stephan Weißbach ◽  
Stanislav Sys ◽  
Charlotte Hewel ◽  
Hristo Todorov ◽  
Susann Schweiger ◽  
...  

Abstract
Background: Next Generation Sequencing (NGS) is the foundation of a wide range of studies, providing insights into questions in biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. To methodically investigate the magnitude of systematic errors in single-nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects, each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq and (iii) Complete Genomics, and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the impact of the sequencing platform.
Results: The number of detected variants and variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) showed lower concordance than single-nucleotide polymorphisms (SNPs). The discrepancies in absolute indel numbers were particularly prominent in introns, Alu elements, simple repeats and regions with medium GC content. Notably, reprocessing the sequencing data following GATK best-practice recommendations considerably improved concordance between the respective setups.
Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
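
At its core, the cohort-level comparison reduces to set operations over per-setup variant calls: count variants unique to a single setup and measure pairwise overlap. The Python sketch below shows the idea for one subject; the setup names, variant keys and the Jaccard measure are illustrative assumptions, not the paper's exact analysis.

```python
# Sketch of per-subject concordance analysis across sequencing/analysis setups.
from itertools import combinations

calls = {  # setup -> set of (chrom, pos, ref, alt); toy values for illustration
    "HiSeqX_native": {("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("2", 500, "G", "A")},
    "HiSeq_native":  {("1", 1000, "A", "G"), ("2", 500, "G", "A"), ("2", 900, "T", "C")},
    "CG_native":     {("1", 1000, "A", "G"), ("1", 2000, "C", "T")},
}

# Variants called by exactly one setup (candidates for systematic bias).
all_variants = set.union(*calls.values())
unique = {v for v in all_variants if sum(v in s for s in calls.values()) == 1}
print(f"setup-unique variants: {len(unique)} / {len(all_variants)}")

# Pairwise Jaccard concordance between setups.
for a, b in combinations(calls, 2):
    jac = len(calls[a] & calls[b]) / len(calls[a] | calls[b])
    print(f"{a} vs {b}: {jac:.2f}")
```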


2018 ◽  
Author(s):  
Shashidhar Ravishankar ◽  
Sarah E. Schmedes ◽  
Dhruviben S. Patel ◽  
Mateusz Plucinski ◽  
Venkatachalam Udhayakumar ◽  
...  

Abstract
Rapid advancements in next-generation sequencing (NGS) technologies have led to the development of numerous bioinformatics tools and pipelines. Because these tools vary in their output, function and complexity, and some are not well standardized, choosing a suitable pipeline to identify variants in NGS data is difficult. Here, we present NeST (NGS-analysis Toolkit), a modular consensus-based variant calling framework. NeST uses a combination of variant callers to overcome the potential biases of any individual method used alone. NeST consists of four modules that integrate open-source bioinformatics tools, a custom Variant Call Format (VCF) parser and a summarization utility to generate high-quality consensus variant calls. NeST was validated using targeted-amplicon deep sequencing data from 245 Plasmodium falciparum isolates to identify single-nucleotide polymorphisms conferring drug resistance. The results were verified using Sanger sequencing data for the same dataset in a supporting publication [28]. NeST offers a user-friendly pipeline for variant calling with standardized outputs and minimal computational demands, allowing easy deployment for use with various organisms and applications.
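
A consensus scheme of the kind NeST's abstract describes can be sketched in a few lines: a variant is kept if at least a minimum number of callers report it. The caller names, variant keys and threshold below are illustrative assumptions, not NeST's defaults.

```python
# Minimal consensus-based variant calling: majority vote across callers.
from collections import Counter

def consensus_calls(caller_outputs, min_callers=2):
    """caller_outputs: dict mapping caller name -> set of variant keys.
    Returns the variants supported by at least `min_callers` callers."""
    votes = Counter(v for calls in caller_outputs.values() for v in calls)
    return {v for v, n in votes.items() if n >= min_callers}

# Toy outputs for one isolate; contig names are illustrative P. falciparum IDs.
outputs = {
    "gatk":      {("Pf3D7_07", 403625, "A", "T"), ("Pf3D7_13", 1725259, "G", "A")},
    "samtools":  {("Pf3D7_07", 403625, "A", "T")},
    "freebayes": {("Pf3D7_07", 403625, "A", "T"), ("Pf3D7_13", 1725259, "G", "A")},
}
print(consensus_calls(outputs))  # both variants have >= 2 supporting callers
```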


2018 ◽  
Author(s):  
Walid Korani ◽  
Josh P. Clevenger ◽  
Ye Chu ◽  
Peggy Ozias-Akins

Abstract
Single-nucleotide polymorphisms (SNPs) have many advantages as molecular markers, since they are ubiquitous and co-dominant. However, discovering true SNPs, especially in polyploid species, is difficult. Peanut is an allopolyploid with a very low rate of true SNP calling. A large set of true and false SNPs identified from the Arachis 58K Affymetrix array was leveraged to train machine learning models to select true SNPs directly from sequence data. These models achieved accuracy rates above 80% on real peanut RNA-seq and whole-genome shotgun (WGS) re-sequencing data, higher than previously reported for polyploids. A 48K SNP array, Axiom Arachis2, was designed using this approach and showed 75% accuracy in calling SNPs across different tetraploid peanut genotypes. Using the method to simulate SNP variation in peanut, cotton, wheat and strawberry, we show that models built with our parameter sets achieve above 98% accuracy in selecting true SNPs. Additionally, models built with simulated genotypes selected true SNPs at above 80% accuracy on real peanut data, demonstrating that the models can be used even when real training data are unavailable. This work demonstrates an effective approach for calling highly reliable SNPs in polyploids using machine learning. A novel tool, SNP-ML (SNP-Machine Learning, pronounced “snip mill”), was developed to predict true SNPs from sequence data using the described models. SNP-ML additionally provides functionality to train new models not included in this study for customized use, designated SNP-MLer (SNP-Machine Learner, pronounced “snip miller”). SNP-ML is freely available for public use.
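
The underlying approach, training a classifier on array-validated true and false SNPs and applying it to fresh candidates from sequence data, can be sketched as follows. The feature set (read depth, alternate-allele ratio, mapping quality), the toy values and the gradient-boosting model are assumptions for illustration only; SNP-ML ships its own trained models and feature definitions.

```python
# Sketch of the SNP-ML idea: learn true-vs-false SNPs from validated examples.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Per-candidate features a polyploid-aware caller might expose: read depth,
# alternate-allele ratio and mapping quality. Homeologous (false) SNPs often
# show skewed allele ratios and degraded mapping quality.
X_train = np.array([
    [60, 0.48, 60.0],   # balanced het, clean mapping  -> true SNP
    [55, 0.24, 42.0],   # skewed ratio, poor mapping   -> homeolog artifact
    [80, 0.51, 59.0],   # true SNP
    [70, 0.19, 38.0],   # artifact
])
y_train = np.array([1, 0, 1, 0])  # labels from array validation

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Screen new candidates straight from sequence data.
candidates = np.array([[65, 0.47, 58.0], [50, 0.22, 40.0]])
print(model.predict(candidates))  # [1 0]: keep the first, drop the second
```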


2018 ◽  
Vol 20 (5) ◽  
pp. 1725-1733 ◽  
Author(s):  
Zhongneng Xu ◽  
Shuichi Asakawa

Abstract
Physiological RNA dynamics cause problems in transcriptome analysis: physiological RNA accumulation affects RNA quantification, and physiological RNA degradation affects analyses of RNA sequence length, feature sites and quantification. In this article, we review the effects of physiological RNA degradation and accumulation on the analysis of RNA sequencing data. These processes likely lead to incorrect estimates of transcript quantification, differential expression, co-expression, RNA decay rates, alternative splicing, transcription boundaries, novel genes, new single-nucleotide polymorphisms, small RNAs and gene fusions. Thus, the transcriptomic data obtained to date warrant further scrutiny. New and improved techniques and bioinformatics software are needed to produce accurate data in transcriptome research.

