BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Hein Chun; Sangwoo Kim

doi:10.1093/bioinformatics/btz479

Bioinformatics ◽

10.1093/bioinformatics/btz479 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4806-4808 ◽

Cited By ~ 2

Author(s):

Hein Chun ◽

Sangwoo Kim

Keyword(s):

Genomic Analysis ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

RNA-Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Frequent Problem ◽

Generation Sequencing ◽

User Intervention ◽

Genotype Concordance

Abstract Summary: Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation: BAMixChecker is available at https://github.com/heinc1010/BAMixChecker. Supplementary information: Supplementary data are available at Bioinformatics online.
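As a rough illustration of the genotype-concordance idea mentioned in the abstract above, the following Python sketch scores every pair of samples by the fraction of shared SNP sites with identical genotypes. It is not BAMixChecker's implementation: the function names, sample names, site IDs and genotype strings are all hypothetical, and the real tool's SNP selection, scoring and thresholds may differ.

```python
from itertools import combinations


def concordance(gt_a, gt_b):
    """Fraction of shared SNP sites at which two samples have the same genotype."""
    shared = [site for site in gt_a if site in gt_b]
    if not shared:
        return 0.0
    same = sum(1 for site in shared if gt_a[site] == gt_b[site])
    return same / len(shared)


def pairwise_concordance(samples):
    """Score every unordered pair of samples (keys are sorted name pairs)."""
    return {
        (a, b): concordance(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Hypothetical toy genotypes: site ID -> genotype string.
samples = {
    "P1_tumor": {"rs1": "0/1", "rs2": "1/1", "rs3": "0/0", "rs4": "0/1"},
    "P1_normal": {"rs1": "0/1", "rs2": "1/1", "rs3": "0/0", "rs4": "0/1"},
    "P2_tumor": {"rs1": "0/0", "rs2": "0/1", "rs3": "1/1", "rs4": "0/1"},
}

scores = pairwise_concordance(samples)
for pair, score in scores.items():
    print(pair, score)
```

In this toy input the matched pair scores 1.0 and the two unrelated pairs score 0.25. A low score for a pair that is expected to match would hint at a swapped or orphan sample.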

A support vector machine for identification of single-nucleotide polymorphisms from next-generation sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btt172 ◽

2013 ◽

Vol 29 (11) ◽

pp. 1361-1366 ◽

Cited By ~ 26

Author(s):

B. D. O'Fallon ◽

W. Wooderchak-Donahue ◽

D. K. Crockett

Keyword(s):

Support Vector Machine ◽

Next Generation Sequencing ◽

Single Nucleotide Polymorphisms ◽

Next Generation Sequencing Data ◽

Support Vector ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Single Nucleotide ◽

Generation Sequencing

VEF: a variant filtering tool based on ensemble methods

Bioinformatics ◽

10.1093/bioinformatics/btz952 ◽

2019 ◽

Vol 36 (8) ◽

pp. 2328-2336

Author(s):

Chuanyi Zhang ◽

Idoia Ochoa

Keyword(s):

Missing Values ◽

Genomic Analysis ◽

Ensemble Methods ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Significant Saving ◽

Variant Call ◽

Variant Filtering ◽

Human Sample

Abstract Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can be potentially eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known ‘true’ variants, i.e. gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given Variants Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). Results: For the analysis, we used whole genome sequencing (WGS) Human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared with VQSR (4 versus 50 min approximately for filtering the single nucleotide polymorphisms of a WGS Human sample). Availability and Implementation: Code and scripts available at: github.com/ChuanyiZ/vef. Supplementary information: Supplementary data are available at Bioinformatics online.

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

10.21203/rs.3.rs-50691/v2 ◽

2020 ◽

Author(s):

Stephan Weißbach ◽

Stanislav Jur'evic Sys ◽

Charlotte Hewel ◽

Hristo Todorov ◽

Susann Schweiger ◽

...

Keyword(s):

Next Generation Sequencing ◽

GC Content ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Illumina HiSeq ◽

Cross Sectional ◽

Single Nucleotide ◽

Alu Elements ◽

Generation Sequencing

Abstract Background: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (InDels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in InDel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

BMC Genomics ◽

10.1186/s12864-020-07362-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Stephan Weißbach ◽

Stanislav Sys ◽

Charlotte Hewel ◽

Hristo Todorov ◽

Susann Schweiger ◽

...

Keyword(s):

Next Generation Sequencing ◽

GC Content ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Illumina HiSeq ◽

Cross Sectional ◽

Single Nucleotide ◽

Alu Elements ◽

Generation Sequencing

Abstract Background: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform’s impact. Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.

Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

10.21203/rs.3.rs-50691/v3 ◽

2021 ◽

Author(s):

Stephan Weißbach ◽

Stanislav Jur'evic Sys ◽

Charlotte Hewel ◽

Hristo Todorov ◽

Susann Schweiger ◽

...

Keyword(s):

Next Generation Sequencing ◽

GC Content ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Illumina HiSeq ◽

Cross Sectional ◽

Single Nucleotide ◽

Alu Elements ◽

Generation Sequencing

Abstract Background: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.

Next-generation Sequence-analysis Toolkit (NeST): A standardized bioinformatics framework for analyzing Single Nucleotide Polymorphisms in next-generation sequencing data

10.1101/323535 ◽

2018 ◽

Author(s):

Shashidhar Ravishankar ◽

Sarah E. Schmedes ◽

Dhruviben S. Patel ◽

Mateusz Plucinski ◽

Venkatachalam Udhayakumar ◽

...

Keyword(s):

Next Generation Sequencing ◽

Single Nucleotide Polymorphisms ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Single Nucleotide ◽

Bioinformatics Tools ◽

Generation Sequencing

Abstract Rapid advancements in next-generation sequencing (NGS) technologies have led to the development of numerous bioinformatics tools and pipelines. As these tools vary in their output function and complexity and some are not well-standardized, it is harder to choose a suitable pipeline to identify variants in NGS data. Here, we present NeST (NGS-analysis Toolkit), a modular consensus-based variant calling framework. NeST uses a combination of variant callers to overcome potential biases of an individual method used alone. NeST consists of four modules that integrate open-source bioinformatics tools, a custom Variant Calling Format (VCF) parser and a summarization utility, and that generate high-quality consensus variant calls. NeST was validated using targeted-amplicon deep sequencing data from 245 Plasmodium falciparum isolates to identify single-nucleotide polymorphisms conferring drug resistance. The results were verified using Sanger sequencing data for the same dataset in a supporting publication [28]. NeST offers a user-friendly pipeline for variant calling with standardized outputs and minimal computational demands for easy deployment for use with various organisms and applications.
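The consensus idea described in the abstract above can be sketched in a few lines of Python: keep a variant only if enough independent callers report it. This is a minimal illustration under stated assumptions, not NeST's code. The caller names, the `min_callers` threshold and the `(chrom, pos, ref, alt)` tuple representation are hypothetical.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` callers, sorted."""
    counts = Counter()
    for variants in calls_by_caller.values():
        # Use a set so a caller that lists a variant twice is counted once.
        counts.update(set(variants))
    return sorted(v for v, n in counts.items() if n >= min_callers)


# Hypothetical calls: each variant is (chrom, pos, ref, alt).
calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G"), ("chr2", 42, "G", "A")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
}

consensus = consensus_calls(calls, min_callers=2)
print(consensus)
```

Raising `min_callers` trades sensitivity for specificity: requiring all three callers here would keep only the variant at chr1:100.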

Machine learning as an effective method for identifying true SNPs in polyploid plants

10.1101/274407 ◽

2018 ◽

Cited By ~ 1

Author(s):

Walid Korani ◽

Josh P. Clevenger ◽

Ye Chu ◽

Peggy Ozias-Akins

Keyword(s):

Machine Learning ◽

Sequence Data ◽

SNP Array ◽

Real Data ◽

Large Set ◽

Nucleotide Polymorphisms ◽

RNA-Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Accuracy Rates

Abstract Single Nucleotide Polymorphisms (SNPs) have many advantages as molecular markers since they are ubiquitous and co-dominant. However, the discovery of true SNPs especially in polyploid species is difficult. Peanut is an allopolyploid, which has a very low rate of true SNP calling. A large set of true and false SNPs identified from the Arachis 58k Affymetrix array was leveraged to train machine learning models to select true SNPs straight from sequence data. These models achieved accuracy rates of above 80% using real peanut RNA-seq and whole genome shotgun (WGS) re-sequencing data, which is higher than previously reported for polyploids. A 48K SNP array, Axiom Arachis2, was designed using the approach which revealed 75% accuracy of calling SNPs from different tetraploid peanut genotypes. Using the method to simulate SNP variation in peanut, cotton, wheat, and strawberry, we show that models built with our parameter sets achieve above 98% accuracy in selecting true SNPs. Additionally, models built with simulated genotypes were able to select true SNPs at above 80% accuracy using real peanut data, demonstrating that our model can be used even if real data are not available to train the models. This work demonstrates an effective approach for calling highly reliable SNPs from polyploids using machine learning. A novel tool was developed for predicting true SNPs from sequence data, designated as SNP-ML (SNP-Machine Learning, pronounced “snip mill”), using the described models. SNP-ML additionally provides functionality to train new models not included in this study for customized use, designated SNP-MLer (SNP-Machine Learner, pronounced “snip miller”). SNP-ML is freely available for public use.

Physiological RNA dynamics in RNA-Seq analysis

Briefings in Bioinformatics ◽

10.1093/bib/bby045 ◽

2018 ◽

Vol 20 (5) ◽

pp. 1725-1733 ◽

Cited By ~ 1

Author(s):

Zhongneng Xu ◽

Shuichi Asakawa

Keyword(s):

RNA Degradation ◽

Decay Rates ◽

Sequence Length ◽

Nucleotide Polymorphisms ◽

RNA-Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

RNA Dynamics ◽

RNA Quantification ◽

RNA Accumulation

Abstract Physiological RNA dynamics cause problems in transcriptome analysis. Physiological RNA accumulation affects the analysis of RNA quantification, and physiological RNA degradation affects the analysis of the RNA sequence length, feature site and quantification. In the present article, we review the effects of physiological degradation and accumulation of RNA on analysing RNA sequencing data. Physiological RNA accumulation and degradation probably led to such phenomena as incorrect estimations of transcription quantification, differential expressions, co-expressions, RNA decay rates, alternative splicing, boundaries of transcription, novel genes, new single-nucleotide polymorphisms, small RNAs and gene fusion. Thus, the transcriptomic data obtained up to date warrant further scrutiny. New and improved techniques and bioinformatics software are needed to produce accurate data in transcriptome research.

Quantification of fetal DNA in the plasma of pregnant women using next generation sequencing of frequent single nucleotide polymorphisms

Bulletin of Russian State Medical University ◽

10.24075/brsmu.2018.031 ◽

2018 ◽

pp. 29-33

Author(s):

J. Shubina ◽

T. Jankevic ◽

A. Yu. Goltsov ◽

I. S. Mukosey ◽

...

Keyword(s):

Next Generation Sequencing ◽

Single Nucleotide Polymorphisms ◽

Pregnant Women ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Single Nucleotide ◽

Fetal DNA ◽

Generation Sequencing

Risk prediction and marker selection in nonsynonymous single nucleotide polymorphisms using whole genome sequencing data

Animal Cells and Systems ◽

10.1080/19768354.2020.1860125 ◽

2020 ◽

Vol 24 (6) ◽

pp. 321-328

Author(s):

Young-Sup Lee ◽

KyeongHye Won ◽

Donghyun Shin ◽

Jae-Don Oh

Keyword(s):

Single Nucleotide Polymorphisms ◽

Whole Genome Sequencing ◽

Risk Prediction ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

Marker Selection
