A support vector machine for identification of single-nucleotide polymorphisms from next-generation sequencing data

2013 ◽  
Vol 29 (11) ◽  
pp. 1361-1366 ◽  
Author(s):  
B. D. O'Fallon ◽  
W. Wooderchak-Donahue ◽  
D. K. Crockett
2018 ◽  
Author(s):  
Shashidhar Ravishankar ◽  
Sarah E. Schmedes ◽  
Dhruviben S. Patel ◽  
Mateusz Plucinski ◽  
Venkatachalam Udhayakumar ◽  
...  

Abstract. Rapid advancements in next-generation sequencing (NGS) technologies have led to the development of numerous bioinformatics tools and pipelines. Because these tools vary in their output, function, and complexity, and some are not well standardized, choosing a suitable pipeline to identify variants in NGS data is difficult. Here, we present NeST (NGS-analysis Toolkit), a modular consensus-based variant calling framework. NeST uses a combination of variant callers to overcome the potential biases of any individual method used alone. NeST consists of four modules that integrate open-source bioinformatics tools, a custom Variant Call Format (VCF) parser, and a summarization utility to generate high-quality consensus variant calls. NeST was validated using targeted-amplicon deep sequencing data from 245 Plasmodium falciparum isolates to identify single-nucleotide polymorphisms conferring drug resistance. The results were verified using Sanger sequencing data for the same dataset in a supporting publication [28]. NeST offers a user-friendly pipeline for variant calling with standardized outputs and minimal computational demands, allowing easy deployment with various organisms and applications.
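The consensus idea described in this abstract can be illustrated with a minimal sketch. It is not NeST's implementation: the abstract does not say how the callers' outputs are combined, so the majority-style threshold (`min_callers`), the variant key, and the caller names below are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(
    calls_by_caller: Dict[str, Iterable[Variant]], min_callers: int = 2
) -> List[Variant]:
    """Return the variants reported by at least `min_callers` callers, sorted."""
    support: Counter = Counter()
    for variants in calls_by_caller.values():
        # Deduplicate within a caller so each caller votes at most once per variant.
        support.update(set(variants))
    return sorted(v for v, n in support.items() if n >= min_callers)


# Hypothetical caller names and calls, for illustration only.
example_calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G"), ("chr2", 40, "G", "A")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")],
}
consensus = consensus_calls(example_calls, min_callers=2)
strict = consensus_calls(example_calls, min_callers=3)
```

Raising `min_callers` trades sensitivity for specificity: with three callers, a threshold of 3 keeps only calls on which every caller agrees.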


2020 ◽  
Vol 11 ◽  
Author(s):  
Haiyong Zhao ◽  
Shuang Wang ◽  
Xiguo Yuan

Next-generation sequencing (NGS) technologies have provided great opportunities to analyze pathogenic microbes with high-resolution data. The main goal is to accurately detect microbial composition and abundances in a sample. However, high similarity among sequences from different species and the existence of sequencing errors pose various challenges. Numerous methods have been developed for quantifying microbial composition and abundance, but they are not versatile enough for the analysis of samples with mixtures of noise. In this paper, we propose a new computational method, PGMicroD, for the detection of pathogenic microbial composition in a sample using NGS data. The method first filters out reads that are potentially mapped incorrectly and extracts multiple species-related features from the 16S rRNA sequencing reads. It then trains a Support Vector Machine (SVM) classifier to predict the microbial composition. Finally, it assigns all multiple-mapped sequencing reads to the references of the predicted species to estimate the abundance of each species. The performance of PGMicroD is evaluated on both simulated and real sequencing data and is compared with several existing methods. The results demonstrate that our proposed method achieves superior performance. The software package of PGMicroD is available at https://github.com/BDanalysis/PGMicroD.
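The final step, estimating abundance from multiple-mapped reads once the species set has been predicted, can be sketched as follows. This is an illustrative simplification rather than PGMicroD's algorithm: the abstract does not state how a read shared between several predicted species is apportioned, so the equal split used here is an assumption, as are the function name and the species labels.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def estimate_abundance(
    read_hits: Iterable[Set[str]], predicted: Set[str]
) -> Dict[str, float]:
    """Estimate relative abundance of predicted species from read mappings.

    `read_hits` holds, for each read, the set of species references it maps to.
    Hits to species outside `predicted` are ignored; a read still shared by
    several predicted species is split equally among them (an assumption).
    """
    weights: Dict[str, float] = defaultdict(float)
    for hits in read_hits:
        kept = hits & predicted
        if not kept:
            continue  # read maps only to species that were not predicted
        share = 1.0 / len(kept)
        for species in kept:
            weights[species] += share
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()} if total else {}


# Hypothetical mappings: four reads, two predicted species.
reads = [
    {"species_a"},
    {"species_a", "species_b"},
    {"species_b", "species_c"},
    {"species_c"},
]
abundance = estimate_abundance(reads, predicted={"species_a", "species_b"})
```

Restricting multi-mapped reads to the predicted species is what resolves the third read here: it maps to both `species_b` and `species_c`, but only `species_b` was predicted, so it counts fully toward `species_b`.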


Gene Reports ◽  
2017 ◽  
Vol 6 ◽  
pp. 81-86 ◽  
Author(s):  
Kiran Dashrath Rasal ◽  
Vemlawada Chakrapani ◽  
Amrendra Kumar Pandey ◽  
Avinash Rambhau Rasal ◽  
Jitendra K. Sundaray ◽  
...  

2020 ◽  
Author(s):  
Stephan Weißbach ◽  
Stanislav Jur'Evic Sys ◽  
Charlotte Hewel ◽  
Hristo Todorov ◽  
Susann Schweiger ◽  
...  

Abstract. Background: Next Generation Sequencing (NGS) is the foundation of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. To methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects, each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform's impact. Results: The number of detected variants and variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (InDels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in absolute InDel numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups. Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
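The two quantities this abstract reasons about, concordance between setups and variants called by only one setup, can be made concrete with a short sketch. The abstract does not define its concordance metric, so the Jaccard index used here is an assumption, and the setup names and variants are hypothetical.

```python
from typing import Dict, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def jaccard_concordance(a: Set[Variant], b: Set[Variant]) -> float:
    """Shared calls divided by all calls made by either setup."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def unique_calls(calls_by_setup: Dict[str, Set[Variant]]) -> Dict[str, Set[Variant]]:
    """For each setup, return the variants that no other setup called."""
    unique: Dict[str, Set[Variant]] = {}
    for name, calls in calls_by_setup.items():
        others: Set[Variant] = set()
        for other_name, other_calls in calls_by_setup.items():
            if other_name != name:
                others |= other_calls
        unique[name] = calls - others
    return unique


# Hypothetical call sets for one individual under three setups.
v1, v2 = ("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")
v3, v4 = ("chr2", 40, "G", "GA"), ("chr3", 7, "T", "C")
setups = {
    "setup_x": {v1, v2, v3},
    "setup_y": {v1, v2},
    "setup_z": {v1, v4},
}
concordance_xy = jaccard_concordance(setups["setup_x"], setups["setup_y"])
singletons = unique_calls(setups)
```

Computing these per variant class (for example, SNPs and InDels separately) would mirror the comparison described in the Results.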

