VEF: a variant filtering tool based on ensemble methods

2019 ◽  
Vol 36 (8) ◽  
pp. 2328-2336
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract Motivation Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known ‘true’ variants, i.e. gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics). Results For the analysis, we used whole genome sequencing (WGS) Human datasets for which the gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared with VQSR (approximately 4 min for VEF versus 50 min for VQSR when filtering the single nucleotide polymorphisms of a WGS Human sample). Availability and Implementation Code and scripts available at: github.com/ChuanyiZ/vef. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract Motivation Variant discovery is crucial in medical and clinical research, especially in the setting of personalized medicine. As such, precision in variant identification is paramount. However, variants identified by current genomic analysis pipelines contain many false positives (i.e., incorrectly called variants). These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF), both proposed by GATK. However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem. This is possible by training on variant call data for which the set of “true” variants is known, i.e., a gold standard exists. Hence, we can classify each variant in the training VCF file as true or false using the gold standard, and further use the annotations of each variant as features for the classification problem. Once trained, VEF can be directly applied to filter the variants contained in a given VCF file. Analysis of several ensemble methods revealed random forest as offering the best performance, and hence VEF uses a random forest for the classification task. Results After training VEF on a Whole Genome Sequencing (WGS) Human dataset of sample NA12878, we tested its performance on a WGS Human dataset of sample NA24385. For these two samples, the set of high-confidence variants has been produced and made available. Results show that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, and when the training and testing datasets differ either in coverage or in the sequencing machine that was used to generate the data. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared to VQSR (approximately 50 minutes for VQSR versus 4 minutes for VEF when filtering the SNPs of WGS Human sample NA24385). Code and scripts available at: github.com/ChuanyiZ/vef.
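
The labelling step described in this abstract (marking each training variant as true or false against a gold standard and using its annotations as features) can be illustrated with a minimal sketch. This is not the VEF implementation: the function names `parse_info` and `label_variants`, the toy VCF records, the choice of matching on (chromosome, position, REF, ALT), and the use of `None` for missing annotations are all assumptions made for illustration.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'QD=12.3;FS=0.5;DB' into a dict of floats."""
    features = {}
    for item in info_field.split(";"):
        if "=" not in item:
            continue  # flag annotations without a value are skipped in this sketch
        key, value = item.split("=", 1)
        try:
            features[key] = float(value)
        except ValueError:
            pass  # non-numeric or multi-valued annotations are skipped here
    return features


def label_variants(vcf_lines, gold_standard, feature_names):
    """Return (feature rows, labels) for the variant records in a VCF.

    A variant is labelled 1 if its (chrom, pos, ref, alt) key is in the
    gold standard set, else 0. Missing annotations are stored as None.
    """
    rows, labels = [], []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        info = parse_info(fields[7])
        rows.append([info.get(name) for name in feature_names])
        labels.append(1 if (chrom, int(pos), ref, alt) in gold_standard else 0)
    return rows, labels


# Hypothetical toy data: two called variants, one of which is in the gold standard.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tQD=20.5;FS=1.2",
    "chr1\t200\t.\tC\tT\t12\t.\tQD=2.1",
]
gold = {("chr1", 100, "A", "G")}
X, y = label_variants(toy_vcf, gold, ["QD", "FS"])
print(X, y)
```

The resulting feature rows and labels are the kind of input a random forest classifier would be trained on; how missing values are handled before training is left open here.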


2019 ◽  
Vol 35 (22) ◽  
pp. 4806-4808 ◽  
Author(s):  
Hein Chun ◽  
Sangwoo Kim

Abstract Summary Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation BAMixChecker is available at https://github.com/heinc1010/BAMixChecker Supplementary information Supplementary data are available at Bioinformatics online.
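
The abstract does not define the genotype-concordance score, so the following is only a sketch of one plausible form of such a score: the fraction of SNP sites genotyped in both samples at which the calls agree. The function name `genotype_concordance`, the site identifiers and the genotype strings are hypothetical and are not taken from BAMixChecker.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of SNP sites genotyped in both samples whose genotypes agree.

    Each argument maps a site identifier to a genotype string such as '0/1'.
    Returns None when the two samples share no genotyped site.
    """
    shared = [site for site in calls_a if site in calls_b]
    if not shared:
        return None
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return matches / len(shared)


# Hypothetical genotype calls for three BAM files.
sample_1 = {"rs1": "0/1", "rs2": "1/1", "rs3": "0/0", "rs4": "0/1"}
sample_2 = {"rs1": "0/1", "rs2": "1/1", "rs3": "0/0", "rs4": "0/0"}
sample_3 = {"rs1": "1/1", "rs2": "0/0", "rs3": "0/1"}

print(genotype_concordance(sample_1, sample_2))  # high: likely the same individual
print(genotype_concordance(sample_1, sample_3))  # low: likely different individuals
```

Under this reading, files expected to come from the same individual but scoring low would be flagged as candidate mismatches; the threshold for such a call is not given in the abstract.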


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i831-i839
Author(s):  
Dong-gi Lee ◽  
Myungjun Kim ◽  
Sang Joon Son ◽  
Chang Hyung Hong ◽  
Hyunjung Shin

Abstract Motivation Recently, various approaches for diagnosing and treating dementia have received significant attention, especially in identifying key genes that are crucial for dementia. If the mutations of such key genes could be tracked, it would be possible to predict the time of onset of dementia and significantly aid in developing drugs to treat dementia. However, gene finding involves tremendous cost, time and effort. To alleviate these problems, research on utilizing computational biology to decrease the search space of candidate genes is actively conducted. In this study, we propose a framework in which diseases, genes and single-nucleotide polymorphisms are represented by a layered network, and key genes are predicted by a machine learning algorithm. The algorithm utilizes a network-based semi-supervised learning model that can be applied to layered data structures. Results The proposed method was applied to a dataset extracted from public databases related to diseases and genes with data collected from 186 patients. A portion of key genes obtained using the proposed method was verified in silico through PubMed literature, and the remaining genes were left as possible candidate genes. Availability and implementation The code for the framework will be available at http://www.alphaminers.net/. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (21) ◽  
pp. 4442-4444 ◽  
Author(s):  
Jia-Xing Yue ◽  
Gianni Liti

Abstract Summary Simulated genomes with pre-defined and random genomic variants can be very useful for benchmarking genomic and bioinformatics analyses. Here we introduce simuG, a lightweight tool for simulating the full spectrum of genomic variants (single nucleotide polymorphisms, insertions/deletions, copy number variants, inversions and translocations) for any organism (including human). The simplicity and versatility of simuG make it a unique general-purpose genome simulator for a wide range of simulation-based applications. Availability and implementation Code in Perl along with user manual and testing data is available at https://github.com/yjx1217/simuG. This software is free for use under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Hemanoel Passarelli-Araujo ◽  
Jussara K. Palmeiro ◽  
Kanhu C. Moharana ◽  
Francisnei Pedrosa-Silva ◽  
Libera M. Dalla-Costa ◽  
...  

ABSTRACT Klebsiella aerogenes is an important pathogen in healthcare-associated infections. Nevertheless, in comparison to other clinically important pathogens, K. aerogenes population structure, genetic diversity, and pathogenicity remain poorly understood. Here, we elucidate K. aerogenes clonal complexes (CCs) and genomic features associated with resistance and virulence. We present a detailed description of the population structure of K. aerogenes based on 97 publicly available genomes by using both multilocus sequence typing and single nucleotide polymorphisms extracted from the core genome. We also assessed virulence and resistance profiles using VFDB and CARD, respectively. We show that K. aerogenes has an open pangenome and a large effective population size, which account for its high genomic diversity and support that negative selection prevents fixation of most deleterious alleles. The population is structured in at least ten CCs, including two novel ones identified here, CC9 and CC10. The repertoires of resistance genes comprise a high number of antibiotic efflux proteins as well as narrow and extended spectrum β-lactamases. Regarding the population structure, we identified two clusters based on virulence profile due to the presence of the toxin-encoding clb operon and the siderophore production genes, irp and ybt. Notably, CC3 comprises the majority of K. aerogenes isolates associated with hospital outbreaks, emphasizing the importance of its constant monitoring. Collectively, our results can be useful in the development of new therapeutic and surveillance strategies worldwide.


2019 ◽  
Author(s):  
Sierra S Nishizaki ◽  
Natalie Ng ◽  
Shengcheng Dong ◽  
Robert S Porter ◽  
Cody Morterud ◽  
...  

Abstract Motivation Genome-wide association studies have revealed that 88% of disease-associated single-nucleotide polymorphisms (SNPs) reside in noncoding regions. However, noncoding SNPs remain understudied, partly because they are challenging to prioritize for experimental validation. To address this deficiency, we developed the SNP effect matrix pipeline (SEMpl). Results SEMpl estimates transcription factor-binding affinity by observing differences in chromatin immunoprecipitation followed by deep sequencing signal intensity for SNPs within functional transcription factor-binding sites (TFBSs) genome-wide. By cataloging the effects of every possible mutation within the TFBS motif, SEMpl can predict the consequences of SNPs to transcription factor binding. This knowledge can be used to identify potential disease-causing regulatory loci. Availability and implementation SEMpl is available from https://github.com/Boyle-Lab/SEM_CPP. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (17) ◽  
pp. 3160-3162
Author(s):  
Davoud Torkamaneh ◽  
Jérôme Laroche ◽  
Istvan Rajcan ◽  
François Belzile

Abstract Motivation Reduced-representation sequencing is a genome-wide scanning method for simultaneous discovery and genotyping of thousands to millions of single nucleotide polymorphisms that is used across a wide range of species. However, in this method a reproducible but very small fraction of the genome is captured for sequencing, while the resulting reads are typically aligned against the entire reference genome. Results Here we present a skinny reference genome approach in which a simplified reference genome is used to decrease computing time for data processing and to increase single nucleotide polymorphism counts and accuracy. A skinny reference genome can be integrated into any reduced-representation sequencing analytical pipeline. Availability and implementation https://bitbucket.org/jerlar73/SRG-Extractor. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (19) ◽  
pp. 4957-4959
Author(s):  
David B Blumenthal ◽  
Lorenzo Viola ◽  
Markus List ◽  
Jan Baumbach ◽  
Paolo Tieri ◽  
...  

Abstract Summary Simulated data are crucial for evaluating epistasis detection tools in genome-wide association studies. Existing simulators are limited, as they do not account for linkage disequilibrium (LD), support limited interaction models of single nucleotide polymorphisms (SNPs) and only dichotomous phenotypes or depend on proprietary software. In contrast, EpiGEN supports SNP interactions of arbitrary order, produces realistic LD patterns and generates both categorical and quantitative phenotypes. Availability and implementation EpiGEN is implemented in Python 3 and is freely available at https://github.com/baumbachlab/epigen. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Rebecca M. Davidson ◽  
Sophie E. Nick ◽  
Sara M. Kammlade ◽  
Sruthi Vasireddy ◽  
Natalia Weakly ◽  
...  

Whole genome sequencing (WGS) has recently been used to investigate acquisition of Mycobacterium abscessus (MABC). Investigators have reached conflicting conclusions about the meaning of genetic distances for interpretation of person-to-person transmission. Existing genomic studies were limited by a lack of WGS from environmental MABC isolates. In this study, we retrospectively analyzed the core and accessory genomes of 26 M. abscessus subsp. abscessus (MAA) isolates collected over seven years. Clinical isolates (n=22) were obtained from a large hospital-associated outbreak of MAA, the outbreak hospital before or after the outbreak, a neighboring hospital, and two outside laboratories. Environmental MAA isolates (n=4) were obtained from outbreak hospital water outlets. Phylogenomic analysis of study isolates revealed three clades with pairwise genetic distances ranging from 0–135 single nucleotide polymorphisms (SNPs). Compared to a reference environmental outbreak isolate, all seven clinical outbreak isolates and the remaining three environmental isolates had highly similar core and accessory genomes, differing by up to 7 SNPs and a median of 1.6% accessory genes, respectively. Although genomic comparisons of 15 non-outbreak clinical isolates revealed greater heterogeneity, five (33%) isolates had fewer than 20 SNPs compared to the reference environmental isolate, including two unrelated outside laboratory isolates with less than 4% accessory genome variation. Detailed genomic comparisons confirmed environmental acquisition of outbreak isolates of MAA. SNP distances alone, however, did not clearly differentiate the mechanism of acquisition of outbreak versus non-outbreak isolates. We conclude that successful investigation of MAA clusters requires molecular and epidemiologic components, ideally complemented by environmental sampling.
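
To make the notion of a pairwise SNP distance concrete, here is a minimal sketch that counts differing positions between aligned sequences while ignoring ambiguous or gapped sites. It assumes a simple Hamming-style count over a core-genome alignment; the abstract does not state how distances were computed, and the function name `snp_distance`, the isolate names and the short sequences are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an ambiguous base or gap
    (any character in `ignore`) are not counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


# Hypothetical aligned core-genome fragments for four isolates.
isolates = {
    "env_ref": "ACGTACGTAC",
    "clinical_1": "ACGTACGTAC",
    "clinical_2": "ACGAACGTNC",
    "clinical_3": "TCGAACCTAC",
}

# Distance for every unordered pair of isolates.
distances = {
    (a, b): snp_distance(isolates[a], isolates[b])
    for a, b in combinations(isolates, 2)
}
for pair, dist in distances.items():
    print(pair, dist)
```

As the abstract notes, such distances alone did not clearly separate outbreak from non-outbreak isolates, so a matrix like this would be one input alongside accessory-genome and epidemiologic data.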

