ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

2018 ◽  
Author(s):  
Vitor C. Piro ◽  
Temesgen H. Dadi ◽  
Enrico Seiler ◽  
Knut Reinert ◽  
Bernhard Y. Renard

Motivation: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. Results: Motivated by those limitations, we created ganon, a k-mer based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires less than 55 minutes to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-Score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i12-i20 ◽  
Author(s):  
Vitor C Piro ◽  
Temesgen H Dadi ◽  
Enrico Seiler ◽  
Knut Reinert ◽  
Bernhard Y Renard

Motivation: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. Results: Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability and implementation: The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. Supplementary information: Supplementary data are available at Bioinformatics online.
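
The abstract names Interleaved Bloom Filters and a k-mer counting/filtering scheme but gives no implementation details. The following is a minimal Python sketch of the general idea only, not ganon's actual code: the class name, the hash choice (BLAKE2b), the bin sizes, and the `min_fraction` threshold are all assumptions made for illustration. Each reference group gets one bin, the bins' Bloom filters are stored interleaved so that one lookup reports membership in every bin, and a read is assigned to the bins that contain at least a chosen fraction of its k-mers.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter (illustrative, not ganon's implementation).

    Row p stores one bit per bin, packed into a Python int, so a single
    row lookup answers "is this k-mer possibly in bin b?" for all bins.
    """

    def __init__(self, num_bins, bits_per_bin, num_hashes=3):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # One row index per hash function; the salt distinguishes the hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def insert(self, kmer, bin_id):
        for p in self._positions(kmer):
            self.rows[p] |= 1 << bin_id

    def query(self, kmer):
        # AND the rows together: bit b survives only if every hash hit bin b.
        mask = (1 << self.num_bins) - 1
        for p in self._positions(kmer):
            mask &= self.rows[p]
        return mask


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def classify(read, ibf, k, min_fraction=0.8):
    """Return bins holding at least min_fraction of the read's k-mers.

    min_fraction is a hypothetical stand-in for a counting/filtering rule.
    """
    counts = [0] * ibf.num_bins
    total = 0
    for km in kmers(read, k):
        total += 1
        mask = ibf.query(km)
        for b in range(ibf.num_bins):
            if (mask >> b) & 1:
                counts[b] += 1
    if total == 0:
        return []
    return [b for b, c in enumerate(counts) if c >= min_fraction * total]


# Usage: two made-up reference sequences, one bin each.
K = 8
references = [
    "ACGTTGCAAGGCTTAACCGGATATCGCGTTAAGGCCTTAGCA",
    "TTTTGGGGCCCCAAAATGTGTGACACACTCTCTAGAGAGCTG",
]
ibf = InterleavedBloomFilter(num_bins=len(references), bits_per_bin=4096)
for bin_id, ref in enumerate(references):
    for km in kmers(ref, K):
        ibf.insert(km, bin_id)

print(classify(references[0][5:25], ibf, K))
```

Because Bloom filters have no false negatives, a read drawn from a reference is always reported for that reference's bin; spurious extra bins are possible but become unlikely as `bits_per_bin` grows relative to the number of inserted k-mers.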



2019 ◽  
Author(s):  
David Pellow ◽  
Itzik Mizrahi ◽  
Ron Shamir

Background: Many bacteria contain plasmids, but distinguishing contigs that originate from a plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. Results: We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers trained on the most current set of known plasmid sequences for different sequence lengths. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, while using less time and memory. Conclusions: PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available from: https://github.com/Shamir-Lab/PlasClass



2018 ◽  
Vol 7 (22) ◽  
Author(s):  
Teng Long ◽  
Po Yee Wong ◽  
Wendy C. S. Ho ◽  
Robert D. Burk ◽  
Paul K. S. Chan ◽  
...  

The complete genomes of six Macaca mulatta papillomavirus types isolated from genital sites of rhesus monkeys were characterized, and less than 72% identity with the complete L1 genes of known papillomaviruses was found. Macaca mulatta papillomavirus type 2 (MmPV2), MmPV3, and MmPV6 cluster into the genus Alphapapillomavirus, and MmPV4, MmPV5, and MmPV7 cluster into the genus Gammapapillomavirus.



2004 ◽  
Vol 1 (1) ◽  
pp. 131-142
Author(s):  
Ljupčo Todorovski ◽  
Sašo Džeroski ◽  
Peter Ljubič

Both equation discovery and regression methods aim at inducing models of numerical data. While the equation discovery methods are usually evaluated in terms of comprehensibility of the induced model, the emphasis of the regression methods evaluation is on their predictive accuracy. In this paper, we present Ciper, an efficient method for discovery of polynomial equations and empirically evaluate its predictive performance on standard regression tasks. The evaluation shows that polynomials compare favorably to linear and piecewise regression models, induced by the existing state-of-the-art regression methods, in terms of degree of fit and complexity.



2018 ◽  
Vol 7 (11) ◽  
Author(s):  
Yuhuan Qiu ◽  
Zehui Zhao ◽  
Jianming Qiu

We are reporting the sequences of seven complete genomes of parvovirus B19, which were extracted from human plasma specimens collected in the United States. The seven B19 genome sequences, which are 5,596 nucleotides long and include the inverted terminal repeats (ITRs), share an identity of 96.73%.
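
The abstract does not say how the identity figure was computed. As a rough illustration only, percent identity between two already-aligned sequences is commonly taken as the share of alignment columns with matching characters; the function name and the gap handling below are assumptions, not the authors' procedure.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences of equal length.

    Assumed convention: columns where both sequences have a gap ('-')
    are ignored, and a gap opposite a base counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


print(percent_identity("ACGT", "ACGA"))  # 3 of 4 columns match
```

Other conventions (for example, excluding all gapped columns) give slightly different values, so a reported identity is only comparable when the convention is known.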



2018 ◽  
Author(s):  
Avantika Lal ◽  
Keli Liu ◽  
Robert Tibshirani ◽  
Arend Sidow ◽  
Daniele Ramazzotti

Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using standard metrics. We then apply SparseSignatures to whole genome sequences of 147 tumors from pancreatic cancer, discovering 8 signatures in addition to the background.



2016 ◽  
Vol 57 ◽  
pp. 421-464 ◽  
Author(s):  
Arnaud Malapert ◽  
Jean-Charles Régin ◽  
Mohamed Rezgui

We introduce an Embarrassingly Parallel Search (EPS) method for solving constraint problems in parallel, and we show that this method matches or even outperforms state-of-the-art algorithms on a number of problems using various computing infrastructures. EPS is a simple method in which a master decomposes the problem into many disjoint subproblems which are then solved independently by workers. Our approach has three advantages: it is an efficient method; it involves almost no communication or synchronization between workers; and it is easy to implement because the master and the workers rely on an underlying constraint solver without requiring any modification to it. This paper describes the method and its applications to various constraint problems (satisfaction, enumeration, optimization). We show that our method can be adapted to different underlying solvers (Gecode, Choco2, OR-tools) on different computing infrastructures (multi-core, data centers, cloud computing). The experiments cover unsatisfiable, enumeration and optimization problems, but do not cover first solution search because it makes the results hard to analyze. The same variability can be observed for optimization problems, but to a lesser extent because the optimality proof is required. EPS offers good average performance, and matches or outperforms other available parallel implementations of Gecode as well as some solver portfolios. Moreover, we perform an in-depth analysis of the various factors that make this approach efficient as well as the anomalies that can occur. Last, we show that the decomposition is a key component for efficiency and load balancing.
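
The master/worker decomposition described in this abstract can be sketched on a small enumeration problem. The example below is an illustrative assumption, not the paper's implementation: it counts N-Queens solutions with a plain backtracking search in place of a real constraint solver such as Gecode, and the function names and the `depth` and `workers` parameters are hypothetical. The master fixes the first `depth` rows in every consistent way, which yields disjoint subproblems, and each worker counts the solutions of one subproblem with no communication.

```python
from concurrent.futures import ThreadPoolExecutor


def is_safe(cols, col):
    """True if a queen in the next row at column `col` attacks none in `cols`."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def decompose(n, depth):
    """Master: list every consistent assignment of the first `depth` rows.

    The prefixes are pairwise distinct, so the subproblems are disjoint.
    """
    prefixes = [()]
    for _ in range(depth):
        prefixes = [
            p + (col,) for p in prefixes for col in range(n) if is_safe(p, col)
        ]
    return prefixes


def count_completions(n, prefix):
    """Worker: count full solutions that extend `prefix`, independently."""
    if len(prefix) == n:
        return 1
    return sum(
        count_completions(n, prefix + (col,))
        for col in range(n)
        if is_safe(prefix, col)
    )


def solve_eps(n, depth=2, workers=4):
    """Decompose, solve subproblems in a pool, and add up the counts."""
    subproblems = decompose(n, depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: count_completions(n, p), subproblems))


print(solve_eps(8))
```

Threads are used here only to keep the sketch short; in CPython they do not run CPU-bound work in parallel, so a process pool or separate machines would be needed for an actual speed-up. Producing many more subproblems than workers, as `depth` controls here, is what lets uneven subproblem sizes average out.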



2018 ◽  
Vol 6 (4) ◽  
Author(s):  
Lindsey A. Moser ◽  
Lauren M. Oldfield ◽  
Nadia Fedorova ◽  
Vinita Puri ◽  
Susmita Shrivastava ◽  
...  

We report 26 complete genomes of Zika virus (ZIKV) isolated after passaging the Zika virus strain FLR in mosquito (C6/36) and mammalian (Vero) cell lines. The consensus ZIKV genomes we recovered show greater than 99% nucleotide identity with each other and with the FLR strain used as input.



Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-19
Author(s):  
Xinming Zhang ◽  
Doudou Wang ◽  
Haiyan Chen ◽  
Wentao Mao ◽  
Shangwang Liu ◽  
...  

Laplacian Biogeography-Based Optimization (LxBBO) is a BBO variant that substantially improves BBO’s performance. On some complex problems, however, it suffers from drawbacks such as poor performance, weak operability, and high complexity, so an improved LxBBO (ILxBBO) is proposed. First, a two-global-best guiding operator is created to guide the worst habitat, mainly to enhance the exploitation of LxBBO. Second, a dynamic two-differential perturbing operator is proposed for updating the two best habitats, improving the global search ability in the early search phase and the local search ability in the late phase. Third, an improved Laplace migration operator is formulated for updating the other habitats to improve the search ability and the operability. Finally, measures such as example learning, removal of the mutation operation, and greedy selection are adopted, mostly to reduce the computational complexity of LxBBO. Extensive experimental results on the complex functions from the CEC-2013 test set show that ILxBBO performs better than LxBBO and quite a few state-of-the-art algorithms. The results on Quadratic Assignment Problems (QAPs) also show that ILxBBO is more competitive than LxBBO, Improved Particle Swarm Optimization (IPSO), and Improved Firefly Algorithm (IFA).


