ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

10.1101/406017 ◽

2018 ◽

Cited By ~ 1

Author(s):

Vitor C. Piro ◽

Temesgen H. Dadi ◽

Enrico Seiler ◽

Knut Reinert ◽

Bernhard Y. Renard

Keyword(s):

Efficient Method ◽

State Of The Art ◽

Hierarchical Classification ◽

Bloom Filters ◽

Sequence Classification ◽

High Complexity ◽

Genome Sequences ◽

Complete Genomes ◽

Reference Sequences ◽

Classification Tool

Abstract Motivation The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. Results Motivated by those limitations, we created ganon, a k-mer based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires less than 55 minutes to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-Score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

Bioinformatics ◽

10.1093/bioinformatics/btaa458 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i12-i20 ◽

Cited By ~ 2

Author(s):

Vitor C Piro ◽

Temesgen H Dadi ◽

Enrico Seiler ◽

Knut Reinert ◽

Bernhard Y Renard

Keyword(s):

State Of The Art ◽

Hierarchical Classification ◽

Bloom Filters ◽

Supplementary Information ◽

Sequence Classification ◽

Supplementary Data ◽

High Complexity ◽

Genome Sequences ◽

Reference Sequences ◽

Classification Tool

Abstract Motivation The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. Results Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability and implementation The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. Supplementary information Supplementary data are available at Bioinformatics online.
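
The abstract names Interleaved Bloom Filters and a k-mer counting/filtering scheme but does not spell out how they work. The following is a minimal Python sketch of the general idea, not ganon's implementation: one Bloom filter per reference bin, stored interleaved so that a single k-mer lookup answers for all bins at once. The class and function names, the hash choice (BLAKE2b with a seed as salt), and the `min_fraction` threshold are assumptions made for illustration.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: one logical Bloom filter per bin.

    rows[p] is an integer bitmask whose bit b is set when bin b has
    position p set, so one row lookup covers every bin at once.
    """

    def __init__(self, num_bins, bits_per_bin, num_hashes=3):
        self.num_bins = num_bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # Hypothetical hash family: BLAKE2b salted with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def insert(self, kmer, bin_index):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def query(self, kmer):
        """Return a bitmask of the bins that may contain the k-mer."""
        mask = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return mask


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def classify(ibf, read, k, min_fraction):
    """Count k-mer hits per bin and keep bins above an assumed threshold."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return []
    counts = [0] * ibf.num_bins
    for kmer in read_kmers:
        hits = ibf.query(kmer)
        for b in range(ibf.num_bins):
            if (hits >> b) & 1:
                counts[b] += 1
    threshold = min_fraction * len(read_kmers)
    return [b for b, c in enumerate(counts) if c >= threshold]


# Usage: two reference bins, then classify a read drawn from bin 0.
references = ["ACGTACGTTAGCCGATAGGCTA", "TTTTGGGGCCCCAAAATTGGCC"]
ibf = InterleavedBloomFilter(num_bins=2, bits_per_bin=4096)
for bin_index, ref in enumerate(references):
    for kmer in kmers(ref, 5):
        ibf.insert(kmer, bin_index)
matches = classify(ibf, "ACGTTAGCCGATAGG", k=5, min_fraction=0.8)
```

Bloom filters give no false negatives, so a read copied from a reference always matches that reference's bin. False positives are possible and shrink as `bits_per_bin` grows.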

PlasClass improves plasmid sequence classification

10.1101/783571 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Pellow ◽

Itzik Mizrahi ◽

Ron Shamir

Keyword(s):

State Of The Art ◽

Bacterial Genome ◽

Unknown Origin ◽

The State ◽

Sequence Classification ◽

Genome Sequences ◽

Plasmid Sequence ◽

Link Type ◽

Classification Tool ◽

Metagenomic Assembly

Abstract Background Many bacteria contain plasmids, but separating between contigs that originate on the plasmid and those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. Results We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers trained on the most current set of known plasmid sequences for different sequence lengths. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, while using less time and memory. Conclusions PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available from: https://github.com/Shamir-Lab/PlasClass
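
The abstract describes classifiers trained for different sequence lengths without naming the features or the selection rule. As a rough illustration only, the Python sketch below assumes k-mer frequency features and a "closest trained length" rule. Both are assumptions, and the function names and example lengths are hypothetical rather than taken from PlasClass.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Normalized frequency of every DNA k-mer, in lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in all_kmers)
    return [counts[km] / total if total else 0.0 for km in all_kmers]


def pick_length_bin(seq_len, trained_lengths):
    """Choose the classifier whose training length is closest to seq_len."""
    return min(trained_lengths, key=lambda length: abs(length - seq_len))


# Usage: featurize a contig and decide which length-specific model to apply.
contig = "ACGTACGTACGT"
features = kmer_frequencies(contig, 2)
model_length = pick_length_bin(len(contig), [10, 100, 1000])
```

A trained classifier for the selected length would then score `features`. That step is left out here because the abstract does not specify it.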

Complete Genome Sequences of Six Novel Macaca mulatta Papillomavirus Types Isolated from Genital Sites of Rhesus Monkeys in Hong Kong SAR, China

Microbiology Resource Announcements ◽

10.1128/mra.01414-18 ◽

2018 ◽

Vol 7 (22) ◽

Cited By ~ 1

Author(s):

Teng Long ◽

Po Yee Wong ◽

Wendy C. S. Ho ◽

Robert D. Burk ◽

Paul K. S. Chan ◽

...

Keyword(s):

Hong Kong ◽

Macaca Mulatta ◽

Complete Genome ◽

Rhesus Monkeys ◽

Genome Sequences ◽

Content Type ◽

Complete Genomes ◽

Hong Kong Sar

The complete genomes of six Macaca mulatta papillomavirus types isolated from genital sites of rhesus monkeys were characterized, and less than 72% identity with the complete L1 genes of known papillomaviruses was found. Macaca mulatta papillomavirus type 2 (MmPV2), MmPV3, and MmPV6 cluster into the genus Alphapapillomavirus, and MmPV4, MmPV5, and MmPV7 cluster into the genus Gammapapillomavirus.

Discovery of polynomial equations for regression

Advances in Methodology and Statistics ◽

10.51936/uogl8142 ◽

2004 ◽

Vol 1 (1) ◽

pp. 131-142

Author(s):

Ljupčo Todorovski ◽

Sašo Džeroski ◽

Peter Ljubič

Keyword(s):

Efficient Method ◽

Regression Models ◽

Predictive Accuracy ◽

State Of The Art ◽

Numerical Data ◽

Predictive Performance ◽

Polynomial Equations ◽

Regression Methods ◽

Piecewise Regression ◽

Standard Regression

Both equation discovery and regression methods aim at inducing models of numerical data. While the equation discovery methods are usually evaluated in terms of comprehensibility of the induced model, the emphasis of the regression methods evaluation is on their predictive accuracy. In this paper, we present Ciper, an efficient method for discovery of polynomial equations and empirically evaluate its predictive performance on standard regression tasks. The evaluation shows that polynomials compare favorably to linear and piecewise regression models, induced by the existing state-of-the-art regression methods, in terms of degree of fit and complexity.

Sequences of Seven Complete Genomes of Human Parvovirus B19

Microbiology Resource Announcements ◽

10.1128/mra.00885-18 ◽

2018 ◽

Vol 7 (11) ◽

Author(s):

Yuhuan Qiu ◽

Zehui Zhao ◽

Jianming Qiu

Keyword(s):

United States ◽

Human Plasma ◽

Parvovirus B19 ◽

The United States ◽

Human Parvovirus B19 ◽

Genome Sequences ◽

Complete Genomes ◽

Inverted Terminal Repeats ◽

Terminal Repeats

We are reporting the sequences of seven complete genomes of parvovirus B19, which were extracted from human plasma specimens collected in the United States. The seven B19 genome sequences, which are 5,596 nucleotides long and include the inverted terminal repeats (ITRs), share an identity of 96.73%.

Encoding Hierarchical Classification Codes for Privacy-Preserving Record Linkage Using Bloom Filters

Machine Learning and Knowledge Discovery in Databases - Communications in Computer and Information Science ◽

10.1007/978-3-030-43887-6_12 ◽

2020 ◽

pp. 142-156

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Hierarchical Classification ◽

Privacy Preserving ◽

Bloom Filters

De Novo Mutational Signature Discovery in Tumor Genomes using SparseSignatures

10.1101/384834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Avantika Lal ◽

Keli Liu ◽

Robert Tibshirani ◽

Arend Sidow ◽

Daniele Ramazzotti

Keyword(s):

Cross Validation ◽

De Novo ◽

State Of The Art ◽

Point Mutations ◽

Simulated Data ◽

Large Datasets ◽

Genome Sequences ◽

Mutational Signatures ◽

Mutational Signature ◽

Current State

Abstract Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using standard metrics. We then apply SparseSignatures to whole genome sequences of 147 tumors from pancreatic cancer, discovering 8 signatures in addition to the background.

Embarrassingly Parallel Search in Constraint Programming

Journal of Artificial Intelligence Research ◽

10.1613/jair.5247 ◽

2016 ◽

Vol 57 ◽

pp. 421-464 ◽

Cited By ~ 13

Author(s):

Arnaud Malapert ◽

Jean-Charles Régin ◽

Mohamed Rezgui

Keyword(s):

Cloud Computing ◽

Efficient Method ◽

Data Centers ◽

Optimization Problems ◽

State Of The Art ◽

Parallel Search ◽

Simple Method ◽

Constraint Solver ◽

Average Performance ◽

Depth Analysis

We introduce an Embarrassingly Parallel Search (EPS) method for solving constraint problems in parallel, and we show that this method matches or even outperforms state-of-the-art algorithms on a number of problems using various computing infrastructures. EPS is a simple method in which a master decomposes the problem into many disjoint subproblems which are then solved independently by workers. Our approach has three advantages: it is an efficient method; it involves almost no communication or synchronization between workers; and its implementation is easy because the master and the workers rely on an underlying constraint solver without requiring any modification to it. This paper describes the method and its applications to various constraint problems (satisfaction, enumeration, optimization). We show that our method can be adapted to different underlying solvers (Gecode, Choco2, OR-tools) on different computing infrastructures (multi-core, data centers, cloud computing). The experiments cover unsatisfiable, enumeration and optimization problems, but do not cover first solution search because it makes the results hard to analyze. The same variability can be observed for optimization problems, but to a lesser extent because the optimality proof is required. EPS offers good average performance, and matches or outperforms other available parallel implementations of Gecode as well as some solver portfolios. Moreover, we perform an in-depth analysis of the various factors that make this approach efficient as well as the anomalies that can occur. Last, we show that the decomposition is a key component for efficiency and load balancing.
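
The master/worker decomposition described in this abstract can be illustrated on a small enumeration problem. The Python sketch below is not the authors' implementation. It counts N-queens solutions by having a "master" fix the first `depth` rows to create disjoint subproblems, which "workers" then solve independently with a plain depth-first search standing in for the underlying constraint solver. The function names, the `depth` parameter, and the thread pool are assumptions; a process pool or separate machines would be the realistic choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def safe(partial, col):
    """True if a queen at column `col` in the next row attacks no earlier queen."""
    row = len(partial)
    return all(
        c != col and abs(c - col) != row - r for r, c in enumerate(partial)
    )


def decompose(n, depth):
    """Master: enumerate consistent assignments of the first `depth` rows.

    Each returned prefix is a subproblem; distinct prefixes are disjoint.
    """
    subproblems = [()]
    for _ in range(depth):
        subproblems = [
            p + (c,) for p in subproblems for c in range(n) if safe(p, c)
        ]
    return subproblems


def count_solutions(n, partial):
    """Worker: count all completions of one prefix by depth-first search."""
    if len(partial) == n:
        return 1
    return sum(
        count_solutions(n, partial + (c,))
        for c in range(n)
        if safe(partial, c)
    )


def eps_count(n, depth=2, workers=4):
    """Decompose, solve subproblems independently, and sum the results."""
    subproblems = decompose(n, depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: count_solutions(n, p), subproblems))


# Usage: count all solutions of the 8-queens problem.
total = eps_count(8)
```

The workers never communicate, and the only shared step is the final sum. Generating many more subproblems than workers helps even out the load, which is the role the abstract assigns to the decomposition.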

Whole-Genome Sequences of Zika Virus FLR Strains after Passage in Vero or C6/36 Cells

Genome Announcements ◽

10.1128/genomea.01528-17 ◽

2018 ◽

Vol 6 (4) ◽

Cited By ~ 1

Author(s):

Lindsey A. Moser ◽

Lauren M. Oldfield ◽

Nadia Fedorova ◽

Vinita Puri ◽

Susmita Shrivastava ◽

...

Keyword(s):

Vero Cell ◽

Cell Lines ◽

Virus Strain ◽

Zika Virus ◽

Whole Genome ◽

Genome Sequences ◽

Complete Genomes ◽

Vero Cell Lines

ABSTRACT We report 26 complete genomes of Zika virus (ZIKV) isolated after passaging the Zika virus strain FLR in mosquito (C6/36) and mammalian (Vero) cell lines. The consensus ZIKV genomes we recovered show greater than 99% nucleotide identity with each other and with the FLR strain used as input.

Improved Laplacian Biogeography-Based Optimization Algorithm and Its Application to QAP

Complexity ◽

10.1155/2020/7824785 ◽

2020 ◽

Vol 2020 ◽

pp. 1-19

Author(s):

Xinming Zhang ◽

Doudou Wang ◽

Haiyan Chen ◽

Wentao Mao ◽

Shangwang Liu ◽

...

Keyword(s):

Firefly Algorithm ◽

State Of The Art ◽

Poor Performance ◽

Quadratic Assignment ◽

High Complexity ◽

Computation Complexity ◽

Swarm Optimization ◽

Improved Firefly Algorithm ◽

Complex Functions ◽

Quadratic Assignment Problems

Laplacian Biogeography-Based Optimization (LxBBO) is a BBO variant that greatly improves BBO’s performance. When solving some complex problems, however, it has drawbacks such as poor performance, weak operability, and high complexity, so an improved LxBBO (ILxBBO) is proposed. First, a two-global-best guiding operator is created to guide the worst habitat, mainly to enhance the exploitation of LxBBO. Second, a dynamic two-differential perturbing operator is proposed for updating the two best habitats, improving the global search ability in the early search phase and the local search ability in the late search phase. Third, an improved Laplace migration operator is formulated for updating the other habitats to improve the search ability and the operability. Finally, measures such as example learning, removal of the mutation operation, and greedy selection are adopted, mostly to reduce the computational complexity of LxBBO. Extensive experimental results on the complex functions from the CEC-2013 test set show that ILxBBO performs better than LxBBO and quite a few state-of-the-art algorithms. Also, the results on Quadratic Assignment Problems (QAPs) show that ILxBBO is more competitive than LxBBO, Improved Particle Swarm Optimization (IPSO), and Improved Firefly Algorithm (IFA).
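
The abstract mentions Quadratic Assignment Problems and greedy selection without defining either. The Python sketch below shows the standard QAP objective (sum of flow times distance over all facility pairs under a permutation) and a plain greedy acceptance rule. It does not reproduce ILxBBO's operators; the function names and the small matrices are made up for illustration.

```python
def qap_cost(flow, dist, perm):
    """Standard QAP objective: facility i is placed at location perm[i]."""
    n = len(perm)
    return sum(
        flow[i][j] * dist[perm[i]][perm[j]]
        for i in range(n)
        for j in range(n)
    )


def greedy_select(flow, dist, current, candidate):
    """Keep the candidate only if it strictly lowers the cost."""
    if qap_cost(flow, dist, candidate) < qap_cost(flow, dist, current):
        return candidate
    return current


# Usage with hypothetical 3x3 flow and distance matrices.
flow = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
dist = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
current = [0, 1, 2]
better = greedy_select(flow, dist, current, [1, 0, 2])
unchanged = greedy_select(flow, dist, current, [2, 1, 0])
```

With greedy selection a habitat never gets worse from one generation to the next, which is one common way to drop a separate mutation step.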