PlasClass improves plasmid sequence classification

Mapping Intimacies ◽

10.1101/783571 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Pellow ◽

Itzik Mizrahi ◽

Ron Shamir

Keyword(s):

State Of The Art ◽

Bacterial Genome ◽

Unknown Origin ◽

The State ◽

Sequence Classification ◽

Genome Sequences ◽

Plasmid Sequence ◽

Link Type ◽

Classification Tool ◽

Metagenomic Assembly

Abstract Background Many bacteria contain plasmids, but distinguishing contigs that originate from a plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. Results We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers trained on the most current set of known plasmid sequences for different sequence lengths. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, while using less time and memory. Conclusions PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available from: https://github.com/Shamir-Lab/PlasClass
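The abstract above describes classifiers trained separately for different sequence lengths. As a rough illustration only, the sketch below shows one way such a setup could prepare its inputs: it turns a sequence into normalized k-mer frequencies and picks the length bin closest to the sequence length. The choice of k-mer frequency features, the bin values in `LENGTH_BINS`, and the function names are assumptions made for this example; they are not taken from the abstract or from the PlasClass code.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized counts of every DNA k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with non-ACGT symbols are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


# Hypothetical representative lengths, one per length-specific classifier.
LENGTH_BINS = [1_000, 10_000, 100_000, 500_000]


def pick_length_bin(seq_len, bins=LENGTH_BINS):
    """Choose the bin whose representative length is closest to seq_len."""
    return min(bins, key=lambda b: abs(b - seq_len))
```

In such a design, the feature vector from `kmer_frequencies` would be passed to whichever classifier was trained on the bin returned by `pick_length_bin`, so that short contigs are scored by a model that saw short training fragments.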

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

Bioinformatics ◽

10.1093/bioinformatics/btaa458 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i12-i20 ◽

Cited By ~ 2

Author(s):

Vitor C Piro ◽

Temesgen H Dadi ◽

Enrico Seiler ◽

Knut Reinert ◽

Bernhard Y Renard

Keyword(s):

State Of The Art ◽

Hierarchical Classification ◽

Bloom Filters ◽

Supplementary Information ◽

Sequence Classification ◽

Supplementary Data ◽

High Complexity ◽

Genome Sequences ◽

Reference Sequences ◽

Classification Tool

Abstract Motivation The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. Results Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires <55 min to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability and implementation The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon. Supplementary information Supplementary data are available at Bioinformatics online.
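To make the idea of k-mer-based classification with Bloom filters more concrete, here is a minimal sketch. It builds one plain Bloom filter per reference group and assigns a read to the group that shares the most k-mers with it. This is a simplification under stated assumptions: ganon uses Interleaved Bloom Filters and its own clustering and filtering scheme, neither of which is reproduced here, and the class name, parameters (`n_bits`, `n_hashes`, `min_fraction`), and toy sequences are all hypothetical.

```python
import hashlib


class BloomFilter:
    """A plain (non-interleaved) Bloom filter over strings."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=5):
    """Build one Bloom filter per reference group (name -> sequence)."""
    index = {}
    for name, seq in references.items():
        bf = BloomFilter()
        for km in kmers(seq, k):
            bf.add(km)
        index[name] = bf
    return index


def classify(read, index, k=5, min_fraction=0.8):
    """Return the group sharing the most k-mers with the read, or None."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None
    best_name, best_hits = None, 0
    for name, bf in index.items():
        hits = sum(km in bf for km in read_kmers)
        if hits > best_hits:
            best_name, best_hits = name, hits
    if best_hits / len(read_kmers) >= min_fraction:
        return best_name
    return None


references = {"taxA": "ACGTACGTTTGACCA", "taxB": "GGGCCCGGGAAATTT"}
index = build_index(references)
assignment = classify("ACGTTTGAC", index)
```

A Bloom filter can report false positives but never false negatives, which is why a counting threshold such as `min_fraction` is a natural companion to the membership test.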

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

10.1101/406017 ◽

2018 ◽

Cited By ~ 1

Author(s):

Vitor C. Piro ◽

Temesgen H. Dadi ◽

Enrico Seiler ◽

Knut Reinert ◽

Bernhard Y. Renard

Keyword(s):

Efficient Method ◽

State Of The Art ◽

Hierarchical Classification ◽

Bloom Filters ◽

Sequence Classification ◽

High Complexity ◽

Genome Sequences ◽

Complete Genomes ◽

Reference Sequences ◽

Classification Tool

Abstract Motivation The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use often outdated and almost never truly up-to-date indices. Results Motivated by those limitations, we created ganon, a k-mer-based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires less than 55 minutes to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity CAMI challenge dataset against complete genomes from RefSeq, ganon shows strongly increased precision with equal or better sensitivity compared with state-of-the-art tools. With the same dataset against the complete RefSeq, ganon improved the F1-score by 65% at the genus level. It supports taxonomy- and assembly-level classification, multiple indices and hierarchical classification. Availability The software is open-source and available at: https://gitlab.com/rki_bioinformatics/[email protected]

Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks—A Case Study on Genome Gap-Filling

Computers ◽

10.3390/computers9020037 ◽

2020 ◽

Vol 9 (2) ◽

pp. 37 ◽

Cited By ~ 1

Author(s):

Luca Cappelletti ◽

Tommaso Fontana ◽

Guido Walter Di Donato ◽

Lorenzo Di Tucci ◽

Elena Casiraghi ◽

...

Keyword(s):

Deep Learning ◽

Missing Data ◽

State Of The Art ◽

The State ◽

Complex Data ◽

Data Imputation ◽

Genome Sequences ◽

Missing Data Imputation ◽

The Past ◽

Learning Techniques

Missing data imputation has been a hot topic in the past decade, and many state-of-the-art works have been presented to propose novel, interesting solutions that have been applied in a variety of fields. In the past decade, the successful results achieved by deep learning techniques have opened the way to their application for solving difficult problems where human skill is not able to provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is high dimensional data belonging to datasets with huge cardinality and describing complex problems. Precisely, they often need critical parameters to be manually set or exploit complex architecture and/or training phases that make their computational load impracticable. In this paper, after clustering the state-of-the-art imputation techniques into three broad categories, we briefly review the most representative methods and then describe our data imputation proposals, which exploit deep learning techniques specifically designed to handle complex data. Comparative tests on genome sequences show that our deep learning imputers outperform the state-of-the-art KNN-imputation method when filling gaps in human genome sequences.

GraphAligner: rapid and versatile sequence-to-graph alignment

Genome Biology ◽

10.1186/s13059-020-02157-2 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

The State ◽

Graph Alignment ◽

Link Type ◽

Long Reads

Abstract Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner

Graphmap2 - splice-aware RNA-seq mapper for long reads

10.1101/720458 ◽

2019 ◽

Cited By ~ 1

Author(s):

Josip Marić ◽

Ivan Sović ◽

Krešimir Križanović ◽

Niranjan Nagarajan ◽

Mile Šikić

Keyword(s):

State Of The Art ◽

The State ◽

Rna Seq ◽

Link Type ◽

Pacific Biosciences ◽

Long Reads ◽

Oxford Nanopore

Abstract In this paper we present Graphmap2, a splice-aware mapper built on our previously developed DNA mapper Graphmap. Graphmap2 is tailored for long reads produced by Pacific Biosciences and Oxford Nanopore devices. It uses several newly developed algorithms which enable higher precision and recall of correctly detected transcripts and exon boundaries. We compared its performance with the state-of-the-art tools Minimap2 and Gmap. On both simulated and real datasets, Graphmap2 achieves higher mappability and correctly recognizes more exons and their ends. In addition, we present an analysis of the potential of splice-aware mappers and long reads for the identification of previously unknown isoforms and even genes. The Graphmap2 tool is publicly available at https://github.com/lbcb-sci/graphmap2.

Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression

10.1101/023705 ◽

2015 ◽

Author(s):

John Wiedenhoeft ◽

Eric Brugel ◽

Alexander Schliep

Keyword(s):

Hidden Markov Models ◽

Copy Number ◽

Markov Models ◽

State Of The Art ◽

Hidden Markov ◽

Copy Number Variants ◽

Computational Effort ◽

The State ◽

Haar Wavelets ◽

Link Type

Abstract By combining Haar wavelets with Bayesian Hidden Markov Models, we improve detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. At the same time, we achieve drastically reduced running times, as the method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://bioinformatics.rutgers.edu/Software/HaMMLET/. The web supplement is at http://bioinformatics.rutgers.edu/Supplements/HaMMLET/. Author Summary Identifying large-scale genome deletions and duplications, or copy number variants (CNV), accurately in populations or individual patients is a crucial step in indicating disease factors or diagnosing an individual patient's disease type. Hidden Markov Models (HMM) are a type of statistical model widely used for CNV detection, as well as other biological applications such as the analysis of gene expression time course data or the analysis of discrete-valued DNA and protein sequences. As with many statistical models, there are two fundamentally different inference approaches. In the frequentist framework, a single estimate of the model parameters would be used as a basis for subsequent inference, making the identification of CNV dependent on the quality of that estimate. This is an acute problem for HMM as methods for finding globally optimal parameters are not known. Alternatively, one can use a Bayesian approach and integrate over all possible parameter choices. While the latter is known to lead to significantly better results, its much larger computational effort (up to hundreds of times larger) has so far prevented wide adoption. Our proposed method addresses this by combining Haar wavelets and HMM. We greatly accelerate fully Bayesian HMMs, while simultaneously increasing convergence and thus the accuracy of the Gibbs sampler used for Bayesian computations, leading to substantial improvements over the state-of-the-art.
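The abstract above relies on grouping consecutive observations that are likely to share a copy number. The sketch below illustrates only the generic building block, under stated assumptions: an orthonormal Haar transform, hard thresholding of small detail coefficients, and reading off the resulting constant blocks. It uses a single fixed threshold, whereas the abstract describes recomputing blocks dynamically and adaptively during sampling; the function names and the toy signal are hypothetical and not taken from HaMMLET.

```python
import math


def haar_forward(signal):
    """Full orthonormal Haar transform of a list whose length is a power of two."""
    n = len(signal)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    coeffs = list(signal)
    length = n
    while length > 1:
        half = length // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        det = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:length] = avg + det  # averages first, then details
        length = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        out[:2 * length] = merged
        length *= 2
    return out


def compress_to_blocks(signal, threshold, tol=1e-9):
    """Zero small detail coefficients, then return (start, end, value) blocks."""
    coeffs = haar_forward(signal)
    # Index 0 holds the overall average and is always kept.
    kept = [coeffs[0]] + [c if abs(c) >= threshold else 0.0 for c in coeffs[1:]]
    smooth = haar_inverse(kept)
    blocks, start = [], 0
    for i in range(1, len(smooth) + 1):
        if i == len(smooth) or abs(smooth[i] - smooth[start]) > tol:
            blocks.append((start, i, smooth[start]))
            start = i
    return blocks


# Toy log-ratio-like signal with one jump between positions 3 and 4.
observations = [2.0, 2.1, 1.9, 2.0, 4.0, 4.1, 3.9, 4.0]
blocks = compress_to_blocks(observations, threshold=0.5)
```

Each block can then be treated as a single unit by downstream inference, which is where the reduction in computational effort would come from in a scheme of this kind.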

HLA-MA: Simple yet powerful matching of samples using HLA typing results

10.1101/066548 ◽

2016 ◽

Author(s):

Clemens Messerschmidt ◽

Manuel Holtgrewe ◽

Dieter Beule

Keyword(s):

Microsatellite Instability ◽

State Of The Art ◽

The State ◽

Hla Typing ◽

Whole Genome ◽

Consistency Checking ◽

Rna Seq ◽

Simple Method ◽

Link Type ◽

Typing Result

Abstract Summary We propose the simple method HLA-MA for consistency checking in pipelines operating on human HTS data. The method is based on the HLA typing result of the state-of-the-art method OptiType. Provided that there is sufficient coverage of the HLA loci, comparing HLA types allows for simple, fast, and robust matching of samples from whole genome, exome, and RNA-seq data. This approach is reliable for sample re-identification even for samples with high mutational loads, e.g., caused by microsatellite instability or POLE1 defects. Availability and Implementation The software is implemented in Python 3 and freely available under the MIT license at https://github.com/bihealth/hlama and via [email protected]

Practical Picture Processing

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100051700 ◽

1974 ◽

Vol 32 ◽

pp. 338-339

Author(s):

T. A. Welton

Keyword(s):

Radiation Damage ◽

Coherence Length ◽

Spatial Information ◽

State Of The Art ◽

Coherent Radiation ◽

The State ◽

Energy Spread ◽

Electron Micrograph ◽

Picture Processing ◽

Molecular Skeleton

Various authors have emphasized the spatial information resident in an electron micrograph taken with adequately coherent radiation. In view of the completion of at least one such instrument, this opportunity is taken to summarize the state of the art of processing such micrographs. We use the usual symbols for the aberration coefficients, and supplement these with £ and 6 for the transverse coherence length and the fractional energy spread respectively. We also assume a weak, biologically interesting sample, with principal interest lying in the molecular skeleton remaining after obvious hydrogen loss and other radiation damage has occurred.

Hypothesis-Evidence Coordination: The State of the Art

Contemporary Psychology ◽

10.1037/000988 ◽

2003 ◽

Vol 48 (6) ◽

pp. 826-829 ◽

Cited By ~ 1

Author(s):

Eric Amsel

Keyword(s):

State Of The Art ◽

The State

The State of the Art

Contemporary Psychology ◽

10.1037/009537 ◽

1968 ◽

Vol 13 (9) ◽

pp. 479-480

Author(s):

LEWIS PETRINOVICH

Keyword(s):

State Of The Art ◽

The State