Algorithms for Molecular Biology

A new 1.375-approximation algorithm for sorting by transpositions

Algorithms for Molecular Biology ◽

10.1186/s13015-022-00205-z ◽

2022 ◽

Vol 17 (1) ◽

Author(s):

Luiz Augusto G. Silva ◽

Luis Antonio B. Kowada ◽

Noraí Romeu Rocco ◽

Maria Emília M. T. Walter

Keyword(s):

Lower Bound ◽

Approximation Algorithm ◽

Upper Bound ◽

Best Approximation ◽

Time Complexity ◽

Classical Problem ◽

Algebraic Approach ◽

Approximation Ratio ◽

The Best Approximation ◽

Better Than

Abstract. Background: Sorting by transpositions (SBT) is a classical problem in genome rearrangements. In 2012, SBT was proven to be $$\mathcal{NP}$$-hard, and the best approximation algorithm, with a 1.375 ratio, was proposed in 2006 by Elias and Hartman (the EH algorithm). Their algorithm employs simplification, a technique that transforms an input permutation $$\pi$$ into a simple permutation $$\hat{\pi}$$, presumably easier to handle. The permutation $$\hat{\pi}$$ is obtained by inserting new symbols into $$\pi$$ in such a way that the lower bound on the transposition distance of $$\pi$$ is preserved in $$\hat{\pi}$$. Simplification is guaranteed to preserve the lower bound, not the transposition distance. A sequence of operations sorting $$\hat{\pi}$$ can be mimicked to sort $$\pi$$. Results and conclusions: First, using an algebraic approach, we propose a new upper bound for the transposition distance, which holds for all $$S_n$$. Next, we address a problem identified in the EH algorithm: depending on how the input permutation is simplified, it may require one transposition more than the 1.375-approximation ratio allows. We propose a new approximation algorithm for SBT that ensures the 1.375-approximation ratio for all $$S_n$$. We implemented our algorithm and EH's. While implementing the EH algorithm, we identified two other issues that needed to be fixed. We tested both algorithms against all permutations of size n, $$2 \le n \le 12$$. The results show that the EH algorithm exceeds the approximation ratio of 1.375 for permutations of size greater than 7. We also compare the percentage of computed distances that equal the transposition distance, for both implemented algorithms, with that of other algorithms available in the literature. Finally, we investigate the performance of both implementations on longer permutations of maximum length 500. From the experiments, we conclude that the maximum and average distances computed by our algorithm are slightly better than those computed by the EH algorithm, and that the running times of both algorithms are similar, despite the higher time complexity of our algorithm.
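
To make the notion of transposition distance concrete, the following minimal sketch computes it exactly by breadth-first search. It is not the EH algorithm or the authors' algorithm; it is only a brute-force reference that is feasible for very small n. The function names `transpositions` and `transposition_distance` are illustrative choices, not taken from the paper.

```python
from collections import deque
from itertools import combinations


def transpositions(perm):
    """Yield every permutation reachable from `perm` by one transposition.

    A transposition picks cut points i < j < k and swaps the adjacent,
    non-empty blocks perm[i:j] and perm[j:k].
    """
    n = len(perm)
    for i, j, k in combinations(range(n + 1), 3):
        yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def transposition_distance(perm):
    """Exact transposition distance to the identity, by breadth-first search.

    The state space has n! permutations, so this is only practical for small n.
    """
    start = tuple(perm)
    target = tuple(sorted(start))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for nxt in transpositions(current):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    raise ValueError("unreachable: every permutation can be sorted")
```

An approximation algorithm with ratio 1.375 must return, for every input, a sorting sequence no longer than 1.375 times the value this exhaustive search would report.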

An optimized FM-index library for nucleotide and amino acid search

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00204-6 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Tim Anderson ◽

Travis J. Wheeler

Keyword(s):

Amino Acid ◽

Open Source ◽

Pattern Matching ◽

Suffix Array ◽

Amino Acid Sequences ◽

Lookup Table ◽

Efficient Computation ◽

Biological Sequence ◽

Run Time ◽

High Level

Abstract. Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementing the FM-index is reasonably complicated, so increased adoption will be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ~2–4x faster than SeqAn3 for nucleotide search and ~2–6x faster for amino acid search; it is also ~4x faster, with a similar memory footprint, when storing the suffix array in on-disk SSD storage. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
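
The count and locate queries mentioned above rest on FM-index backward search, which can be sketched in a few lines. This is a deliberately naive illustration and not the AWFM-index API: it stores a full, uncompressed occurrence table instead of strided bit vectors, builds the suffix array by plain sorting, and all function names are hypothetical. It assumes the sentinel `$` does not occur in the text and sorts before every other symbol.

```python
def build_fm_index(text):
    """Build a naive FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel: assumed absent from the input and smallest symbol
    # Suffix array by direct sorting: fine for a demo, too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (text[-1] wraps to "$").
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c] = number of symbols in the text strictly smaller than c.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (a == ch)
    return sa, c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix-array interval [lo, hi) matching `pattern`."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # one step per pattern symbol
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, index):
    """Number of occurrences of `pattern` in the indexed text."""
    sa, c_table, occ = index
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return hi - lo


def locate(pattern, index):
    """Sorted start positions of `pattern` in the indexed text."""
    sa, c_table, occ = index
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return sorted(sa[lo:hi])
```

The loop in `backward_search` runs once per pattern symbol and never scans the text, which is why search time does not depend on the length of the database text. The occurrence lookups `occ[ch][lo]` and `occ[ch][hi]` are the step that a production library accelerates.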

An improved approximation algorithm for the reversal and transposition distance considering gene order and intergenic sizes

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00203-7 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Klairton L. Brito ◽

Andre R. Oliveira ◽

Alexsandro O. Alexandrino ◽

Ulisses Dias ◽

Zanoni Dias

Keyword(s):

Approximation Algorithm ◽

Genetic Information ◽

Genome Rearrangement ◽

Simulated Data ◽

Genetic Changes ◽

Greedy Strategy ◽

Approximation Factor ◽

Transposition Event ◽

A Genome ◽

Intergenic Regions

Abstract. Background: In the comparative genomics field, one of the goals is to estimate a sequence of genetic changes capable of transforming one genome into another. Genome rearrangement events are mutations that can alter the genetic content or the arrangement of elements in the genome. Reversal and transposition are two of the most studied genome rearrangement events. A reversal inverts a segment of a genome, while a transposition swaps two consecutive segments. Initial studies in the area considered only the order of the genes. Recent works have incorporated other genetic information into the model, in particular the sizes of intergenic regions, which are the structures between each pair of consecutive genes and at the extremities of a linear genome. Results and conclusions: In this work, we investigate the sorting by intergenic reversals and transpositions problem on genomes sharing the same set of genes, considering the cases where the orientation of genes is known and unknown. We also explore a variant of the problem that generalizes the transposition event. As a result, we present an approximation algorithm that guarantees an approximation factor of 4 for both cases considering the reversal and transposition (classic definition) events, an improvement on the 4.5-approximation previously known for the scenario where the orientation of the genes is unknown. We also present a 3-approximation algorithm that incorporates the generalized transposition event, and we propose a greedy strategy to improve the performance of the algorithms. Practical tests on simulated data indicate that the algorithms, in both cases, tend to perform better than the best-known algorithms for the problem. Lastly, we conducted experiments using real genomes to demonstrate the applicability of the algorithms.
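
The sketch below shows how a reversal might act on a genome that carries intergenic sizes. The abstract does not spell out the model, so the representation is an assumption: `genes` is a list of n gene labels, `inter` is a list of n + 1 region sizes (`inter[t]` precedes `genes[t]`, `inter[n]` follows the last gene), and the reversal cuts the two flanking regions after `x` and `y` nucleotides. The function name and parameters are hypothetical.

```python
def intergenic_reversal(genes, inter, i, j, x, y, signed=False):
    """Reverse genes[i..j] (inclusive), cutting inter[i] at x and inter[j+1] at y.

    With signed=True the gene labels are integers whose sign encodes
    orientation, and the reversal also flips those signs.
    """
    if not 0 <= i <= j < len(genes):
        raise ValueError("need 0 <= i <= j < number of genes")
    if not (0 <= x <= inter[i] and 0 <= y <= inter[j + 1]):
        raise ValueError("cut points must lie inside the flanking regions")
    segment = genes[i:j + 1][::-1]
    if signed:
        segment = [-g for g in segment]
    new_genes = genes[:i] + segment + genes[j + 1:]
    new_inter = (
        inter[:i]
        + [x + y]                                # left piece meets the cut-off right piece
        + inter[i + 1:j + 1][::-1]               # inner regions travel with the segment
        + [(inter[i] - x) + (inter[j + 1] - y)]  # the two remainders join on the right
        + inter[j + 2:]
    )
    return new_genes, new_inter
```

Under this model a reversal only redistributes nucleotides between the two regions it cuts, so the total intergenic length is unchanged. A sorting sequence must therefore match the target's intergenic sizes as well as its gene order.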

A simpler linear-time algorithm for the common refinement of rooted phylogenetic trees on a common leaf set

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00202-8 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

David Schaller ◽

Marc Hellmuth ◽

Peter F. Stadler

Keyword(s):

Phylogenetic Trees ◽

Linear Time ◽

Time Algorithm ◽

Input Tree ◽

Linear Time Algorithm ◽

Optimal Algorithms ◽

Asymptotically Optimal ◽

The Common ◽

Mathematical Phylogenetics ◽

Special Case

Abstract. Background: The supertree problem, i.e., the task of finding a common refinement of a set of rooted trees, is an important topic in mathematical phylogenetics. The special case of a common leaf set L is known to be solvable in linear time. Existing approaches refine one input tree using information from the others and then test whether the results are isomorphic. Results: An O(k|L|) algorithm for constructing the common refinement T of k input trees with a common leaf set L is proposed that explicitly computes the parent function of T in a bottom-up approach. Conclusion: The algorithm is simpler to implement than other asymptotically optimal algorithms for the problem and outperforms the alternatives in empirical comparisons. Availability: An implementation in Python is freely available at https://github.com/david-schaller/tralda.
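
For intuition about what a common refinement is, the following sketch tests for one using the cluster (hierarchy) view of rooted trees. A common refinement exists exactly when the union of all input clusters contains no two sets that overlap without nesting. This is a naive check, quadratic in the number of clusters, and not the linear-time algorithm described in the abstract. Trees are assumed to be given as nested tuples with hashable leaf labels, and the function names are illustrative.

```python
from itertools import combinations


def clusters(tree):
    """Return the set of clusters (leaf sets below each vertex) of a tree."""
    found = set()

    def visit(node):
        if isinstance(node, tuple):  # inner vertex: union of child clusters
            leaves = frozenset().union(*(visit(child) for child in node))
        else:                        # leaf
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def common_refinement_clusters(trees):
    """Return the clusters of the common refinement, or None if none exists."""
    union = set().union(*(clusters(t) for t in trees))
    for a, b in combinations(union, 2):
        if a & b and not (a <= b or b <= a):
            return None  # two clusters overlap without being nested
    return union
```

For example, `(("a", "b"), "c", "d")` and `("a", "b", ("c", "d"))` admit a common refinement containing both `{a, b}` and `{c, d}`, whereas replacing the second tree by `(("a", "c"), "b", "d")` makes the clusters `{a, b}` and `{a, c}` conflict.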

Testing the agreement of trees with internal labels

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00201-9 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

David Fernández-Baca ◽

Lei Liu

Keyword(s):

Phylogenetic Trees ◽

Worst Case ◽

Labeled Trees ◽

Number Of Children ◽

Phylogenetic Studies ◽

Overlapping Sets ◽

Special Case ◽

Labeled Tree ◽

Agreement Problem ◽

Better Than

Abstract. Background: A semi-labeled tree is a tree where all leaves as well as, possibly, some internal nodes are labeled with taxa. Semi-labeled trees encompass ordinary phylogenetic trees and taxonomies. Suppose we are given a collection $$\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$$ of semi-labeled trees, called input trees, over partially overlapping sets of taxa. The agreement problem asks whether there exists a tree $$\mathcal{T}$$, called an agreement tree, whose taxon set is the union of the taxon sets of the input trees such that the restriction of $$\mathcal{T}$$ to the taxon set of $$\mathcal{T}_i$$ is isomorphic to $$\mathcal{T}_i$$, for each $$i \in \{1, 2, \ldots, k\}$$. The agreement problem is a special case of the supertree problem, the problem of synthesizing a collection of phylogenetic trees with partially overlapping taxon sets into a single supertree that represents the information in the input trees. An obstacle to building large phylogenetic supertrees is the limited amount of taxonomic overlap among the phylogenetic studies from which the input trees are obtained. Incorporating taxonomies into supertree analyses can alleviate this issue. Results: We give an $$\mathcal{O}(nk(\sum_{i \in [k]} d_i + \log^2(nk)))$$ algorithm for the agreement problem, where n is the total number of distinct taxa in $$\mathcal{P}$$, k is the number of trees in $$\mathcal{P}$$, and $$d_i$$ is the maximum number of children of a node in $$\mathcal{T}_i$$. Conclusion: Our algorithm can aid in integrating taxonomies into supertree analyses. Our computational experience with the algorithm suggests that its performance in practice is much better than its worst-case bound indicates.

Approximation algorithm for rearrangement distances considering repeated genes and intergenic regions

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00200-w ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Gabriel Siqueira ◽

Alexsandro Oliveira Alexandrino ◽

Andre Rodrigues Oliveira ◽

Zanoni Dias

Keyword(s):

Approximation Algorithm ◽

A Genome ◽

Intergenic Regions ◽

Genome Representation ◽

Repeated Genes

Abstract. The rearrangement distance is a method to compare genomes of different species. Such distance is the number of rearrangement events necessary to transform one genome into another. Two commonly studied events are the transposition, which exchanges two consecutive blocks of the genome, and the reversal, which reverts a block of the genome. When dealing with such problems, seminal works represented genomes as sequences of genes without repetition. More realistic models started to consider gene repetition or the presence of intergenic regions, sequences of nucleotides between genes and in the extremities of the genome. This work explores the transposition and reversal events applied in a genome representation considering both gene repetition and intergenic regions. We define two problems called Minimum Common Intergenic String Partition and Reverse Minimum Common Intergenic String Partition. Using a relation with these two problems, we show a $$\Theta(k)$$-approximation for the Intergenic Transposition Distance, the Intergenic Reversal Distance, and the Intergenic Reversal and Transposition Distance problems, where k is the maximum number of copies of a gene in the genomes. Our practical experiments on simulated genomes show that the use of partitions improves the estimates for the distances.

DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00199-0 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Fabian Hausmann ◽

Stefan Kurtz

Keyword(s):

Machine Translation ◽

De Novo ◽

Repetitive Sequences ◽

Software Tool ◽

Repetitive Elements ◽

Training Data ◽

Implementation Framework ◽

Neural Machine Translation ◽

Species Specific

Abstract. Background: Repetitive elements contribute a large part of eukaryotic genomes. For example, about 40 to 50% of human, mouse and rat genomes are repetitive. Identifying and classifying repeats is therefore an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408–4410, 2019) developed a novel software tool to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements. Results: We have further developed the methods of Li's tool and engineered a new software tool, DeepGRP. DeepGRP combines the basic concepts of Li (Bioinformatics 35:4408–4410, 2019) with a current technique developed for neural machine translation, the attention mechanism, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP, when compared to Li's tool. DeepGRP predicts two additional classes of repeats (compared to Li's tool) and is able to transfer repeat annotations, using RepeatMasker-based training data, to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats annotated in the Dfam database, but not annotated by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than Li's tool, approx. 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database. Conclusions: By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement of the quality of the predictions compared to Li's tool. Improved running times are obtained by employing TensorFlow as the implementation framework and by using GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
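
The evaluation above is reported in terms of the Matthews correlation coefficient (MCC). As a reminder of what that metric measures, here is a minimal sketch of the binary form, assuming per-nucleotide labels of 1 (repeat) and 0 (non-repeat). The paper's evaluation covers several repeat classes and may use a multi-class variant, so this is an illustration of the metric only, and the function name `mcc` is hypothetical.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Binary Matthews correlation coefficient for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction). Unlike plain accuracy, it stays informative when repeats and non-repeats are unevenly represented.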

Heuristic algorithms for best match graph editing

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00196-3 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

David Schaller ◽

Manuela Geiß ◽

Marc Hellmuth ◽

Peter F. Stadler

Keyword(s):

Heuristic Algorithms ◽

Sequence Data ◽

Similarity Measures ◽

Set Partitioning ◽

Attractive Alternative ◽

Biological Sequence ◽

Detection Algorithms ◽

Empirical Estimates ◽

Mathematical Phylogenetics ◽

Multiple Species

Abstract. Background: Best match graphs (BMGs) are a class of colored digraphs that naturally appear in mathematical phylogenetics as a representation of the pairwise most closely related genes among multiple species. An arc connects a gene x with a gene y from another species (vertex color) Y whenever it is one of the phylogenetically closest relatives of x. BMGs can be approximated with the help of similarity measures between gene sequences, albeit not without errors. Empirical estimates thus will usually violate the theoretical properties of BMGs. The corresponding graph editing problem can be used to guide error correction for best match data. Since the arc set modification problems for BMGs are NP-complete, efficient heuristics are needed if BMGs are to be used for the practical analysis of biological sequence data. Results: Since BMGs have a characterization in terms of consistency of a certain set of rooted triples (binary trees on three vertices) defined on the set of genes, we consider heuristics that operate on triple sets. As an alternative, we show that there is a close connection to a set partitioning problem that leads to a class of top-down recursive algorithms that are similar to Aho's supertree algorithm and give rise to BMG editing algorithms that are consistent in the sense that they leave BMGs invariant. Extensive benchmarking shows that community detection algorithms for the partitioning steps perform best for BMG editing. Conclusion: Noisy BMG data can be corrected with sufficient accuracy and efficiency to make BMGs an attractive alternative to classical phylogenetic methods.
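
The definition of a best match can be illustrated by deriving the BMG from a known gene tree. The sketch below assumes the tree is given as nested tuples of gene names and that `color` maps each gene to its species. For every gene x and every other species, it keeps the genes y whose last common ancestor with x is deepest. It only constructs a BMG from a tree and does not implement any of the editing heuristics, and the function names are illustrative.

```python
def leaf_paths(tree):
    """Map each leaf to the tuple of child indices leading to it from the root."""
    paths = {}

    def visit(node, path):
        if isinstance(node, tuple):
            for idx, child in enumerate(node):
                visit(child, path + (idx,))
        else:
            paths[node] = path

    visit(tree, ())
    return paths


def lca_depth(p, q):
    """Depth of the last common ancestor, given two root-to-leaf paths."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def best_match_graph(tree, color):
    """Return the arc set {(x, y)}: y is a best match of x in y's species."""
    paths = leaf_paths(tree)
    arcs = set()
    for x in paths:
        best = {}  # species -> deepest LCA depth seen for that species
        for y in paths:
            if color[y] != color[x]:
                d = lca_depth(paths[x], paths[y])
                best[color[y]] = max(best.get(color[y], -1), d)
        for y in paths:
            if color[y] != color[x] and lca_depth(paths[x], paths[y]) == best[color[y]]:
                arcs.add((x, y))
    return arcs
```

With the tree `(("a1", "b1"), "b2")` and species A for `a1` and B for `b1`, `b2`, the gene `a1` has the single best match `b1`, while both `b1` and `b2` point to `a1`. The relation is therefore not symmetric, which is why BMGs are directed graphs.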

A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00197-2 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Naveed Ahmed Azam ◽

Jianshen Zhu ◽

Yanming Sun ◽

Yu Shi ◽

Aleksandar Shurbevski ◽

...

Keyword(s):

Search Algorithm ◽

Chemical Compounds ◽

Mixed Integer ◽

Graph Search ◽

Second Phase ◽

Hydrogen Atoms ◽

Chemical Structures ◽

Prediction Function ◽

Acyclic Graphs ◽

Artificial Neural

Abstract. Analysis of chemical graphs is becoming a major research topic in computational molecular biology due to its potential applications to drug design. One of the major approaches in such a study is inverse quantitative structure activity/property relationship (inverse QSAR/QSPR) analysis, which is to infer chemical structures from given chemical activities/properties. Recently, a novel two-phase framework has been proposed for inverse QSAR/QSPR, where in the first phase an artificial neural network (ANN) is used to construct a prediction function. In the second phase, a mixed integer linear program (MILP) formulated on the trained ANN and a graph search algorithm are used to infer desired chemical structures. The framework has been applied to the case of chemical compounds with cycle index up to 2 so far. The computational experiments conducted on instances with n non-hydrogen atoms show that a feature vector can be inferred by solving an MILP for up to $$n = 40$$, whereas graphs can be enumerated for up to $$n = 15$$. When applied to the case of chemical acyclic graphs, the maximum computable diameter of a chemical structure was up to 8. In this paper, we introduce a new characterization of graph structure, called "branch-height", based on which a new MILP formulation and a new graph search algorithm are designed for chemical acyclic graphs. The results of computational experiments using such chemical properties as octanol/water partition coefficient, boiling point and heat of combustion suggest that the proposed method can infer chemical acyclic graphs with around $$n = 50$$ and diameter 30.

INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00198-1 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Hooman Zabeti ◽

Nick Dexter ◽

Amir Hosein Safari ◽

Nafiseh Sedaghat ◽

Maxwell Libbrecht ◽

...

Keyword(s):

Machine Learning ◽

Drug Resistance ◽

Predictive Accuracy ◽

Group Testing ◽

Predictive Performance ◽

Machine Learning Techniques ◽

Evaluation Metrics ◽

Lower Accuracy ◽

Unseen Data ◽

The One

Abstract. Motivation: Prediction of drug resistance and identification of its mechanisms in bacteria such as Mycobacterium tuberculosis, the etiological agent of tuberculosis, is a challenging problem. Solving this problem requires a transparent, accurate, and flexible predictive model. The methods currently used for this purpose rarely satisfy all of these criteria. On the one hand, approaches based on testing strains against a catalogue of previously identified mutations often yield poor predictive performance; on the other hand, machine learning techniques typically have higher predictive accuracy, but often lack interpretability and may learn patterns that produce accurate predictions for the wrong reasons. Current interpretable methods may either exhibit a lower accuracy or lack the flexibility needed to generalize them to previously unseen data. Contribution: In this paper we propose a novel technique, inspired by group testing and Boolean compressed sensing, which yields highly accurate predictions and interpretable results, and is flexible enough to be optimized for various evaluation metrics at the same time. Results: We test the predictive accuracy of our approach on five first-line and seven second-line antibiotics used for treating tuberculosis. We find that its accuracy is higher than or comparable to that of commonly used machine learning models, and that it is able to identify variants in genes with previously reported association to drug resistance. Our method is intrinsically interpretable, and can be customized for different evaluation metrics. Our implementation is available at github.com/hoomanzabeti/INGOT_DR and can be installed via the Python Package Index (PyPI) under ingotdr. This package is also compatible with most of the tools in the Scikit-learn machine learning library.

Published By Springer (Biomed Central Ltd.)
