A Covariance Matrix Inversion Problem arising from the Construction of Phylogenetic Trees

Abstract: We describe an efficient algorithm for the inversion of covariance matrices that arise in the context of phylogenetic tree construction. Phylogenetic trees describe the evolutionary relationships between species, and their construction is computationally demanding. Many approaches involve the symmetric matrix of evolutionary distances between species. Regarding these distances as random variables, the corresponding set of variances and covariances forms a rank-4 tensor, and the inner product defined by its inverse can be used to assign statistical scores to candidate trees. We describe a natural set of assumptions for the phylogenetic tree under construction, and show how under these assumptions the covariance tensor for a tree with n leaves can be inverted in $O(n^2)$ operations. In addition to presenting the inversion algorithm, we hope this article will open algebraic and computational problems from the field of phylogeny to a wider audience.
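As a reading aid, the score mentioned in the abstract can be written as a quadratic form. This is only a sketch under assumed notation that does not appear in the abstract: $d_{ij}$ is the estimated distance between species $i$ and $j$, $t_{ij}(T)$ is the corresponding path length in a candidate tree $T$, and $V_{ij,kl} = \operatorname{Cov}(d_{ij}, d_{kl})$ is the rank-4 covariance tensor.

$$ S(T) = \sum_{i<j} \sum_{k<l} \bigl(d_{ij} - t_{ij}(T)\bigr)\,\bigl(V^{-1}\bigr)_{ij,kl}\,\bigl(d_{kl} - t_{kl}(T)\bigr) $$

Flattening each pair $(i,j)$ into a single index turns $V$ into a square matrix of size $n(n-1)/2$, so a generic dense inversion would cost on the order of $n^6$ operations. That is the baseline against which the $O(n^2)$ claim should be read.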

Mathematical Understanding of Sequence Alignment and Phylogenetic Algorithms: A Comprehensive Review of Methods

10.21203/rs.3.rs-105281/v1, 2020

Author(s): Rashid Saif, Sadia Nadeem, Ali Iftekhar, Alishba Khaliq, Saeeda Zia

Keyword(s): Phylogenetic Tree, Sequence Alignment, Phylogenetic Trees, Evolutionary Relationship, Biological Sequences, Pairwise Sequence Alignment, Phylogenetic Tree Construction, Local Sequence Alignment, Tree Construction, Local Sequence

Abstract. Context: Pairwise sequence alignment arranges two biological sequences (proteins or nucleic acids) to identify regions of resemblance that may suggest a functional, structural, and/or evolutionary relationship between them. There are two strategies in pairwise sequence alignment: local sequence alignment (Smith-Waterman algorithm) and global sequence alignment (Needleman-Wunsch algorithm). In local sequence alignment, two sequences that may or may not be related are aligned to find regions of local similarity within large sequences, whereas in global sequence alignment, two sequences of similar length are aligned end to end to identify conserved regions. Similarities and divergence between biological sequences identified by sequence alignment also have to be rationalized and visualized in the form of phylogenetic trees. Phylogenetic tree construction methods are divided into distance-based and character-based methods. Evidence Acquisition: In this article, different algorithms for sequence alignment and phylogenetic tree construction were studied with examples and compared, both to establish the best among them and to examine the background of these methods for a better understanding of computational phylogenetics. Conclusions: Pairwise sequence alignment is a very important part of bioinformatics, used to compare biological sequences and find similarities among them. The alignment data are visualized through a phylogenetic tree diagram that shows the evolutionary history among organisms. Phylogenetic trees are constructed through various methods: some are easier but do not provide accurate evolutionary data, whereas others provide accurate evolutionary distances among organisms but are computationally exhaustive.
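The difference between the two strategies named in this abstract can be illustrated with a minimal sketch of the scoring recurrences. The scoring values (match +1, mismatch -1, linear gap -1) and the function names are assumptions chosen for illustration; they are not taken from the review, and only the alignment score is computed, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences are aligned end to end."""
    # Row 0: aligning the empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

The only structural differences are the boundary conditions and the clamp at zero: the global recurrence pays for every unaligned position, while the local one ignores flanking regions and reports the best-scoring shared segment.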

Phylogenetic Tree Construction Using K-Mer Forest- Based Distance Calculation

International Journal of Online and Biomedical Engineering (iJOE), 10.3991/ijoe.v16i07.13807, 2020, Vol 16 (07), pp. 4, Cited By ~ 1

Author(s): Gihan Gamage, Nadeeshan Gimhana, Indika Perera, Shanaka Bandara, Thilina Pathirana, ...

Keyword(s): Phylogenetic Tree, Dna Sequences, Phylogenetic Trees, Genetic Relatedness, Biological Information, Pairwise Distance, Phylogenetic Tree Construction, Distance Calculation, Alignment Free, Tree Construction

Phylogenetics is one of the dominant data engineering research disciplines based on biological information. Here, in particular, we consider raw DNA sequences and perform comparative analysis in order to draw important conclusions. The phylogenetic tree helps significantly in representing evolutionary relationships among different organisms in a concise manner. When constructing phylogenetic trees, the elementary step is to calculate the genetic distance among species. Alignment-based and alignment-free methods are the two main distance computation approaches used to find the genetic relatedness of different species. In this paper we propose a novel alignment-free, pairwise distance calculation method based on k-mers and a state-of-the-art machine-learning-based phylogenetic tree construction mechanism. With the proposed approach we can convert longer DNA sequences into compact k-mer forests, which improves the efficiency of comparison. We then construct the phylogenetic tree from the calculated distances with the help of an algorithm built upon k-medoid clustering, which achieved significant efficiency and accuracy compared to traditional phylogenetic tree construction methods.
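To make the idea of an alignment-free, k-mer based pairwise distance concrete, here is a minimal sketch. It is not the k-mer forest method of the paper: it substitutes a plain Jaccard distance between k-mer sets, and the function names and the default k=3 are assumptions made for illustration.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """1 - |A & B| / |A | B| over the k-mer sets of the two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=3):
    """Symmetric pairwise distance matrix with a zero diagonal."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jaccard_distance(seqs[i], seqs[j], k)
    return d
```

A matrix produced this way could be passed to any distance-based tree construction or clustering method. No sequence alignment is computed at any point, which is what makes the approach alignment-free.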

Distribution of COVID-19 and Phylogenetic Tree Construction of SARS-CoV-2 in Indonesia

Journal of Pure and Applied Microbiology, 10.22207/jpam.14.spl1.42, 2020, Vol 14 (suppl 1), pp. 1035-1042

Author(s): Dora Dayu Rahma Turista, Aesthetica Islamy, Viol Dhea Kharisma, Arif Nur Muhammad Ansori

Keyword(s): Phylogenetic Tree, Phylogenetic Trees, Case Fatality, Medical Personnel, Distribution Data, Cure Rate, Transmission Process, Tree Construction, The World, History Of

Coronavirus disease 2019 (COVID-19) is a disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). COVID-19 has spread quickly across the world and has been declared a pandemic. Indonesia has many COVID-19 cases, with a high mortality rate. This study aimed to describe the distribution of COVID-19 in Indonesia and to construct the SARS-CoV-2 phylogenetic tree from Indonesian isolates and those from other countries, including other CoVs, to determine their relationship. The distribution data of COVID-19 in Indonesia were obtained from the COVID-19 Management Handling Unit and descriptively analyzed. SARS-CoV-2 isolates were retrieved from the GenBank® (National Center of Biotechnology Information, USA) and GISAID EpiCoV™ databases and were used to construct phylogenetic trees using MEGA X software. Of the 37 provinces in Indonesia, five provinces with the highest case fatality rates were DKI Jakarta, Jawa Barat, Jawa Timur, and Banten, and the five provinces with the highest cure rate were Kepulauan Riau, Bali, Aceh, Gorontalo, and DI Yogyakarta. SARS-CoV-2 Indonesian isolates were closely related to SARS-CoV-2 isolates from other countries. The rapid and widespread distribution of SARS-CoV-2 in Indonesia was caused by the lack of compliance with territorial restrictions and dishonesty with medical personnel. These data revealed that mutations can occur during the transmission process, which can be caused by a history of travel and increased patient immunity.

Computing nearest neighbour interchange distances between ranked phylogenetic trees

Journal of Mathematical Biology, 10.1007/s00285-021-01567-5, 2021, Vol 82 (1-2)

Author(s): Lena Collienne, Alex Gavryushkin

Keyword(s): Cancer Research, Computational Complexity, Phylogenetic Tree, Shortest Path, Phylogenetic Trees, Shortest Paths, Nearest Neighbour, Tree Inference, Subtree Prune And Regraft, Comparison Algorithms

Abstract: Many popular algorithms for searching the space of leaf-labelled (phylogenetic) trees are based on tree rearrangement operations. Under any such operation, the problem is reduced to searching a graph where vertices are trees and (undirected) edges are given by pairs of trees connected by one rearrangement operation (sometimes called a move). Most popular are the classical nearest neighbour interchange, subtree prune and regraft, and tree bisection and reconnection moves. The problem of computing distances, however, is $\mathbf{NP}$-hard in each of these graphs, making tree inference and comparison algorithms challenging to design in practice. Although ranked phylogenetic trees are one of the central objects of interest in applications such as cancer research, immunology, and epidemiology, the computational complexity of the shortest path problem for these trees remained unsolved for decades. In this paper, we settle this problem for the ranked nearest neighbour interchange operation by establishing that the complexity depends on the weight difference between the two types of tree rearrangements (rank moves and edge moves), and varies from quadratic, which is the lowest possible complexity for this problem, to $\mathbf{NP}$-hard, which is the highest. In particular, our result provides the first example of a phylogenetic tree rearrangement operation for which shortest paths, and hence the distance, can be computed efficiently. Specifically, our algorithm scales to trees with tens of thousands of leaves (and likely hundreds of thousands if implemented efficiently).

PTC: an interactive tool for phylogenetic tree construction

Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003, 10.1109/csb.2003.1227378, 2004, Cited By ~ 3

Author(s): C. Yang, S. Khuri

Keyword(s): Phylogenetic Tree, Phylogenetic Tree Construction, Interactive Tool, Tree Construction

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal, 10.1042/bj1870065, 1980, Vol 187 (1), pp. 65-74, Cited By ~ 12

Author(s): D Penny, M D Hendy, L R Foulds

Keyword(s): Amino Acid, Phylogenetic Tree, Protein Sequence, Phylogenetic Trees, Sequence Data, Protein Sequences, Nucleotide Sequences, Amino Acid Sequences, Minimal Tree, Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

Phylogenetic Tree Construction from a Distance Matrix

Encyclopedia of Algorithms, 10.1007/978-0-387-30162-4_292, 2008, pp. 651-653

Author(s): Jesper Jansson

Keyword(s): Phylogenetic Tree, Distance Matrix, Phylogenetic Tree Construction, Tree Construction

Analysis of SARS-CoV-2 nucleocapsid protein sequence variations in ASEAN countries

Medical Journal of Indonesia, 10.13181/mji.oa.215304, 2021

Author(s): Mochammad Rajasa Mukti Negara, Ita Krissanti, Gita Widya Pradini

Keyword(s): Phylogenetic Tree, Phylogenetic Trees, Protein Sequences, Reference Sequence, N Protein, Asean Country, Sequence Variations, Complete Sequences, Asean Countries, Global Initiative

BACKGROUND Nucleocapsid (N) protein is one of four structural proteins of SARS-CoV-2 which is known to be more conserved than spike protein and is highly immunogenic. This study aimed to analyze the variation of the SARS-CoV-2 N protein sequences in ASEAN countries, including Indonesia. METHODS Complete sequences of SARS-CoV-2 N protein from each ASEAN country were obtained from Global Initiative on Sharing All Influenza Data (GISAID), while the reference sequence was obtained from GenBank. All sequences collected from December 2019 to March 2021 were grouped to the clade according to GISAID, and two representative isolates were chosen from each clade for the analysis. The sequences were aligned by MUSCLE, and phylogenetic trees were built using MEGA-X software based on the nucleotide and translated AA sequences. RESULTS 98 isolates of complete N protein genes from ASEAN countries were analyzed. The nucleotides of all isolates were 97.5% conserved. Of 31 nucleotide changes, 22 led to amino acid (AA) substitutions; thus, the AA sequences were 94.5% conserved. The phylogenetic tree of nucleotide and AA sequences shows similar branches. Nucleotide variations in clade O (C28311T); clade GR (28881–28883 GGG>AAC); and clade GRY (28881–28883 GGG>AAC and C28977T) lead to specific branches corresponding to the clade within both trees. CONCLUSIONS The N protein sequences of SARS-CoV-2 across ASEAN countries are highly conserved. Most isolates were closely related to the reference sequence originating from China, except the isolates representing clade O, GR, and GRY which formed specific branches in the phylogenetic tree.
