Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Ananya Bhattacharjee ◽  
Md. Shamsuzzoha Bayzid
2019 ◽  
Author(s):  
Ananya Bhattacharjee ◽  
Md. Shamsuzzoha Bayzid

Abstract Background: Due to the recent advances in sequencing technologies and species tree estimation methods capable of taking gene tree discordance into account, notable progress has been achieved in constructing large-scale phylogenetic trees from genome-wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. One of the foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix does not contain any missing values. Results: We introduce two highly accurate machine learning based distance imputation techniques. One of our approaches is based on matrix factorization, and the other is an autoencoder-based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that our techniques match or improve upon the best alternate techniques for distance imputation. Moreover, our proposed techniques can handle substantial amounts of missing data, to the extent where the best alternate methods fail. Conclusions: This study shows for the first time the power and feasibility of applying deep learning techniques for imputing distance matrices. The autoencoder-based deep learning technique is highly accurate and scalable to large datasets. We have made these techniques freely available as cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).
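
To make the matrix factorization idea concrete, the following is a minimal sketch and not the authors' implementation: it fits a low-rank model to the observed entries of a symmetric distance matrix by stochastic gradient descent and uses the model to fill the missing ones. The function name `impute_distances`, the use of `None` to mark missing entries, and all hyperparameters (`rank`, `epochs`, `lr`, `reg`) are assumptions chosen for illustration.

```python
import random


def impute_distances(D, rank=2, epochs=2000, lr=0.02, reg=1e-3, seed=0):
    """Fill None entries of a symmetric distance matrix via low-rank factorization.

    Illustrative sketch only. Assumes D is square, symmetric, has at least one
    observed off-diagonal entry, and marks missing values with None in both
    the (i, j) and (j, i) positions.
    """
    n = len(D)
    observed = [(i, j, D[i][j]) for i in range(n) for j in range(n)
                if i != j and D[i][j] is not None]
    # Rescale to roughly [0, 1] so a fixed learning rate behaves reasonably.
    scale = max(d for _, _, d in observed) or 1.0
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]

    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, d in observed:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = d / scale - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)

    filled = [row[:] for row in D]  # leave the input untouched
    for i in range(n):
        filled[i][i] = 0.0
        for j in range(i + 1, n):
            if D[i][j] is None:
                est_ij = sum(U[i][k] * V[j][k] for k in range(rank))
                est_ji = sum(U[j][k] * V[i][k] for k in range(rank))
                # Average both directions to keep symmetry; clamp at zero.
                value = max(0.0, scale * (est_ij + est_ji) / 2.0)
                filled[i][j] = filled[j][i] = value
    return filled


# Hypothetical 4-taxon matrix with one missing pair (taxa 0 and 3).
D = [
    [0.0, 0.3, 0.5, None],
    [0.3, 0.0, 0.6, 0.7],
    [0.5, 0.6, 0.0, 0.3],
    [None, 0.7, 0.3, 0.0],
]
filled = impute_distances(D)
```

The completed matrix `filled` has no missing values, so it could then be passed to a distance-based method such as neighbor joining or UPGMA. How accurate the imputed entry is depends on the data and hyperparameters; this sketch makes no claim to reproduce the paper's results.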


2014 ◽  
Vol 15 (6) ◽  
pp. 783-817 ◽  
Author(s):  
MAURICE BRUYNOOGHE ◽  
HENDRIK BLOCKEEL ◽  
BART BOGAERTS ◽  
BROES DE CAT ◽  
STEF DE POOTER ◽  
...  

Abstract This paper provides a gentle introduction to problem-solving with the IDP3 system. The core of IDP3 is a finite model generator that supports first-order logic enriched with types, inductive definitions, aggregates and partial functions. It offers its users a modeling language that is a slight extension of predicate logic and allows them to solve a wide range of search problems. Apart from a small introductory example, applications are selected from problems that arose within machine learning and data mining research. These research areas have recently shown a strong interest in declarative modeling and constraint-solving as opposed to algorithmic approaches. The paper illustrates that the IDP3 system can be a valuable tool for researchers with such an interest. The first problem is in the domain of stemmatology, a domain of philology concerned with the relationship between surviving variant versions of a text. The second problem is about a somewhat related problem within biology where phylogenetic trees are used to represent the evolution of species. The third and final problem concerns the classical problem of learning a minimal automaton consistent with a given set of strings. For this last problem, we show that the performance of our solution comes very close to that of the state-of-the-art solution. For each of these applications, we analyze the problem, illustrate the development of a logic-based model and explore how alternatives can affect the performance.


2005 ◽  
Vol 03 (06) ◽  
pp. 1429-1440 ◽  
Author(s):  
MANUEL GIL ◽  
CHRISTOPHE DESSIMOZ ◽  
GASTON H. GONNET

We present a dimensionless fit index for phylogenetic trees that have been constructed from distance matrices. It is designed to measure the quality of the fit of the data to a tree in absolute terms, independent of linear transformations on the distance matrix. The index can be used as an absolute measure to evaluate how well a set of data fits to a tree, or as a relative measure to compare different methods that are expected to produce the same tree. The usefulness of the index is demonstrated in three examples.


2020 ◽  
Vol 36 (17) ◽  
pp. 4590-4598
Author(s):  
Robert Page ◽  
Ruriko Yoshida ◽  
Leon Zhang

Abstract Motivation: Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees. Results: Our main results are 2-fold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, namely, the existence of a tropical cell decomposition into regions of fixed tree topology; and (ii) the development of a stochastic optimization method to estimate tropical PCs over the space of phylogenetic trees using a Markov Chain Monte Carlo approach. This method performs well with simulation studies, and it is applied to three empirical datasets: Apicomplexa and African coelacanth genomes as well as sequences of hemagglutinin for influenza from New York. Availability and implementation: Dataset: http://polytopes.net/Data.tar.gz. Code: http://polytopes.net/tropica_MCMC_codes.tar.gz. Supplementary information: Supplementary data are available at Bioinformatics online.
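
For readers unfamiliar with the tropical distance mentioned above, here is a small sketch of the standard tropical metric between two vectors, which is the maximum coordinate difference minus the minimum coordinate difference. Representing each tree as a vector of its pairwise leaf distances is an assumption made for this illustration, and the function name `tropical_distance` is hypothetical rather than taken from the authors' code.

```python
def tropical_distance(u, v):
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i).

    Vectors that differ only by adding the same constant to every
    coordinate are at distance zero, which matches working in the
    tropical projective space.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


# Two hypothetical trees on three leaves, each stored as the vector
# (d(1,2), d(1,3), d(2,3)) of pairwise leaf distances.
tree_a = [2, 4, 4]
tree_b = [4, 4, 2]
print(tropical_distance(tree_a, tree_b))
```

Because only differences of coordinates enter the formula, shifting every entry of one vector by the same amount leaves the distance unchanged.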


2020 ◽  
Author(s):  
Daniel Bojar ◽  
Rani K. Powers ◽  
Diogo M. Camacho ◽  
James J. Collins

Abstract Glycans, the most diverse biopolymer and crucial for many biological processes, are shaped by evolutionary pressures stemming in particular from host-pathogen interactions. While this positions glycans as being essential for understanding and targeting host-pathogen interactions, their considerable diversity and a lack of methods have hitherto stymied progress in leveraging their predictive potential. Here, we utilize a curated dataset of 12,674 glycans from 1,726 species to develop and apply machine learning methods to extract evolutionary information from glycans. Our deep learning-based language model SweetOrigins provides evolution-informed glycan representations that we utilize to discover and investigate motifs used for molecular mimicry-mediated immune evasion by commensals and pathogens. Novel glycan alignment methods enable us to identify and contextualize virulence-determining motifs in the capsular polysaccharide of Staphylococcus aureus and Acinetobacter baumannii. Further, we show that glycan-based phylogenetic trees contain most of the information present in traditional 16S rRNA-based phylogenies and improve on the differentiation of genetically closely related but phenotypically divergent species, such as Bacillus cereus and Bacillus anthracis. Leveraging the evolutionary information inherent in glycans with machine learning methodology is poised to provide further – critically needed – insights into host-pathogen interactions, sequence-to-function relationships, and the major influence of glycans on phenotypic plasticity.


Author(s):  
Guillermin Agüero-Chapin ◽  
Yuliana Jiménez ◽  
Aminael Sánchez-Rodríguez ◽  
Reinaldo Molina-Ruiz ◽  
Oscar Vivanco ◽  
...  

Background: Molecular phylogenetic algorithms frequently disagree with approaches that consider reproductive compatibility and morphological criteria for species delimitation. The question arises whether the species boundaries resulting from molecular, reproductive and/or morphological data are definitively irreconcilable, or whether the existing phylogenetic methods are not sensitive enough to reconcile morphological and genetic variation in species delimitation. Objectives: We propose DISTATIS as an integrative framework to combine alignment-based (AB) and alignment-free (AF) distance matrices from ITS2 sequences/structures to shed light on whether Gelasinospora and Neurospora are sister but independent genera. Methodology: We aimed at addressing this standing issue by harmonizing genus-specific classification based on their ascospore morphology and ITS2 molecular data. To validate our proposal, three phylogenetic approaches, i) traditional alignment-based, ii) alignment-free and iii) novel distance integrative (DI)-based, were comparatively evaluated on a set of Gelasinospora and Neurospora species. All considered species have been extensively characterized at both the morphological and reproductive levels, and there are known incongruences between their ascospore morphology and molecular data that hamper genus-specific delimitation. Results: Traditional AB phylogenetic analyses fail to resolve the Gelasinospora and Neurospora genera into independent monophyletic clades following ascospore morphology criteria. In contrast, AF and DI approaches produced phylogenetic trees that could properly delimit the expected monophyletic clades. Conclusions: The DI approach outperformed the AF one in the sense that it could also divide the Neurospora species according to their reproduction mode.


2021 ◽  
Author(s):  
Jayanta Pal ◽  
Soumen Ghosh ◽  
Bansibadan Maji ◽  
Dilip Kumar Bhattacharya

Abstract Similarity/dissimilarity study of protein and genome sequences remains a challenging task, and the selection of techniques and descriptors to be adopted plays an important role in computational biology. Again, genome sequence comparison is always preferred to protein sequence comparison due to the presence of 20 amino acids in a protein sequence compared to only 4 nucleotides in a genome sequence. So it is important to consider a suitable representation that is both time and space efficient and also equally applicable to protein sequences of equal and unequal lengths. In the binary form of representation, the Fourier transform of a protein sequence reduces to the transformation of 20 simple binary sequences in the Fourier domain, where, for each such sequence, Parseval's identity gives a very simple computable form of the power spectrum. This gives rise to readily acceptable forms of moments of different degrees. Again, such moments, when properly normalized, show a monotonically descending trend as the degree of the moment increases. So it is better to stick to moments of smaller degrees only. In this paper, descriptors are taken as 20-component vectors, where each component corresponds to a general second order moment of one of the 20 simple binary sequences. Then distance matrices are obtained by using Euclidean distance as the distance measure between each pair of sequences. Phylogenetic trees are obtained from the distance matrices using the UPGMA algorithm. In the present paper, the datasets used for similarity/dissimilarity study are 9 ND4, 16 ND5, 9 ND6, 24 TF proteins and 12 Baculovirus proteins. It is found that the phylogenetic trees produced by the present method are on par with those produced by the earlier methods adopted by other authors and also with their known biological references. Further, it takes less computational time and is equally applicable to sequences of equal and unequal lengths.
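
The descriptor pipeline described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not define the "general second order moment" or its normalization, so the choice here (sum of squared power-spectrum values over nonzero frequencies, divided by count × (length − count)) is a placeholder, and the names `power_spectrum`, `descriptor` and `euclidean` are hypothetical.

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def power_spectrum(x):
    """Power spectrum |X(k)|^2 of a numeric sequence via a direct DFT."""
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def descriptor(seq):
    """20-component vector: one second-order moment per amino acid.

    The normalization below is an ASSUMPTION made for illustration;
    the abstract does not state the exact formula.
    """
    n = len(seq)
    vec = []
    for aa in AMINO_ACIDS:
        indicator = [1 if c == aa else 0 for c in seq]  # binary sequence
        count = sum(indicator)
        if count == 0 or count == n:
            vec.append(0.0)  # no variation, so no spectral content
            continue
        ps = power_spectrum(indicator)
        moment = sum(p ** 2 for p in ps[1:])  # skip the k = 0 term
        vec.append(moment / (count * (n - count)))
    return vec


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical sequences of unequal length; descriptors are both length 20.
seqs = ["MKVLAAGIVGL", "MKVLSAGIVGLKK", "MTTQAPLG"]
vectors = [descriptor(s) for s in seqs]
dist_matrix = [[euclidean(u, v) for v in vectors] for u in vectors]
```

Since every descriptor has 20 components regardless of sequence length, sequences of unequal lengths can be compared directly, and `dist_matrix` could then be handed to a UPGMA implementation. Parseval's identity shows up here as the fact that the power spectrum of a binary indicator sequence of length N with c ones sums to N × c.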


2021 ◽  
Vol 66 (1) ◽  
pp. 37
Author(s):  
P. Liptak ◽  
A. Kiss

With the development of sequencing technologies, ever larger amounts of sequence data are available. This poses additional challenges, since processing them is usually a complex and time-consuming computational task. During the construction of phylogenetic trees, the relationship between the sequences is examined, and an attempt is made to represent the evolutionary relationship. There are several algorithms for this problem, but with the development of computer science, the question arises as to whether new technologies can be exploited in these areas of computational biology. In the following publication, we investigate whether the reinforcement learning model of machine learning can generate accurate phylogenetic trees based on the distance matrix.


2020 ◽  
Author(s):  
Francesco Ballesio ◽  
Ali Haider Bangash ◽  
Didier Barradas-Bautista ◽  
Justin Barton ◽  
Andrea Guarracino ◽  
...  

The pandemic spread of SARS-CoV-2 and its ability to reinfect a recovered subject, among its other damaging characteristics, took everybody by surprise. A global collaborative scientific effort was urgently required to bring together experts from different areas of medicine and data science. Such a platform was provided by the COVID19 Virtual BioHackathon, organized from the 5th to the 11th of April, 2020, to consider pressing issues ranging from text mining to genomics. Under the "Machine learning" track, we determined the optimal k-mer length for feature extraction, constructed continuous distributed representations for protein sequences to create phylogenetic trees in an alignment-free manner, and clustered predicted MHC class I and II binding affinity to aid in vaccine design. All the related work is available in a GitHub repository under an MIT license for future research.
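
As a rough illustration of alignment-free comparison with k-mers, the sketch below counts overlapping k-mers in each sequence and compares the count profiles with cosine distance. It is a simplified stand-in, not the hackathon's code: the work described above used learned continuous representations, whereas this example uses plain k-mer counts, and the function names and the choice of cosine distance are assumptions.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(c1, c2):
    """1 - cosine similarity between two k-mer count profiles."""
    dot = sum(c1[m] * c2[m] for m in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 1.0  # a sequence shorter than k has no k-mers to compare
    return 1.0 - dot / (norm1 * norm2)


# Hypothetical toy sequences; no alignment is needed to compare them.
a = kmer_counts("ATGATGCC", 3)
b = kmer_counts("ATGATGCA", 3)
print(cosine_distance(a, b))
```

Computing this distance for every pair of sequences gives a distance matrix that a distance-based tree builder can consume. The choice of k trades specificity against sparsity, which is one reason to tune the k-mer length.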

