Accurate inference of tree topologies from multiple sequence alignments using deep learning

AbstractReconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.

Download Full-text

Accurate Inference of Tree Topologies from Multiple Sequence Alignments Using Deep Learning

Systematic Biology ◽

10.1093/sysbio/syz060 ◽

2019 ◽

Cited By ~ 2

Author(s):

Anton Suvorov ◽

Joshua Hochuli ◽

Daniel R Schrider

Keyword(s):

Deep Learning ◽

Parameter Space ◽

Phylogenetic Trees ◽

Strong Support ◽

Biological Research ◽

Learning Approaches ◽

Sequence Alignments ◽

Traditional Methods ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that the deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.

Download Full-text

Analyzing effect of quadruple multiple sequence alignments on deep learning based protein inter-residue distance prediction

Scientific Reports ◽

10.1038/s41598-021-87204-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Aashish Jain ◽

Genki Terashi ◽

Yuki Kagaya ◽

Sai Raghavendra Maddhuri Venkata Subramaniya ◽

Charles Christoffer ◽

...

Keyword(s):

Deep Learning ◽

Structure Prediction ◽

Tertiary Structure ◽

3D Structure ◽

Evolutionary Information ◽

Learning Approaches ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Novel Approach

AbstractProtein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. We show that combining four MSAs of different E-value cutoffs improved the model prediction performance as compared to single E-value MSA features. A further improvement was observed when an attention layer was used and even more when additional prediction tasks of bond angle predictions were added. The improvement of distance predictions were successfully transferred to achieve better protein tertiary structure modeling.

Download Full-text

DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning

Scientific Reports ◽

10.1038/s41598-021-91827-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Farhan Quadir ◽

Raj S. Roy ◽

Randal Halfmann ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Tertiary Structure ◽

Protein Complexes ◽

Complex Structure ◽

Great Success ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Residue Contacts ◽

Evolutionary Features

AbstractDeep learning methods that achieved great success in predicting intrachain residue-residue contacts have been applied to predict interchain contacts between proteins. However, these methods require multiple sequence alignments (MSAs) of a pair of interacting proteins (dimers) as input, which are often difficult to obtain because there are not many known protein complexes available to generate MSAs of sufficient depth for a pair of proteins. In recognizing that multiple sequence alignments of a monomer that forms homomultimers contain the co-evolutionary signals of both intrachain and interchain residue pairs in contact, we applied DNCON2 (a deep learning-based protein intrachain residue-residue contact predictor) to predict both intrachain and interchain contacts for homomultimers using multiple sequence alignment (MSA) and other co-evolutionary features of a single monomer followed by discrimination of interchain and intrachain contacts according to the tertiary structure of the monomer. We name this tool DNCON2_Inter. Allowing true-positive predictions within two residue shifts, the best average precision was obtained for the Top-L/10 predictions of 22.9% for homodimers and 17.0% for higher-order homomultimers. In some instances, especially where interchain contact densities are high, DNCON2_Inter predicted interchain contacts with 100% precision. We also developed Con_Complex, a complex structure reconstruction tool that uses predicted contacts to produce the structure of the complex. Using Con_Complex, we show that the predicted contacts can be used to accurately construct the structure of some complexes. Our experiment demonstrates that monomeric multiple sequence alignments can be used with deep learning to predict interchain contacts of homomeric proteins.

Download Full-text

AttentiveDist: Protein Inter-Residue Distance Prediction Using Deep Learning with Attention on Quadruple Multiple Sequence Alignments

10.1101/2020.11.24.396770 ◽

2020 ◽

Author(s):

Aashish Jain ◽

Genki Terashi ◽

Yuki Kagaya ◽

Sai Raghavendra Maddhuri Venkata Subramaniya ◽

Charles Christoffer ◽

...

Keyword(s):

Deep Learning ◽

Structure Prediction ◽

Prediction Models ◽

3D Structure ◽

Evolutionary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments ◽

Distance Prediction

ABSTRACTProtein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. The model is trained in a multi-task fashion to also predict backbone and orientation angles further improving the inter-residue distance prediction. We show that AttentiveDist outperforms the top methods for contact prediction in the CASP13 structure prediction competition. To aid in structure modeling we also developed two new deep learning-based sidechain center distance and peptide-bond nitrogen-oxygen distance prediction models. Together these led to a 12% increase in TM-score from the best server method in CASP13 for structure prediction.

Download Full-text

PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data

Bioinformatics ◽

10.1093/bioinformatics/btab096 ◽

2021 ◽

Author(s):

Jacob L Steenwyk ◽

Thomas J Buida ◽

Abigail L Labella ◽

Yuanning Li ◽

Xing-Xing Shen ◽

...

Keyword(s):

Information Content ◽

Phylogenetic Trees ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Composition ◽

Multiple Sequence Alignments ◽

Functional Relationships ◽

Biology Process ◽

Rate Evaluation

Abstract Motivation Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes, and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock, and collapsing bipartitions (internal branches) with low support. Results To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes; and (3) identify lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining, and deriving biological meaning from increasingly large phylogenomic datasets. Availability PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPi (https://pypi.org/project/phykit/), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT). Supplementary information Supplementary data are available on figshare (doi: 10.6084/m9.figshare.13118600) and are available at Bioinformatics online.

Download Full-text

Beyond Simple Homology Searches: Multiple Sequence Alignments and Phylogenetic Trees

Current Protocols Essential Laboratory Techniques ◽

10.1002/9780470089941.et1103s01 ◽

2009 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

Rebecca A. Zufall

Keyword(s):

Phylogenetic Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Download Full-text

Beyond Simple Homology Searches: Multiple Sequence Alignments and Phylogenetic Trees

Current Protocols Essential Laboratory Techniques ◽

10.1002/cpet.9 ◽

2017 ◽

Vol 14 (1) ◽

Cited By ~ 1

Author(s):

Rebecca A. Zufall

Keyword(s):

Phylogenetic Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Download Full-text

Interim Report on Multiple Sequence Alignments and TaqMan Signature Mapping to Phylogenetic Trees

10.2172/1047247 ◽

2012 ◽

Author(s):

S Gardner ◽

C Jaing

Keyword(s):

Phylogenetic Trees ◽

Interim Report ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Download Full-text

Evolutionary Relationships and Sequence-Structure Determinants in Human SARS Coronavirus-2 Spike Proteins for Host Receptor Recognition

10.26434/chemrxiv.12190449 ◽

2020 ◽

Author(s):

Lalitha Guruprasad

Keyword(s):

Phylogenetic Trees ◽

Disulfide Bridge ◽

Sequence Motifs ◽

Structural Determinants ◽

Sequence Alignments ◽

Multiple Sequence ◽

Angiotensin Converting Enzyme 2 ◽

Multiple Sequence Alignments ◽

Loop 2 ◽

Spike Proteins

<div>Coronavirus disease 2019 (COVID-19) is a pandemic infectious disease caused by novel Severe Acute Respiratory Syndrome coronavirus-2 (SARS CoV-2). The SARS CoV-2 is transmitted more rapidly and readily than SARS CoV. Both, SARS CoV and SARS CoV-2 via their glycosylated spike proteins recognize the human angiotensin converting enzyme-2 (ACE-2) receptor. We generated multiple sequence alignments and phylogenetic trees for representative spike proteins of CoV and CoV-2 from various host sources in order to analyze the specificity in SARS CoV-2 spike proteins required for causing infection in humans. Our results show that two sequence motifs in the N-terminal domain; "MESEFR" and "SYLTPG" are specific to human SARS CoV-2 and pangolin SARS CoV. In the receptor binding domain (RBD), three sequence loops; VGGNY (loop 1), YQAGSTPC (loop 2), EGFNCY (loop 3) and a tethered disulfide bridge Cys480-Cys488 connecting loops 2 and 3 are structural determinants for the recognition of human ACE-2 receptor. The complete genome analysis of representative SARS CoVs from bat, civet, pangolin, human host sources and human SARS CoV-2 identified the bat genome (GenBank code: MN996532.1) and the pangolin SARS CoV genomes as closest to the recent novel human SARS CoV-2 genomes. The bat CoV genomes (GenBank codes: MG772933 and MG772934) are evolutionary intermediates in the mutagenesis progression towards becoming human SARS CoV-2. </div>

Download Full-text

A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies

Research Ideas and Outcomes ◽

10.3897/rio.5.e36178 ◽

2019 ◽

Vol 5 ◽

Cited By ~ 18

Author(s):

Alexis Criscuolo

Keyword(s):

Phylogenetic Trees ◽

Species Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Internal Branch ◽

Multiple Sequence Alignments ◽

Fast Running ◽

Alignment Free ◽

Multiple Threads ◽

Genome Assemblies

This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.

Download Full-text