ModelRevelator: Fast phylogenetic model estimation via deep learning

2021 ◽  
Author(s):  
Sebastian Burgstaller-Muehlbacher ◽  
Stephen M Crotty ◽  
Heiko A Schmidt ◽  
Tamara Drucks ◽  
Arndt von Haeseler

Selecting the best model of sequence evolution for a multiple sequence alignment (MSA) constitutes the first step of phylogenetic tree reconstruction. Common approaches for inferring nucleotide models typically apply maximum likelihood (ML) methods, with discrimination between models determined by one of several information criteria. This requires tree reconstruction and optimisation, which can be computationally expensive. We demonstrate that neural networks can be used to perform model selection without the need to reconstruct trees, optimise parameters, or calculate likelihoods. We introduce ModelRevelator, a model selection tool underpinned by two deep neural networks. The first neural network, NNmodelfind, recommends one of six commonly used models of sequence evolution, ranging in complexity from JC to GTR. The second, NNalphafind, recommends whether or not a Γ-distributed rate heterogeneous model should be incorporated and, if so, provides an estimate of the shape parameter, α. Users can simply input an MSA into ModelRevelator and swiftly receive output recommending the evolutionary model, inclusive of the presence or absence of rate heterogeneity, and an estimate of α. We show that ModelRevelator performs comparably with likelihood-based methods over a wide range of parameter settings, with significant potential savings in computational effort. Further, we show that this performance is not restricted to the alignments on which the networks were trained, but is maintained even on unseen empirical data. ModelRevelator will be made freely available in the forthcoming version of IQ-Tree (http://www.iqtree.org), and we expect it will provide a valuable alternative for phylogeneticists, especially where traditional methods of model selection are computationally prohibitive.
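
For readers unfamiliar with the likelihood-based baseline mentioned in this abstract, the following is a minimal sketch of selection by information criterion. It is not ModelRevelator or IQ-Tree code. The function names, the log-likelihood values, and the convention of counting only substitution-model parameters (treating branch lengths as shared across candidates) are assumptions made for illustration.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def select_model(fits: dict, n_sites: int, criterion: str = "bic"):
    """Return (best_model_name, scores) for a dict of name -> (lnL, k)."""
    if criterion not in ("aic", "bic"):
        raise ValueError("criterion must be 'aic' or 'bic'")
    scores = {}
    for name, (lnl, k) in fits.items():
        scores[name] = aic(lnl, k) if criterion == "aic" else bic(lnl, k, n_sites)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical maximised log-likelihoods for three candidate models.
# k counts free substitution-model parameters only (an assumption):
# JC has none, HKY has kappa plus three base frequencies, GTR has five
# exchange rates plus three base frequencies.
example_fits = {
    "JC": (-5200.0, 0),
    "HKY": (-5100.0, 4),
    "GTR": (-5098.0, 8),
}

if __name__ == "__main__":
    best, scores = select_model(example_fits, n_sites=1000, criterion="bic")
    print(best, scores)
```

Each log-likelihood fed into such a comparison requires tree and parameter optimisation under that model, which is the cost the abstract says the neural networks avoid.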

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Stephanie J. Spielman ◽  
Molly L. Miraglia

Abstract Background: Multiple sequence alignments (MSAs) represent the fundamental unit of data inputted to most comparative sequence analyses. In phylogenetic analyses in particular, errors in MSA construction have the potential to induce further errors in downstream analyses such as phylogenetic reconstruction itself, ancestral state reconstruction, and divergence time estimation. In addition to providing phylogenetic methods with an MSA to analyze, researchers must also specify a suitable evolutionary model for the given analysis. Most commonly, researchers apply relative model selection to select a model from a candidate set and then provide both the MSA and the selected model as input to subsequent analyses. While the influence of MSA errors has been explored for most stages of phylogenetics pipelines, the potential effects of MSA uncertainty on the relative model selection procedure itself have not been explored. Results: We assessed the consistency of relative model selection when presented with multiple perturbed versions of a given MSA. We find that while relative model selection is mostly robust to MSA uncertainty, in a substantial proportion of circumstances, relative model selection identifies distinct best-fitting models from different MSAs created from the same set of sequences. We find that this issue is more pervasive for nucleotide data than for amino-acid data. However, we also find that it is challenging to predict whether relative model selection will be robust or sensitive to uncertainty in a given MSA. Conclusions: We find that MSA uncertainty can affect virtually all steps of phylogenetic analysis pipelines to a greater extent than has previously been recognized, including relative model selection.
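
One simple way to quantify the consistency described in this abstract is the share of perturbed alignments whose best-fitting model matches the most frequently selected one. The sketch below illustrates that idea only; it is not the authors' metric, and the function name and example model labels are hypothetical.

```python
from collections import Counter


def selection_consistency(best_models):
    """Return (modal_model, fraction of alignments that selected it).

    best_models: one best-fitting model name per perturbed MSA variant.
    """
    if not best_models:
        raise ValueError("need at least one selected model")
    counts = Counter(best_models)
    modal_model, modal_count = counts.most_common(1)[0]
    return modal_model, modal_count / len(best_models)


if __name__ == "__main__":
    # Hypothetical outcome of model selection on four MSA variants
    # built from the same set of sequences.
    chosen = ["GTR+G", "GTR+G", "HKY+G", "GTR+G"]
    print(selection_consistency(chosen))
```

A fraction of 1.0 would mean model selection was fully robust to the alignment perturbations; lower values indicate that different alignments of the same sequences led to different best-fitting models.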


2021 ◽  
Author(s):  
Stephanie J Spielman ◽  
Molly Miraglia

Multiple sequence alignments (MSAs) represent the fundamental unit of data inputted to most comparative sequence analyses. In phylogenetic analyses in particular, errors in MSA construction have the potential to induce further errors in downstream analyses such as phylogenetic reconstruction itself, ancestral state reconstruction, and divergence estimation. In addition to providing phylogenetic methods with an MSA to analyze, researchers must also specify a suitable evolutionary model for the given analysis. Most commonly, researchers apply relative model selection to select a model from a candidate set and then provide both the MSA and the selected model as input to subsequent analyses. While the influence of MSA errors has been explored for most stages of phylogenetics pipelines, the potential effects of MSA uncertainty on the relative model selection procedure itself have not been explored. In this study, we assessed the consistency of relative model selection when presented with multiple perturbed versions of a given MSA. We find that while relative model selection is mostly robust to MSA uncertainty, in a substantial proportion of circumstances, relative model selection identifies distinct best-fitting models from different MSAs created from the same set of sequences. We find that this issue is more pervasive for nucleotide data than for amino-acid data. However, we also find that it is challenging to predict whether relative model selection will be robust or sensitive to uncertainty in a given MSA. We find that MSA uncertainty can affect virtually all steps of phylogenetic analysis pipelines to a greater extent than has previously been recognized, including relative model selection.


2021 ◽  
Author(s):  
Tomer Tsaban ◽  
Julia K Varga ◽  
Orly Avraham ◽  
Ziv Ben Aharon ◽  
Alisa Khramushin ◽  
...  

Highly accurate protein structure predictions by the recently published deep neural networks such as AlphaFold2 and RoseTTAFold are truly impressive achievements, and will have a tremendous impact far beyond structural biology. If peptide-protein binding can be seen as a final complementing step in the folding of a protein monomer, we reasoned that these approaches might be applicable to the modeling of such interactions. We present a simple implementation of AlphaFold2 to model the structure of peptide-protein interactions, enabled by linking the peptide sequence to the protein C-terminus via a poly-glycine linker. We show on a large non-redundant set of 162 peptide-protein complexes that peptide-protein interactions can indeed be modeled accurately. Importantly, prediction is fast and works without multiple sequence alignment information for the peptide partner. We compare performance on a smaller, representative set to the state-of-the-art peptide docking protocol PIPER-FlexPepDock, and describe in detail specific examples that highlight advantages of the two approaches, pointing to possible further improvements and insights in the modeling of peptide-protein interactions. Peptide-mediated interactions play important regulatory roles in functional cells. Thus, the present advance holds much promise for significant impact, by bringing into reach a wide range of peptide-protein complexes, and providing important starting points for detailed study and manipulation of many specific interactions.
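
The linking step described in this abstract amounts to building a single input sequence of the form receptor + glycine linker + peptide. The sketch below shows that construction under stated assumptions: the function name, the default linker length of 30 residues, and the example sequences are hypothetical, and the code does not call AlphaFold2 itself.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def link_peptide(receptor_seq: str, peptide_seq: str, linker_length: int = 30) -> str:
    """Join a peptide to the receptor C-terminus through a poly-glycine linker.

    linker_length=30 is an illustrative default, not a value taken from
    the abstract.
    """
    receptor = receptor_seq.strip().upper()
    peptide = peptide_seq.strip().upper()
    if linker_length < 0:
        raise ValueError("linker_length must be non-negative")
    for name, seq in (("receptor", receptor), ("peptide", peptide)):
        if not seq:
            raise ValueError(f"{name} sequence is empty")
        unknown = set(seq) - STANDARD_AMINO_ACIDS
        if unknown:
            raise ValueError(f"{name} has non-standard residues: {sorted(unknown)}")
    # The receptor comes first so the linker starts at its C-terminus.
    return receptor + "G" * linker_length + peptide


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(link_peptide("MKTAYIAKQR", "PPLPPR", linker_length=10))
```

The resulting string would then be passed to a structure predictor as if it were one monomer chain, with the linker residues ignored when the complex is analysed.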


2022 ◽  
Vol 13 (1) ◽  
Author(s):  
Tomer Tsaban ◽  
Julia K. Varga ◽  
Orly Avraham ◽  
Ziv Ben-Aharon ◽  
Alisa Khramushin ◽  
...  

Abstract Highly accurate protein structure predictions by deep neural networks such as AlphaFold2 and RoseTTAFold have a tremendous impact on structural biology and beyond. Here, we show that, although these deep learning approaches were originally developed for the in silico folding of protein monomers, AlphaFold2 also enables quick and accurate modeling of peptide–protein interactions. Our simple implementation of AlphaFold2 generates peptide–protein complex models without requiring multiple sequence alignment information for the peptide partner, and can handle binding-induced conformational changes of the receptor. We explore what AlphaFold2 has memorized and learned, and describe specific examples that highlight differences compared to the state-of-the-art peptide docking protocol PIPER-FlexPepDock. These results show that AlphaFold2 holds great promise for providing structural insight into a wide range of peptide–protein complexes, serving as a starting point for the detailed characterization and manipulation of these interactions.


2021 ◽  
Author(s):  
Tomer Tsaban ◽  
Julia Varga ◽  
Orly Avraham ◽  
Ziv Ben-Aharon ◽  
Alisa Khramushin ◽  
...  

Abstract Highly accurate protein structure predictions by the recently published deep neural networks such as AlphaFold2 and RoseTTAFold are truly impressive achievements, and will have a tremendous impact far beyond structural biology. If peptide-protein binding can be seen as a final complementing step in the folding of a protein monomer, we reasoned that these approaches might be applicable to the modeling of such interactions. We present a simple implementation of AlphaFold2 to model the structure of peptide-protein interactions, enabled by linking the peptide sequence to the protein C-terminus via a poly-glycine linker. We show on a large non-redundant set of 162 peptide-protein complexes that peptide-protein interactions can indeed be modeled accurately. Importantly, prediction is fast and works without multiple sequence alignment information for the peptide partner. We compare performance on a smaller, representative set to the state-of-the-art peptide docking protocol PIPER-FlexPepDock, and describe in detail specific examples that highlight advantages of the two approaches, pointing to possible further improvements and insights in the modeling of peptide-protein interactions. Peptide-mediated interactions play important regulatory roles in functional cells. Thus, the present advance holds much promise for significant impact, by bringing into reach a wide range of peptide-protein complexes, and providing important starting points for detailed study and manipulation of many specific interactions.


2019 ◽  
Author(s):  
Michael Gerth

ABSTRACT Molecular phylogenetics is a standard tool in modern biology that informs the evolutionary history of genes, organisms, and traits, and as such is important in a wide range of disciplines from medicine to palaeontology. Maximum likelihood phylogenetic reconstruction involves assumptions about the evolutionary processes that underlie the dataset to be analysed. These assumptions must be specified in the form of an evolutionary model, and a number of criteria may be used to identify the best-fitting one from a plethora of available models of DNA evolution. Using many empirical and simulated nucleotide sequence alignments, Abadi et al.¹ have recently found that phylogenetic inferences using best models identified by six different model selection criteria are, on average, very similar to each other. They further claimed that using the model GTR+I+G4 without prior model-fitting results in similarly accurate phylogenetic estimates, and consequently that skipping model selection entirely has no negative impact on many phylogenetic applications. Focussing on this claim, I here revisit and re-analyse some of the data put forward by Abadi et al. I argue that while the presented analyses are sound, the results are misrepresented and in fact, in line with previous work, demonstrate that model selection consistently leads to different phylogenetic estimates compared with using fixed models.


2012 ◽  
Vol 9 (1) ◽  
pp. 1
Author(s):  
Mohd Fakharul Zaman Raja Yahya ◽  
Hasidah Mohd Sidek

Malaria parasites of the genus Plasmodium can infect a wide range of hosts, including humans and rodents. There are two copies of mitogen-activated protein kinases (MAPKs) in Plasmodium, namely MAPK1 and MAPK2. The MAPKs have been studied extensively in the human parasite P. falciparum. However, the MAPKs from other Plasmodium species have not been characterized, and the premise of the present study is therefore to characterize the MAPKs from other Plasmodium species (P. vivax, P. knowlesi, P. berghei, P. chabaudi and P. yoelii) using a series of publicly available bioinformatic tools. In silico data indicate that all Plasmodium MAPKs are nuclear-localized and contain both a nuclear localization signal (NLS) and a leucine-rich nuclear export signal (NES). The activation motifs TDY and TSH were found to be fully conserved in Plasmodium MAPK1 and MAPK2, respectively. Detailed manual inspection of a multiple sequence alignment (MSA) construct revealed a total of 17 amino acid stack patterns comprising different amino acids present in MAPK1 and MAPK2, respectively, with respect to rodent and human Plasmodia. It is proposed that these amino acid stack patterns may be useful in explaining the disparity between rodent and human Plasmodium MAPKs.


2012 ◽  
Vol 9 (1) ◽  
pp. 1
Author(s):  
Mohd Fakharul Zaman Raja Yahya ◽  
Hasidah Mohd Sidek

Malaria parasites of the genus Plasmodium can infect a wide range of hosts, including humans and rodents. There are two copies of mitogen-activated protein kinases (MAPKs) in Plasmodium, namely MAPK1 and MAPK2. The MAPKs have been studied extensively in the human parasite P. falciparum. However, the MAPKs from other Plasmodium species have not been characterized, and the premise of the present study is therefore to characterize the MAPKs from other Plasmodium species (P. vivax, P. knowlesi, P. berghei, P. chabaudi and P. yoelii) using a series of publicly available bioinformatic tools. In silico data indicate that all Plasmodium MAPKs are nuclear-localized and contain both a nuclear localization signal (NLS) and a leucine-rich nuclear export signal (NES). The activation motifs TDY and TSH were found to be fully conserved in Plasmodium MAPK1 and MAPK2, respectively. Detailed manual inspection of a multiple sequence alignment (MSA) construct revealed a total of 17 amino acid stack patterns comprising different amino acids present in MAPK1 and MAPK2, respectively, with respect to rodent and human Plasmodia. It is proposed that these amino acid stack patterns may be useful in explaining the disparity between rodent and human Plasmodium MAPKs.


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Malte Seemann ◽  
Lennart Bargsten ◽  
Alexander Schlaefer

Abstract Deep learning methods produce promising results when applied to a wide range of medical imaging tasks, including segmentation of artery lumen in computed tomography angiography (CTA) data. However, to perform sufficiently well, neural networks have to be trained on large amounts of high-quality annotated data. In the realm of medical imaging, annotations are not only quite scarce but also often not entirely reliable. To tackle both challenges, we developed a two-step approach for generating realistic synthetic CTA data for the purpose of data augmentation. In the first step, moderately realistic images are generated in a purely numerical fashion. In the second step, these images are improved by applying neural domain adaptation. We evaluated the impact of synthetic data on lumen segmentation via convolutional neural networks (CNNs) by comparing the resulting performances. Improvements of up to 5% in terms of Dice coefficient and 20% for Hausdorff distance represent a proof of concept that the proposed augmentation procedure can be used to enhance deep learning-based segmentation of artery lumen in CTA images.
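
The Dice coefficient reported in this abstract measures the overlap between a predicted and a reference segmentation mask, 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions, and the example masks are made up.

```python
def dice_coefficient(pred, target) -> float:
    """Dice overlap between two equally sized binary masks (flat sequences).

    Any truthy entry counts as foreground. Returns 1.0 when both masks are
    empty, which is a convention chosen here for illustration.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]   # toy predicted lumen mask
    reference = [1, 0, 0, 0]   # toy annotated lumen mask
    print(dice_coefficient(predicted, reference))
```

A value of 1.0 means perfect overlap and 0.0 means no shared foreground; a 2D or 3D mask can be flattened before being passed in.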

