Deep residual neural networks resolve quartet molecular phylogenies

Abstract Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification and insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex non-linear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl).


Deep Residual Neural Networks Resolve Quartet Molecular Phylogenies

Molecular Biology and Evolution ◽

10.1093/molbev/msz307 ◽

2019 ◽

Vol 37 (5) ◽

pp. 1495-1507 ◽

Cited By ~ 1

Author(s):

Zhengting Zou ◽

Hongjiu Zhang ◽

Yuanfang Guan ◽

Jianzhi Zhang

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Sequence Data ◽

Model Misspecification ◽

Phylogenetic Reconstruction ◽

Likelihood Method ◽

Primary Data ◽

Sequence Evolution ◽

Residual Network ◽

Inference Problems

Abstract Phylogenetic inference is of fundamental importance to evolutionary as well as other fields of biology, and molecular sequences have emerged as the primary data for this task. Although many phylogenetic methods have been developed to explicitly take into account substitution models of sequence evolution, such methods could fail due to model misspecification or insufficiency, especially in the face of heterogeneities in substitution processes across sites and among lineages. In this study, we propose to infer topologies of four-taxon trees using deep residual neural networks, a machine learning approach needing no explicit modeling of the subject system and having a record of success in solving complex nonlinear inference problems. We train residual networks on simulated protein sequence data with extensive amino acid substitution heterogeneities. We show that the well-trained residual network predictors can outperform existing state-of-the-art inference methods such as the maximum likelihood method on diverse simulated test data, especially under extensive substitution heterogeneities. Reassuringly, residual network predictors generally agree with existing methods in the trees inferred from real phylogenetic data with known or widely believed topologies. Furthermore, when combined with the quartet puzzling algorithm, residual network predictors can be used to reconstruct trees with more than four taxa. We conclude that deep learning represents a powerful new approach to phylogenetic reconstruction, especially when sequences evolve via heterogeneous substitution processes. We present our best trained predictor in a freely available program named Phylogenetics by Deep Learning (PhyDL, https://gitlab.com/ztzou/phydl; last accessed January 3, 2020).
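The inference target described above is small: four taxa admit exactly three unrooted tree topologies, so a quartet predictor chooses among three classes. The following is a minimal sketch of that label space and of the four-taxon subsets that a quartet-based method would need to resolve for a larger taxon set. The function names are hypothetical and are not taken from PhyDL.

```python
from itertools import combinations

def quartet_topologies(taxa):
    """Return the three unrooted topologies for four taxa.

    Each topology is given as a pair of cherries, e.g. (("A", "B"), ("C", "D")).
    """
    if len(set(taxa)) != 4:
        raise ValueError("exactly four distinct taxa are required")
    first = taxa[0]
    topologies = []
    for partner in taxa[1:]:
        # Pair the first taxon with each of the others; the remaining two form the other cherry.
        left = (first, partner)
        right = tuple(t for t in taxa[1:] if t != partner)
        topologies.append((left, right))
    return topologies

def quartets(taxa):
    """All four-taxon subsets of a larger taxon set."""
    return list(combinations(taxa, 4))

for left, right in quartet_topologies(["A", "B", "C", "D"]):
    print(f"(({left[0]},{left[1]}),({right[0]},{right[1]}))")
```

The number of quartets grows as n choose 4, which is why quartet-based reconstruction of larger trees relies on a combining step such as the quartet puzzling algorithm mentioned in the abstract.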


A Coalescent-Based Method for Detecting and Estimating Recombination From Gene Sequences

Genetics ◽

10.1093/genetics/160.3.1231 ◽

2002 ◽

Vol 160 (3) ◽

pp. 1231-1241 ◽

Cited By ~ 5

Author(s):

Gil McVean ◽

Philip Awadalla ◽

Paul Fearnhead

Keyword(s):

Recombination Rate ◽

Evolutionary Biology ◽

Sequence Data ◽

Recurrent Mutation ◽

Likelihood Method ◽

Viral Population ◽

Sequence Evolution ◽

Infinite Sites Model ◽

High Level ◽

Population Recombination

Abstract Determining the amount of recombination in the genealogical history of a sample of genes is important to both evolutionary biology and medical population genetics. However, recurrent mutation can produce patterns of genetic diversity similar to those generated by recombination and can bias estimates of the population recombination rate. Hudson (2001) has suggested an approximate-likelihood method based on coalescent theory to estimate the population recombination rate, 4N_e r, under an infinite-sites model of sequence evolution. Here we extend the method to the estimation of the recombination rate in genomes, such as those of many viruses and bacteria, where the rate of recurrent mutation is high. In addition, we develop a powerful permutation-based method for detecting recombination that is both more powerful than other permutation-based methods and robust to misspecification of the model of sequence evolution. We apply the method to sequence data from viruses, bacteria, and human mitochondrial DNA. The extremely high level of recombination detected in both HIV1 and HIV2 sequences demonstrates that recombination cannot be ignored in the analysis of viral population genetic data.
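The general idea of a permutation-based test for recombination can be sketched as follows. This is an illustration only, under stated assumptions: the statistic used here (the correlation between pairwise linkage disequilibrium r^2 and physical distance) and all function names are choices made for the example, not necessarily the statistic or implementation of the paper. Shuffling site positions destroys any relationship between distance and linkage disequilibrium, which gives a null distribution for the statistic.

```python
import random
from itertools import combinations

def r_squared(site_a, site_b):
    """LD r^2 between two biallelic sites coded as 0/1 per sampled sequence."""
    n = len(site_a)
    pa = sum(site_a) / n
    pb = sum(site_b) / n
    pab = sum(a and b for a, b in zip(site_a, site_b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    return (pab - pa * pb) ** 2 / denom

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def ld_distance_correlation(sites, positions):
    """Correlation between r^2 and distance over all pairs of sites."""
    pairs = list(combinations(range(len(sites)), 2))
    lds = [r_squared(sites[i], sites[j]) for i, j in pairs]
    dists = [abs(positions[i] - positions[j]) for i, j in pairs]
    return pearson(lds, dists)

def permutation_test(sites, positions, n_perm=999, seed=0):
    """One-sided test: is LD decay with distance stronger than under shuffled positions?"""
    rng = random.Random(seed)
    observed = ld_distance_correlation(sites, positions)
    shuffled = list(positions)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ld_distance_correlation(sites, shuffled) <= observed:
            as_extreme += 1
    return observed, (1 + as_extreme) / (1 + n_perm)

# Toy data: five segregating sites scored in six sequences (made-up values).
sites = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
]
positions = [10, 55, 300, 720, 990]
observed, p_value = permutation_test(sites, positions, n_perm=199)
print(observed, p_value)
```

A strongly negative observed correlation with a small p-value would indicate that nearby sites are more tightly linked than distant ones, the pattern expected when recombination has occurred.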


ANNAVP, using neural networks to predict neutralization efficiency of antibodies against viral strains and to cluster strains by protein sequence

10.1101/2020.09.21.307074 ◽

2020 ◽

Author(s):

Ghiță Iulian Cristian

Keyword(s):

Neural Networks ◽

Unsupervised Learning ◽

Protein Sequence ◽

Viral Protein ◽

Sequence Data ◽

Complex Task ◽

Antibody Neutralization ◽

Viral Antibody ◽

Protein Sequence Data ◽

Different Strains

Abstract Studying viral antibody neutralization data is a complex task, and knowledge about the effectiveness of a particular antibody against particular strains of viruses cannot easily be extrapolated to other new, related strains. We have developed ANNAVP, software that uses neural networks to model viral protein data. ANNAVP uses supervised or unsupervised learning and viral protein sequence data to form correlations between different strains and to predict the effectiveness of neutralizing agents against them.


AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Entropy ◽

10.3390/e23050530 ◽

2021 ◽

Vol 23 (5) ◽

pp. 530

Author(s):

Milton Silva ◽

Diogo Pratas ◽

Armando J. Pinho

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Specific Protein ◽

General Purpose ◽

Amino Acid Sequences ◽

Input Size ◽

Protein Sequence Data ◽

Analysis Application ◽

Straightforward Solution ◽

Human Coronaviruses

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under GPLv3 license.
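The idea of using a compressor to measure sequence similarity can be illustrated with a short sketch. Two assumptions are made here and should be read as such: Python's general-purpose `zlib` stands in for a specialized compressor such as AC2, and the normalized compression distance (NCD) is used as a common compression-based measure, which is not necessarily the measure used in the paper.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input at maximum compression level."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy sequences drawn at random from the 20 amino acid letters (not real proteins).
rng = random.Random(1)
alphabet = "ACDEFGHIKLMNPQRSTVWY"
seq_a = "".join(rng.choice(alphabet) for _ in range(400))
seq_b = "".join(rng.choice(alphabet) for _ in range(400))

print(ncd(seq_a, seq_a))  # identical sequences: small distance
print(ncd(seq_a, seq_b))  # unrelated sequences: large distance
```

A better compressor gives a tighter estimate of shared information, which is one reason a specialized protein compressor is attractive for this kind of analysis.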


Evolutionary Analysis of Plastid Genomes of Seven Lonicera L. Species: Implications for Sequence Divergence and Phylogenetic Relationships

International Journal of Molecular Sciences ◽

10.3390/ijms19124039 ◽

2018 ◽

Vol 19 (12) ◽

pp. 4039 ◽

Cited By ~ 5

Author(s):

Mi-Li Liu ◽

Wei-Bing Fan ◽

Ning Wang ◽

Peng-Bin Dong ◽

Ting-Ting Zhang ◽

...

Keyword(s):

Molecular Genetic ◽

Phylogenetic Reconstruction ◽

Sequence Divergence ◽

Genomic Analysis ◽

Repeat Sequence ◽

Biological Research ◽

Comparative Genomic ◽

Sequence Evolution ◽

Evolutionary Analysis ◽

Plastid Genomes

Plant plastomes play crucial roles in species evolution and phylogenetic reconstruction studies because they are maternally inherited and evolve at a moderate rate. However, patterns of sequence divergence and molecular evolution of the plastid genomes in the horticulturally and economically important Lonicera L. species are poorly understood. In this study, we collected the complete plastomes of seven Lonicera species and determined the various repeat sequence variations and protein sequence evolution by comparative genomic analysis. A total of 498 repeats were identified in plastid genomes, which included tandem (130), dispersed (277), and palindromic (91) types of repeat variations. Simple sequence repeat (SSR) elements analysis indicated the enriched SSRs in seven genomes to be mononucleotides, followed by tetra-nucleotides, dinucleotides, tri-nucleotides, hexa-nucleotides, and penta-nucleotides. We identified 18 divergence hotspot regions (rps15, rps16, rps18, rpl23, psaJ, infA, ycf1, trnN-GUU-ndhF, rpoC2-rpoC1, rbcL-psaI, trnI-CAU-ycf2, psbZ-trnG-UCC, trnK-UUU-rps16, infA-rps8, rpl14-rpl16, trnV-GAC-rrn16, trnL-UAA intron, and rps12-clpP) that could be used as potential molecular genetic markers for the further study of population genetics and phylogenetic evolution of Lonicera species. We found that a large number of repeat sequences were distributed in the divergence hotspots of plastid genomes. Interestingly, 16 genes were determined to be under positive selection, which included four genes for the subunits of ribosome proteins (rps7, rpl2, rpl16, and rpl22), three genes for the subunits of photosystem proteins (psaJ, psbC, and ycf4), three NADH oxidoreductase genes (ndhB, ndhH, and ndhK), two subunits of ATP genes (atpA and atpB), and four other genes (infA, rbcL, ycf1, and ycf2). Phylogenetic analysis based on the whole plastome demonstrated that the seven Lonicera species form a highly supported monophyletic clade. The availability of these plastid genomes provides important genetic information for further species identification and biological research on Lonicera.
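The SSR classification described above (mono- through hexa-nucleotide repeats) can be sketched with regular expressions. This is a minimal illustration, not the tool used in the study: the minimum repeat counts below are illustrative assumptions, and the function names are hypothetical.

```python
import re

# Assumed minimum number of repeat units per motif length (illustrative thresholds).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT' is not)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in a DNA sequence."""
    seq = seq.upper()
    hits = []
    for k, m in sorted(min_repeats.items()):
        # A motif of length k followed by at least m - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, m - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)

example = "CCG" + "A" * 12 + "CG" + "AT" * 6 + "GGC"
for start, motif, count in find_ssrs(example):
    print(start, motif, count)
```

The primitive-motif check prevents a run such as twelve A's from also being reported as a dinucleotide or trinucleotide repeat.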


Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
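A minimal tree is one that requires the fewest character changes to explain the sequences. The sketch below does not implement the tree-building method of the paper; it only shows how the length of one fixed tree can be counted, using Fitch's algorithm as a standard stand-in. The tree encoding (nested pairs) and the toy alignment are assumptions made for the example.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a fixed rooted binary tree.

    tree: nested 2-tuples with leaf names as strings, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping each leaf name to its character at this site.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # Children can agree on a state: no extra change needed here.
            return common, left_cost + right_cost
        # Children disagree: one change is required on a branch below this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]

def tree_length(tree, alignment):
    """Total parsimony length of a tree: the sum of per-site Fitch scores."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )

# Toy nucleotide alignment (made-up data).
alignment = {"A": "GA", "B": "GA", "C": "TA", "D": "TC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

Here the first tree needs fewer changes than the second, so it is the shorter of the two. A search for the minimal tree compares such lengths across candidate trees, which is the hard part that the paper's taxon-by-taxon approach addresses.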


Hyperparameters optimization for ResNet and Xception in the purpose of diagnosing COVID-19

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-210925 ◽

2021 ◽

pp. 1-17

Author(s):

Hania H. Farag ◽

Lamiaa A. A. Said ◽

Mohamed R. M. Rizk ◽

Magdy Abd ElAzim Ahmed

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Learning ◽

Convolutional Neural Network ◽

Random Search ◽

Learning Networks ◽

Residual Network ◽

Global Pandemic ◽

Search Optimization

COVID-19 is considered a global pandemic. Recently, researchers have been using deep learning networks for medical diagnosis. Some of this research focuses on optimizing deep neural networks to improve accuracy. Optimizing a Convolutional Neural Network involves testing various networks obtained by manually configuring their hyperparameters; the configuration with the highest accuracy is then implemented. Each time a different database is used, a different combination of hyperparameters is required. This paper introduces two COVID-19 diagnosis systems, based on the Residual Network and the Xception Network, each optimized by random search for the purpose of finding optimal models that give better diagnosis rates for COVID-19. The proposed systems showed that hyperparameter tuning for ResNet and Xception Net using random search optimization gives more accurate results than other techniques, with accuracies of 99.27536% and 100%, respectively. We conclude that hyperparameter tuning using random search optimization, for either the tuned Residual Network or the tuned Xception Network, gives better accuracy than other techniques for diagnosing COVID-19.
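Random search over hyperparameters can be sketched in a few lines. Everything specific in this example is an assumption: the search space, the parameter names, and the toy objective are placeholders. In practice the objective would train a network with the given configuration and return its validation accuracy.

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample configurations uniformly from a discrete space and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical search space (names and values are illustrative only).
space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "batch_size": [16, 32, 64],
    "dropout": [0.0, 0.25, 0.5],
}

def toy_objective(config):
    """Stand-in for validation accuracy; peaks at learning_rate=1e-3, dropout=0.25."""
    return 1.0 - abs(config["learning_rate"] - 1e-3) * 50 - abs(config["dropout"] - 0.25)

best_config, best_score = random_search(toy_objective, space, n_trials=30)
print(best_config, best_score)
```

Unlike manual configuration, the procedure needs no per-dataset intuition: only the space and the trial budget are chosen in advance.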


phastSim: efficient simulation of sequence evolution for pandemic-scale datasets

10.1101/2021.03.15.435416 ◽

2021 ◽

Author(s):

Nicola De Maio ◽

Lukas Weilguny ◽

Conor R. Walker ◽

Yatish Turakhia ◽

Russell Corbett-Detig ◽

...

Keyword(s):

Sequence Data ◽

Search Tree ◽

Sequence Evolution ◽

Genomic Epidemiology ◽

Efficient Simulation ◽

Processing Power ◽

Large Trees ◽

Easy Integration ◽

Inference Methods ◽

High Computational Efficiency

Abstract Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods. The ongoing surge in available sequence data is, however, testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100,000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software is available from https://github.com/NicolaDM/phastSim and allows easy integration with other Python packages as well as a variety of evolutionary models, including new ones that we developed to more realistically model SARS-CoV-2 genome evolution.
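The Gillespie approach mentioned above can be illustrated on a single branch. This sketch is not phastSim's algorithm: it assumes the simplest substitution model (JC69, every site mutating at rate 1 and moving to any other nucleotide with equal probability) and picks the mutating site uniformly, whereas the abstract describes a multi-layered search tree for choosing sites when rates differ. The function name is hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve_branch(seq, branch_length, rng):
    """Gillespie simulation of substitutions along one branch under JC69.

    branch_length is in expected substitutions per site.
    Returns the descendant sequence and the list of (time, site, old, new) events.
    """
    if not seq:
        return "", []
    seq = list(seq)
    total_rate = float(len(seq))  # every site mutates at rate 1
    events = []
    t = rng.expovariate(total_rate)  # waiting time to the first event
    while t < branch_length:
        site = rng.randrange(len(seq))  # all rates are equal, so the site is uniform
        new = rng.choice([n for n in NUCLEOTIDES if n != seq[site]])
        events.append((t, site, seq[site], new))
        seq[site] = new
        t += rng.expovariate(total_rate)  # waiting time to the next event
    return "".join(seq), events

rng = random.Random(42)
ancestor = "".join(rng.choice(NUCLEOTIDES) for _ in range(1000))
descendant, events = evolve_branch(ancestor, 0.002, rng)
print(len(events), "substitution events on this branch")
```

The work done is proportional to the number of events rather than to the genome length, which is why this style of simulation suits short branches where only a few sites change.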


Disentangling selection on genetically correlated polygenic traits using whole-genome genealogies

10.1101/2020.05.07.083402 ◽

2020 ◽

Author(s):

Aaron J. Stern ◽

Leo Speidel ◽

Noah A. Zaitlen ◽

Rasmus Nielsen

Keyword(s):

Dna Sequence ◽

Population Genetic ◽

Sequence Data ◽

Directional Selection ◽

Likelihood Method ◽

Correlated Response ◽

Correlated Traits ◽

Dna Sequence Data ◽

Polygenic Traits ◽

Polygenic Trait

Abstract We present a full-likelihood method to estimate and quantify polygenic adaptation from contemporary DNA sequence data. The method combines population genetic DNA sequence data and GWAS summary statistics from up to thousands of nucleotide sites in a joint likelihood function to estimate the strength of transient directional selection acting on a polygenic trait. Through population genetic simulations of polygenic trait architectures and GWAS, we show that the method substantially improves power over current methods. We examine the robustness of the method under uncorrected GWAS stratification, uncertainty and ascertainment bias in the GWAS estimates of SNP effects, uncertainty in the identification of causal SNPs, allelic heterogeneity, negative selection, and low GWAS sample size. The method can quantify selection acting on correlated traits, fully controlling for pleiotropy even among traits with strong genetic correlation (|rg| = 80%; cf. schizophrenia and bipolar disorder) while retaining high power to attribute selection to the causal trait. We apply the method to study 56 human polygenic traits for signs of recent adaptation. We find signals of directional selection on pigmentation (tanning, sunburn, hair, P=5.5e-15, 1.1e-11, 2.2e-6, respectively), life history traits (age at first birth, EduYears, P=2.5e-4, 2.6e-4, respectively), glycated hemoglobin (HbA1c, P=1.2e-3), bone mineral density (P=1.1e-3), and neuroticism (P=5.5e-3). We also conduct joint testing of 137 pairs of genetically correlated traits. We find evidence of widespread correlated response acting on these traits (2.6-fold enrichment over the null expectation, P=1.5e-7). We find that for several traits previously reported as adaptive, such as educational attainment and hair color, a significant proportion of the signal of selection on these traits can be attributed to correlated response, vs direct selection (P=2.9e-6, 1.7e-4, respectively). Lastly, our joint test uncovers antagonistic selection that has acted to increase type 2 diabetes (T2D) risk and decrease HbA1c (P=1.5e-5).


Significant cross-species gene flow detected in the Tamias quadrivittatus group of North American chipmunks

10.1101/2021.12.07.471567 ◽

2021 ◽

Author(s):

Jiayi Ji ◽

Donavan J. Jackson ◽

Adam D. Leaché ◽

Ziheng Yang

Keyword(s):

Gene Flow ◽

Sequence Data ◽

Nuclear Genome ◽

Likelihood Method ◽

Recent Analysis ◽

The Past ◽

Rapid Speciation ◽

Full Likelihood ◽

Nuclear Loci ◽

Robust Evidence

In the past two decades, genomic data have been widely used to detect historical gene flow between species in a variety of plants and animals. The Tamias quadrivittatus group of North American chipmunks, which originated through a series of rapid speciation events, is known to have undergone massive amounts of mitochondrial introgression. Yet in a recent analysis of targeted nuclear loci from the group, no evidence for cross-species introgression was detected, indicating widespread cytonuclear discordance. The study used heuristic methods that analyze summaries of the multilocus sequence data to detect gene flow, which may suffer from low power. Here we use the full likelihood method implemented in the Bayesian program BPP to reanalyze these data. We take a stepwise approach to constructing an introgression model by adding introgression events onto a well-supported binary species tree. The analysis detected robust evidence for multiple ancient introgression events affecting the nuclear genome, with introgression probabilities reaching 65%. We estimate population parameters and highlight the fact that species divergence times may be seriously underestimated if ancient cross-species gene flow is ignored in the analysis. Our analyses highlight the importance of using adequate statistical methods to reach reliable biological conclusions concerning cross-species gene flow.
