On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.

Phylogenetic correlations have limited effect on coevolution-based contact prediction in proteins

10.1101/2020.08.12.247577 ◽

2020 ◽

Cited By ~ 1

Author(s):

Edwin Rodriguez Horta ◽

Martin Weigt

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Protein Sequences ◽

Protein Families ◽

Coupling Analysis ◽

Contact Prediction ◽

Phylogenetic Relations ◽

Direct Coupling Analysis

Abstract

Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop two strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. An analysis of these data shows that the strongest coevolutionary couplings, i.e. those used by Direct Coupling Analysis to predict contacts, are only weakly influenced by phylogeny. However, phylogeny-induced spurious couplings are of similar size to the bulk of coevolutionary couplings, and dissecting functional from phylogeny-induced couplings might lead to more accurate contact predictions in the range of intermediate-size couplings. The code is available at https://github.com/ed-rodh/Null_models_I_and_II.

Author summary

Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold into close-to-identical three-dimensional structures and fulfill almost identical biological tasks. Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches also have to assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose null models for sequence alignments, which maintain the patterns of amino-acid conservation and phylogeny contained in the data, but destroy the coevolutionary couplings frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.
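
The resampling idea can be illustrated with a much simpler randomization than the two null models the abstract refers to. The sketch below independently permutes each alignment column, which keeps every column's amino-acid frequencies (conservation) while breaking correlations between columns. It is only an illustration under that assumption: unlike the authors' null models, it also destroys phylogenetic relations, and the function names and toy alignment are hypothetical.

```python
import random
from collections import Counter


def shuffle_columns(msa, seed=0):
    """Independently permute each column of an alignment.

    `msa` is a list of equal-length strings (one aligned sequence per row).
    Per-column residue counts are preserved; correlations between columns,
    including those caused by phylogeny, are destroyed.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*msa)]
    for col in columns:
        rng.shuffle(col)
    # Transpose back from columns to rows.
    return ["".join(row) for row in zip(*columns)]


def column_counts(msa):
    """Residue counts per alignment column."""
    return [Counter(col) for col in zip(*msa)]


# Hypothetical toy alignment, for illustration only.
msa = ["ACDE", "ACDF", "GCHE", "GKHF"]
shuffled = shuffle_columns(msa)
```

Comparing couplings inferred from `msa` and from `shuffled` would then indicate how much signal survives without coevolution; a phylogeny-preserving null model, as described in the abstract, requires a more careful construction than this.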

Evaluating DCA-based method performances for RNA contact prediction by a well-curated dataset

10.1101/822023 ◽

2019 ◽

Author(s):

F. Pucci ◽

M. Zerihun ◽

E. Peter ◽

A. Schug

Keyword(s):

Rna Structure ◽

Structure Prediction ◽

Sequence Data ◽

Three Dimensional ◽

Spatial Proximity ◽

Dimensional Structure ◽

Rna Sequences ◽

Contact Prediction ◽

Rna Molecules ◽

Quality Of Structure

Abstract

RNA molecules play many pivotal roles in cellular function that are still not fully understood. Any detailed understanding of RNA function requires knowledge of its three-dimensional structure, yet experimental RNA structure resolution remains demanding. Recent advances in sequencing provide unprecedented amounts of sequence data that can be statistically analysed by methods such as Direct Coupling Analysis (DCA) to determine spatial proximity or contacts of specific nucleic acid pairs, which improve the quality of structure prediction. To quantify this improvement, we here present a well-curated dataset of about seventy high-resolution RNA structures and compare different nucleotide-nucleotide contact prediction methods available in the literature. We observe only minor differences between the performances of the different methods. Moreover, we discuss how robust these predictions are to different contact definitions and how strongly they depend on the procedures used to curate and align the families of homologous RNA sequences.

The whole is greater than its parts: ensembling improves protein contact prediction

Scientific Reports ◽

10.1038/s41598-021-87524-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Wendy M. Billings ◽

Connor J. Morris ◽

Dennis Della Corte

Keyword(s):

Neural Networks ◽

Structure Prediction ◽

Deep Neural Networks ◽

Protein Structures ◽

Network Models ◽

Critical Assessment ◽

Neural Network Models ◽

Contact Prediction ◽

Homologous Sequences ◽

Contact Predictions

Abstract

The prediction of amino acid contacts from protein sequence is an important problem, as protein contacts are a vital step towards the prediction of folded protein structures. We propose that a powerful concept from deep learning, called ensembling, can increase the accuracy of protein contact predictions by combining the outputs of different neural network models. We show that ensembling the predictions made by different groups at the recent Critical Assessment of Protein Structure Prediction (CASP13) outperforms all individual groups. Further, we show that contacts derived from the distance predictions of three additional deep neural networks (AlphaFold, trRosetta, and ProSPr) can be substantially improved by ensembling all three networks. We also show that ensembling these recent deep neural networks with the best CASP13 group creates a superior contact prediction tool. Finally, we demonstrate that two ensembled networks can successfully differentiate between the folds of two highly homologous sequences. In order to build further on these findings, we propose the creation of a better protein contact benchmark set and additional open-source contact prediction methods.
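
As a minimal sketch of what ensembling contact predictions can look like, the example below averages contact-probability maps element-wise and ranks residue pairs by the averaged score. The abstract does not state which combination rule the authors use, so plain averaging is an assumption here, and the function names and toy maps are hypothetical.

```python
def ensemble_contact_maps(maps):
    """Average several L x L contact-probability maps element-wise."""
    n = len(maps)
    size = len(maps[0])
    return [
        [sum(m[i][j] for m in maps) / n for j in range(size)]
        for i in range(size)
    ]


def top_contacts(cmap, k, min_sep=1):
    """Return the k highest-scoring residue pairs (i, j) with j - i >= min_sep."""
    pairs = [
        (cmap[i][j], i, j)
        for i in range(len(cmap))
        for j in range(i + min_sep, len(cmap))
    ]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:k]]


# Two hypothetical 3-residue predictions from different models.
model_a = [[0.0, 0.25, 1.0], [0.25, 0.0, 0.5], [1.0, 0.5, 0.0]]
model_b = [[0.0, 0.75, 0.5], [0.75, 0.0, 0.0], [0.5, 0.0, 0.0]]

averaged = ensemble_contact_maps([model_a, model_b])
best = top_contacts(averaged, k=2)
```

Averaging lets a pair that both models score moderately outrank a pair that only one model scores highly, which is one intuition for why an ensemble can beat its members.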

Frequent Closed Partial Orders Mining in Sequences

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.846-847.1304 ◽

2013 ◽

Vol 846-847 ◽

pp. 1304-1307

Author(s):

Ye Wang ◽

Yan Jia ◽

Lu Min Zhang

Keyword(s):

Sequence Data ◽

Real Data ◽

Partial Orders ◽

Hard Problem ◽

Important Data ◽

Data Set ◽

Pruning Algorithm ◽

Equal Chance ◽

Np Hard Problem ◽

General Sequences

Mining partial orders from sequence data is an important data mining task with broad applications. As partial order mining is an NP-hard problem, many efficient pruning algorithms have been proposed. In this paper, we improve a classical algorithm for discovering frequent closed partial orders (FCPO) from strings. For general sequences, we treat items that appear together as having an equal chance when calculating the detecting matrix used for pruning. Experimental evaluations on a real data set show that our algorithm can effectively mine FCPO from sequences.

Modeling the Process of Event Sequence Data Generated for Working Condition Diagnosis

Mathematical Problems in Engineering ◽

10.1155/2015/693450 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13

Author(s):

Jianwei Ding ◽

Yingbo Liu ◽

Li Zhang ◽

Jianmin Wang

Keyword(s):

Working Condition ◽

Sequence Data ◽

A Priori ◽

Real Data ◽

Data Sets ◽

Main Task ◽

Event Sequence ◽

Telemetry Data ◽

Condition Monitoring Systems ◽

Condition Diagnosis

Condition monitoring systems are widely used to monitor the working condition of equipment, generating a vast amount and variety of telemetry data in the process. The main task of surveillance is to analyze these routinely collected telemetry data in order to assess the working condition of the equipment. However, with the rapid increase in the volume of telemetry data, it is a nontrivial task to analyze all the telemetry data to understand the working condition of the equipment without any a priori knowledge. In this paper, we propose a probabilistic generative model called the working condition model (WCM), which is capable of simulating the process by which event sequence data are generated and of depicting the working condition of equipment at runtime. With the help of WCM, we are able to analyze how the event sequence data behave in different working modes and to detect the working mode of an event sequence (working condition diagnosis). Furthermore, we have applied WCM to illustrative applications such as automated detection of anomalous event sequences during equipment runtime. Our experimental results on real data sets demonstrate the effectiveness of the model.

Modeling of Psychomotor Reactions of a Person Based on Modification of the Tapping Test

International Journal of Computing ◽

10.47839/ijc.20.2.2166 ◽

2021 ◽

pp. 190-200

Author(s):

Lesia Mochurad ◽

Yaroslav Hladun

Keyword(s):

Neural Network ◽

Machine Learning ◽

Time Series ◽

Real Data ◽

Finger Tapping ◽

Similar Distribution ◽

Model Learning ◽

Machine Learning Model ◽

Finger Tapping Test

The paper considers a method for analyzing the psychophysical state of a person from psychomotor indicators, namely the finger tapping test. A mobile phone app that generalizes the classic tapping test was developed for the experiments. The tool allows samples to be collected and analyzed both as individual experiments and as a dataset as a whole. The data are investigated for anomalies using statistical methods and hyperparameter optimization, and an algorithm for reducing the number of anomalies is developed. A machine learning model is used to predict different features of the dataset. These experiments demonstrate the structure of the data obtained with the finger tapping test. As a result, we gained knowledge of how to conduct experiments so that the model generalizes better in the future. The anomaly-removal method can be used in further research to increase the accuracy of the model. The developed model is a multilayer recurrent neural network, which works well for time series classification. The model's error is 1.5% on a synthetic dataset and 5% on real data from a similar distribution.

Phylogenetic Relations of Rhizoplaca Zopf. from Anatolia Inferred from ITS Sequence Data

Zeitschrift für Naturforschung C ◽

10.1515/znc-2006-5-617 ◽

2006 ◽

Vol 61 (5-6) ◽

pp. 405-412 ◽

Cited By ~ 4

Author(s):

Demet Cansaran ◽

Sümer Aras ◽

İrfan Kandemir ◽

Gökhan Halıcı

Keyword(s):

Genetic Variability ◽

Phylogenetic Tree ◽

Sequence Data ◽

Volcanic Rocks ◽

Morphological Characteristics ◽

Its Rdna ◽

Its Sequence ◽

Phylogenetic Relations ◽

Rdna Sequences

Like many lichen-forming fungi, species of the genus Rhizoplaca have wide geographical distributions, but studies of their genetic variability are limited. The information about the ITS rDNA sequences of three species of Rhizoplaca from Anatolia was generated and aligned with other species from other countries and also with the data belonging to Lecanora species. The examined species were collected from the volcanic rocks of Mount Erciyes which is located in the middle of Anatolia (Turkey). The sequence data aligned with eight other samples of Rhizoplaca and six different species of Lecanora were obtained from GenBank. The results support the concept maintained by Arup and Grube (2000) that Rhizoplaca may not be a genus separate from Lecanora. According to the phylogenetic tree, Rhizoplaca melanopthalma from Turkey with two different samples of R. melanopthalma from Arizona (AF159929, AF159934) and a sample from Austria formed a group under the same branch. R. peltata and R. chrysoleuca samples from Anatolia located in two other branches of the tree formed sister groups with the samples of the same species from different countries. Although R. peltata remained on the same branch with other samples of the same species from other countries it was placed in a different branch within the group. When the three species from Anatolia were considered alone, it was noticed that Rhizoplaca melanopthalma and Rhizoplaca peltata are phylogenetically closer to each other than Rhizoplaca chrysoleuca; the morphological characteristics also support this result.

Improved Compression of DNA Sequencing Data with Cascading Bloom Filters

International Journal of Foundations of Computer Science ◽

10.1142/s0129054118430013 ◽

2018 ◽

Vol 29 (08) ◽

pp. 1249-1255

Author(s):

Kamil Salikhov

Keyword(s):

Dna Sequencing ◽

Sequence Data ◽

Real Data ◽

Compression Algorithm ◽

Computational Experiments ◽

Bloom Filters ◽

Dna Fragments ◽

Sequencing Data ◽

Sequencing Technologies ◽

Memory Reduction

Modern DNA sequencing technologies generate prodigious volumes of sequence data consisting of short DNA fragments (reads). Storing and transferring this data is often challenging. With this motivation, several specialized compression methods have been developed. In this paper, we present an improvement of the lossless reference-free compression algorithm, suggested by Rozov et al., based on the technique of cascading Bloom filters. Through computational experiments on real data, we demonstrate that our method results in a significant associated memory reduction in practice.
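
For readers unfamiliar with the underlying data structure, the sketch below shows a single, basic Bloom filter: a bit array plus several hash functions, with possible false positives but no false negatives. It is not the cascading scheme of the paper or of Rozov et al., which layers further filters on top of this building block; the class name, parameters, and the SHA-256-based hashing are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Basic Bloom filter: membership tests may give false positives,
    but never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical short DNA fragments, for illustration only.
reads = ["ACGTAC", "TTGACA", "GGCATT"]
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for read in reads:
    bloom.add(read)
```

Every inserted read is reported as present; a read that was never inserted is usually, but not always, reported as absent, and handling those false positives is what additional filters in a cascade are for.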

MPF–BML: a standalone GUI-based package for maximum entropy model inference

Bioinformatics ◽

10.1093/bioinformatics/btz925 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2278-2279

Author(s):

Ahmed A Quadeer ◽

Matthew R McKay ◽

John P Barton ◽

Raymond H Y Louie

Keyword(s):

Maximum Entropy ◽

Fitness Landscape ◽

Sequence Data ◽

Surface Proteins ◽

Supplementary Information ◽

Model Parameters ◽

Maximum Entropy Model ◽

Entropy Model ◽

Widespread Application ◽

Maximum Entropy Models

Abstract

Summary: Learning underlying correlation patterns in data is a central problem across scientific fields. Maximum entropy models present an important class of statistical approaches for addressing this problem. However, accurately and efficiently inferring model parameters is a major challenge, particularly for modern high-dimensional applications such as in biology, for which the number of parameters is enormous. Previously, we developed a statistical method, minimum probability flow–Boltzmann Machine Learning (MPF–BML), for performing fast and accurate inference of maximum entropy model parameters, which was applied to genetic sequence data to estimate the fitness landscape for the surface proteins of human immunodeficiency virus and hepatitis C virus. To facilitate seamless use of MPF–BML and encourage more widespread application to data in diverse fields, we present a standalone cross-platform package of MPF–BML which features an easy-to-use graphical user interface. The package only requires the input data (protein sequence data or data of multiple configurations of a complex system with a large number of variables) and returns the maximum entropy model parameters.

Availability and implementation: The MPF–BML software is publicly available under the MIT License at https://github.com/ahmedaq/MPF-BML-GUI.

Supplementary information: Supplementary data are available at Bioinformatics online.

Session-Based Webshell Detection Using Machine Learning in Web Logs

Security and Communication Networks ◽

10.1155/2019/3093809 ◽

2019 ◽

Vol 2019 ◽

pp. 1-11 ◽

Cited By ~ 3

Author(s):

Yixin Wu ◽

Yuqiang Sun ◽

Cheng Huang ◽

Peng Jia ◽

Luping Liu

Keyword(s):

Statistical Method ◽

High Efficiency ◽

Feature Matching ◽

Short Term Memory ◽

Sequence Data ◽

Real Data ◽

Recall Rate ◽

Time Interval ◽

Web Logs ◽

Gradual Development

Attackers upload webshells to a web server in order to steal data, launch DDoS attacks, maliciously modify files, and so on. Once these objectives are accomplished, website managers suffer huge losses. With the gradual development of encryption and obfuscation technology, the most common detection approaches, which use taint analysis and feature matching, might become less useful. Instead of relying on source file code, POST contents, or all received traffic, this paper demonstrates an intelligent and efficient framework that uses precise sessions derived from web logs to detect webshell communication. Features are extracted from the raw sequence data in web logs, and a statistical method based on time intervals is proposed to identify sessions. The framework is built in two variants, one using a long short-term memory (LSTM) network and one using a hidden Markov model (HMM). Finally, the framework is evaluated with real data. The experiment shows that the LSTM-based model achieves an accuracy of 95.97% with a recall of 96.15%, performing much better than the HMM-based model. Moreover, the experiment demonstrates the high efficiency of the proposed approach, which detects quickly without source code, especially when detection is limited to a period of time: it takes 98.5% less time than the cited related approach to produce a result. Once webshell behavior is detected, we can pinpoint the anomalous session and use the statistical method to locate the webshell file accurately.
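
One simple way to identify sessions from time intervals is to start a new session whenever the gap between consecutive requests exceeds a threshold. The sketch below shows that rule only; the abstract does not describe the paper's statistical method in detail, so the fixed threshold, the function name, and the sample timestamps are all assumptions.

```python
def split_sessions(timestamps, max_gap):
    """Group request timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous request
    exceeds max_gap. The fixed threshold is an illustrative assumption.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical request times from one client, in seconds.
requests = [0, 10, 20, 500, 510, 2000]
sessions = split_sessions(requests, max_gap=60)
```

Each resulting session could then be turned into a feature sequence for a sequence model such as the LSTM or HMM mentioned in the abstract.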
