Fast and adaptive protein structure representations for machine learning

The growing prevalence and popularity of protein structure data, both experimental and computationally modelled, necessitates fast tools and algorithms to enable exploratory and interpretable structure-based machine learning. Alignment-free approaches have been developed for divergent proteins, but proteins sharing functional and structural similarity are often better understood via structural alignment, which has typically been too computationally expensive for larger datasets. Here, we introduce the concept of rotation-invariant shape-mers to multiple structure alignment, creating a structure aligner that scales well with the number of proteins and allows for aligning over a thousand structures in 20 minutes. We demonstrate how alignment-free shape-mer counts and aligned structural features, when used in machine learning tasks, can adapt to different levels of functional hierarchy in protein kinases, pinpointing residues and structural fragments that play a role in catalytic activity.
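
As a rough illustration of how alignment-free shape-mer counts could feed a machine learning model, the sketch below turns per-protein lists of shape-mer identifiers into fixed-length count vectors. The identifiers, protein names, and helper functions are hypothetical; the abstract does not specify how shape-mers are computed or encoded, so this only shows the counting step.

```python
from collections import Counter

def build_vocabulary(proteins):
    """Collect every distinct shape-mer ID seen in any protein, in sorted order."""
    return sorted({mer for mers in proteins.values() for mer in mers})

def count_vector(shape_mers, vocabulary):
    """Count how often each vocabulary entry occurs in one protein."""
    counts = Counter(shape_mers)
    return [counts[mer] for mer in vocabulary]

# Hypothetical shape-mer IDs (tuples of discretized descriptor values).
proteins = {
    "kinase_a": [(1, 0, 2), (1, 0, 2), (3, 1, 0)],
    "kinase_b": [(3, 1, 0), (2, 2, 2)],
}

vocabulary = build_vocabulary(proteins)
features = {name: count_vector(mers, vocabulary) for name, mers in proteins.items()}
```

Each row of `features` has the same length, so the vectors can be passed to any standard classifier, and a column's position maps back to a specific shape-mer for interpretation.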

GADP-align: A genetic algorithm and dynamic programming-based method for structural alignment of proteins

Bioimpacts ◽ 10.34172/bi.2021.37 ◽ 2020 ◽ Vol 11 (4) ◽ pp. 271-279

Author(s): Soraya Mirzaei ◽ Jafar Razmara ◽ Shahriar Lotfi

Keyword(s): Genetic Algorithm ◽ Dynamic Programming ◽ Protein Structure ◽ Hybrid Method ◽ Structural Alignment ◽ Dynamic Programming Algorithm ◽ Structure Alignment ◽ Programming Algorithm ◽ Programming Technique ◽ Iterative Dynamic Programming

Introduction: Similarity analysis of protein structures is a fundamental step in gaining insight into the relationships between proteins. The primary step in structural alignment is finding the correspondence between residues of two structures that optimizes the scoring function. An exhaustive search for such a correspondence is intractable. Methods: In this paper, a hybrid method named GADP-align is proposed for pairwise protein structure alignment. The method searches for an optimal alignment by combining a genetic algorithm with an iterative dynamic programming technique. To this end, it first creates an initial map of correspondence between secondary structure elements (SSEs) of the two proteins. Then, the genetic algorithm combined with the iterative dynamic programming algorithm is employed to optimize the alignment. Results: GADP-align was used to align 10 ‘difficult to align’ protein pairs in order to evaluate its performance. The experimental study shows that the proposed hybrid method produces highly accurate alignments in comparison with methods that rely on the dynamic programming technique alone. Furthermore, the proposed method avoids the local-optimum traps caused by an unsuitable initial guess of the corresponding residues. Conclusion: The findings of this paper demonstrate that employing the genetic algorithm along with the dynamic programming technique yields highly accurate alignments between a protein pair by exploring the global alignment and avoiding being trapped in local alignments.

MADOKA: an ultra-fast approach for large-scale protein structure similarity searching

BMC Bioinformatics ◽ 10.1186/s12859-019-3235-1 ◽ 2019 ◽ Vol 20 (S19) ◽ Cited By ~ 2

Author(s): Lei Deng ◽ Guolun Zhong ◽ Chenzhe Liu ◽ Judong Luo ◽ Hui Liu

Keyword(s): Protein Structure ◽ Large Scale ◽ Parallel Implementation ◽ Structural Alignment ◽ 3D Structure ◽ Web Server ◽ Structure Alignment ◽ Two Phase ◽ Fast Approach ◽ Structure Similarity

Background: Protein comparative analysis and similarity searches play essential roles in structural bioinformatics. Several algorithms for protein structure alignment have been developed in recent years. However, given the rapid growth of protein structure data, improving overall comparison performance and running efficiency on massive datasets is still challenging. Results: Here, we propose MADOKA, an ultra-fast approach for massive structural neighbor searching using a novel two-phase algorithm. Initially, we apply a fast alignment between pairwise structures. Then, we employ a score to select the more similar pairs and carry out a more accurate fragment-based residue-level alignment on them. MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the quality of MADOKA's structural alignments is better than that of the existing algorithms in terms of TM-score and number of aligned residues. We also develop a web server to search structural neighbors in the PDB database (about 360,000 protein chains in total), with additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://madoka.denglab.org/ Conclusions: MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA which exploits the power of multi-core CPUs.

Optimal Pairwise Alignment of Fixed Protein Structures in Subquadratic Time

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720011005562 ◽ 2011 ◽ Vol 09 (03) ◽ pp. 367-382 ◽ Cited By ~ 7

Author(s): Aleksandar Poleksic

Keyword(s): Protein Structure ◽ Structural Alignment ◽ Protein Structures ◽ Dynamic Programming Algorithm ◽ Pairwise Alignment ◽ Structure Alignment ◽ Programming Algorithm ◽ Protein Structure Alignment ◽ Running Time ◽ Speed Accuracy

The problem of finding an optimal structural alignment for a pair of superimposed proteins is often amenable to the Smith–Waterman dynamic programming algorithm, which runs in time proportional to the product of lengths of the sequences being aligned. While the quadratic running time is acceptable for computing a single alignment of two fixed protein structures, the time complexity becomes a bottleneck when running the Smith–Waterman routine multiple times in order to find a globally optimal superposition and alignment of the input proteins. We present a subquadratic running time algorithm capable of computing an alignment that optimizes one of the most widely used measures of protein structure similarity, defined as the number of pairs of residues in two proteins that can be superimposed under a predefined distance cutoff. The algorithm presented in this article can be used to significantly improve the speed–accuracy tradeoff in a number of popular protein structure alignment methods.
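
For orientation, the similarity measure described above (the number of residue pairs that lie within a distance cutoff once the two structures are superimposed) can be computed with a plain quadratic dynamic program. The sketch below is that quadratic baseline, not the subquadratic algorithm of the article; it assumes the coordinates are already superimposed and that aligned pairs must preserve sequence order. The function name and the example coordinates are illustrative.

```python
import math

def max_pairs_within_cutoff(coords_a, coords_b, cutoff):
    """Largest order-preserving set of residue pairs closer than `cutoff`.

    coords_a, coords_b: lists of (x, y, z) tuples, already superimposed.
    Runs in O(len(coords_a) * len(coords_b)) time.
    """
    n, m = len(coords_a), len(coords_b)
    # best[i][j] = best count using the first i residues of A and first j of B
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = math.dist(coords_a[i - 1], coords_b[j - 1]) <= cutoff
            best[i][j] = max(
                best[i - 1][j],                           # skip residue i of A
                best[i][j - 1],                           # skip residue j of B
                best[i - 1][j - 1] + (1 if close else 0)  # pair them up
            )
    return best[n][m]

# Illustrative coordinates along a line, in angstroms.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(0.5, 0.0, 0.0), (4.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
paired = max_pairs_within_cutoff(structure_a, structure_b, cutoff=1.0)
```

Because a routine like this is rerun for every candidate superposition, its quadratic cost per call is what becomes the bottleneck the abstract refers to.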

Metallothionein: Protein structure prediction and sequence analyses in pigeon pea (Cajanus cajan)

Journal of AgriSearch ◽ 10.21921/jas.v4i04.10209 ◽ 2017 ◽ Vol 4 (04)

Author(s): Sakshi Chaudhary ◽ Anil Kumar Singh ◽ Jeshima Khan Yasin

Keyword(s): Protein Structure ◽ Metal Binding ◽ Structure Prediction ◽ Structural Similarity ◽ Pigeon Pea ◽ Structural Features ◽ Structural Domain ◽ Small Proteins ◽ Heavy Metal Binding ◽ Entire Sequence

Metallothioneins are a special group of small proteins capable of detoxifying non-essential metal ions present in excess within a plant cell. They form diverse classes of cysteine-rich, heavy metal binding proteins that are essential for plant growth. These proteins are present in all taxa, except eubacteria. Similarity between protein sequences provides a basis for predicting the structural features of a protein from those of a known protein structure. Structural similarity of an entire sequence or a large sequence fragment enables prediction and modeling of an entire structural domain, while the distribution of local features in a known protein structure makes it possible to predict such features in the structures of unknown or uncharacterised proteins. In this study, a metallothionein of pigeon pea was identified from available genomic resources, and its structure was predicted and validated. We present a step-wise methodology to model a given protein and to validate the resulting structures.

Combining electronic and structural features in machine learning models to predict organic solar cells properties

Materials Horizons ◽ 10.1039/c8mh01135d ◽ 2019 ◽ Vol 6 (2) ◽ pp. 343-349 ◽ Cited By ~ 39

Author(s): Daniele Padula ◽ Jack D. Simpson ◽ Alessandro Troisi

Keyword(s): Machine Learning ◽ Solar Cells ◽ Organic Solar Cells ◽ Structural Similarity ◽ Structural Features ◽ Learning Models ◽ Learning Methods ◽ Machine Learning Methods ◽ Machine Learning Models

Combining electronic and structural similarity between organic donors in kernel-based machine learning methods allows photovoltaic efficiencies to be predicted reliably.

Predicting Protein Thermostability Upon Mutation Using Molecular Dynamics Timeseries Data

10.1101/078246 ◽ 2016 ◽ Cited By ~ 1

Author(s): Noah Fleming ◽ Benjamin Kinsella ◽ Christopher Ing

Keyword(s): Neural Network ◽ Machine Learning ◽ Molecular Dynamics ◽ Protein Structure ◽ Protein Stability ◽ Recurrent Neural Network ◽ Molecular Basis ◽ Sequence Data ◽ Structural Features ◽ Machine Learning Algorithms

A large number of human diseases result from disruptions to protein structure and function caused by missense mutations. Computational methods are frequently employed to assist in the prediction of protein stability upon mutation. These methods utilize a combination of protein sequence data, protein structure data, empirical energy functions, and physicochemical properties of amino acids. In this work, we present the first use of dynamic protein structural features in order to improve stability predictions upon mutation. This is achieved through the use of a set of timeseries extracted from microsecond timescale atomistic molecular dynamics simulations of proteins. Standard machine learning algorithms using mean, variance, and histograms of these timeseries were found to be 60-70% accurate in stability classification based on experimental ΔΔG or protein-chaperone interaction measurements. A recurrent neural network with full treatment of timeseries data was found to be 80% accurate according to the F1 score. The performance of our models was found to be equal to or better than that of two recently developed machine learning methods for binary classification as well as two industry-standard stability prediction algorithms. In addition to classification, understanding the molecular basis of protein stability disruption due to disease-causing mutations is a significant challenge that impedes the development of drugs and therapies that may be used to treat genetic diseases. The use of dynamic structural features allows for novel insight into the molecular basis of protein disruption by mutation in a diverse set of soluble proteins. To assist in the interpretation of machine learning results, we present a technique for determining the importance of features to a recurrent neural network using Garson’s method. We propose a novel extension of neural interpretation diagrams by implementing Garson’s method to scale each node in the neural interpretation diagram according to its relative importance to the network.
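
To make the idea of weight-based feature importance concrete, here is a minimal sketch of one commonly cited form of Garson's algorithm for a feed-forward network with a single hidden layer and a single output. It is not the authors' recurrent-network variant, which the abstract does not detail; the function name, weight layout, and example weights are assumptions made for illustration.

```python
def garson_importance(w_in, w_out):
    """Relative importance of each input, from absolute connection weights.

    w_in[i][j]: weight from input i to hidden unit j (assumed layout).
    w_out[j]:   weight from hidden unit j to the single output.
    Returns one value per input; the values sum to 1.
    """
    n_inputs = len(w_in)
    scores = [0.0] * n_inputs
    for j, v in enumerate(w_out):
        # Total absolute weight flowing into hidden unit j.
        column_total = sum(abs(w_in[i][j]) for i in range(n_inputs))
        if column_total == 0.0:
            continue  # hidden unit receives no signal from any input
        for i in range(n_inputs):
            # Input i's share of hidden unit j, scaled by j's output weight.
            scores[i] += abs(w_in[i][j]) / column_total * abs(v)
    total = sum(scores)
    if total == 0.0:
        return [0.0] * n_inputs
    return [s / total for s in scores]

# Illustrative weights: two inputs, two hidden units.
importance = garson_importance(
    w_in=[[1.0, 2.0], [3.0, 2.0]],
    w_out=[0.5, -1.0],
)
```

Because only absolute values are used, the result ranks inputs by magnitude of influence and says nothing about the direction of their effect.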

Biological and In silico Studies on Synthetic Analogues of Tyrosine Betaine as Inhibitors of Neprilysin - A Drug Target for the Treatment of Heart Failure

Current Pharmaceutical Design ◽ 10.2174/1381612824666180515114236 ◽ 2018 ◽ Vol 24 (17) ◽ pp. 1899-1904

Author(s): Daniel Fabio Kawano ◽ Marcelo Rodrigues de Carvalho ◽ Mauricio Ferreira Marcondes Machado ◽ Adriana Karaoglanovic Carmona ◽ Gilberto Ubida Leite Braga ◽ ...

Keyword(s): Heart Failure ◽ Secondary Metabolites ◽ Structural Similarity ◽ Structural Features ◽ Major Constituent ◽ Binding Modes ◽ Starting Point ◽ In Silico Studies ◽ Nep Inhibition

Background: Fungal secondary metabolites are important sources for the discovery of new pharmaceuticals, as exemplified by penicillin, lovastatin and cyclosporine. Searching for secondary metabolites of the fungi Metarhizium spp., we previously identified tyrosine betaine as a major constituent. Methods: Because of its structural similarity with other inhibitors of neprilysin (NEP), an enzyme explored for the treatment of heart failure, we devised the synthesis of tyrosine betaine and three analogues to be subjected to in vitro NEP inhibition assays and to molecular modeling studies. Results: In spite of having binding modes similar to those of other NEP inhibitors, these compounds displayed only moderate inhibitory activities (IC50 ranging from 170.0 to 52.9 µM). However, they contain the structural features required to hinder passive blood-brain barrier (BBB) permeation. Conclusions: Tyrosine betaine remains a starting point for the development of NEP inhibitors because of the low probability of BBB permeation and, consequently, of NEP inhibition in the central nervous system, which is associated with an increase in Aβ levels and, accordingly, with a higher risk for the onset of Alzheimer's disease.

Protein Inter-Residue Contacts Prediction: Methods, Performances and Applications

Current Bioinformatics ◽ 10.2174/1574893613666181109130430 ◽ 2019 ◽ Vol 14 (3) ◽ pp. 178-189 ◽ Cited By ~ 3

Author(s): Xiaoyang Jing ◽ Qimin Dong ◽ Ruqian Lu ◽ Qiwen Dong

Keyword(s): Machine Learning ◽ Protein Structure ◽ Tertiary Structure ◽ Prediction Methods ◽ Learning Methods ◽ Typical Application ◽ Machine Learning Methods ◽ Residue Contacts ◽ Fusion Methods ◽ Correlated Mutations

Background: Protein inter-residue contact prediction plays an important role in the field of protein structure and function research. As a low-dimensional representation of protein tertiary structure, protein inter-residue contacts can greatly help de novo protein structure prediction methods to reduce the conformational search space. Over the past two decades, various methods have been developed for protein inter-residue contact prediction. Objective: We provide a comprehensive and systematic review of protein inter-residue contact prediction methods. Results: Protein inter-residue contact prediction methods are roughly classified into five categories: correlated mutations methods, machine-learning methods, fusion methods, template-based methods and 3D model-based methods. In this paper, we first describe the common definition of protein inter-residue contacts and show their typical application. Then, we present a comprehensive review of the three main categories of protein inter-residue contact prediction: correlated mutations methods, machine-learning methods and fusion methods. In addition, we analyze the constraints of each category. Furthermore, we compare several representative methods on the CASP11 dataset and discuss the performance of these methods in detail. Conclusion: Correlated mutations methods achieve better performance for long-range contacts, while machine-learning methods perform well for short-range contacts. Fusion methods can take advantage of both the machine-learning and correlated mutations methods. Employing more effective fusion strategies could help to further improve the performance of fusion methods.
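
The abstract does not spell out the contact definition it reviews. A widely used convention, assumed here, is that two residues are in contact when their Cβ atoms are closer than 8 Å, with pairs too close in sequence ignored. The sketch below applies that rule to a list of coordinates; the function name, the default thresholds, and the toy hairpin coordinates are illustrative assumptions.

```python
import math

def contact_pairs(cb_coords, threshold=8.0, min_separation=6):
    """Return residue index pairs (i, j) regarded as contacts.

    cb_coords: list of (x, y, z) C-beta positions in angstroms.
    threshold: distance below which two residues count as in contact
               (8.0 is an assumed, commonly used value).
    min_separation: minimum |i - j| so that trivially close
                    sequence neighbours are skipped.
    """
    pairs = []
    for i in range(len(cb_coords)):
        for j in range(i + min_separation, len(cb_coords)):
            if math.dist(cb_coords[i], cb_coords[j]) < threshold:
                pairs.append((i, j))
    return pairs

# Toy hairpin: four residues out along x, four back, 5 angstroms apart in y.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
contacts = contact_pairs(hairpin)
```

The resulting pair list is the low-dimensional representation the abstract mentions: it records which residues are near each other without storing full 3D coordinates.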

Descriptors of Cytochrome Inhibitors and Useful Machine Learning Based Methods for the Design of Safer Drugs

Pharmaceuticals ◽ 10.3390/ph14050472 ◽ 2021 ◽ Vol 14 (5) ◽ pp. 472

Author(s): Tyler C. Beck ◽ Kyle R. Beck ◽ Jordan Morningstar ◽ Menny M. Benjamin ◽ Russell A. Norris

Keyword(s): United States ◽ Machine Learning ◽ Drug Interactions ◽ The United States ◽ Structural Features ◽ Physiochemical Properties ◽ Drug Dosing ◽ Therapeutic Outcomes ◽ Cyp Inhibition ◽ Cyp Inhibitors

Roughly 2.8% of annual hospitalizations are a result of adverse drug interactions in the United States, representing more than 245,000 hospitalizations. Drug–drug interactions commonly arise from major cytochrome P450 (CYP) inhibition. Various approaches are routinely employed in order to reduce the incidence of adverse interactions, such as altering drug dosing schemes and/or minimizing the number of drugs prescribed; however, often, a reduction in the number of medications cannot be achieved without impacting therapeutic outcomes. Nearly 80% of drugs fail in development due to pharmacokinetic issues, outlining the importance of examining cytochrome interactions during preclinical drug design. In this review, we examined the physiochemical and structural properties of small molecule inhibitors of CYPs 3A4, 2D6, 2C19, 2C9, and 1A2. Although CYP inhibitors tend to have distinct physiochemical properties and structural features, these descriptors alone are insufficient to predict major cytochrome inhibition probability and affinity. Machine learning based in silico approaches may be employed as a more robust and accurate way of predicting CYP inhibition. These various approaches are highlighted in the review.

Towards effective link prediction: A hybrid similarity model

Journal of Intelligent & Fuzzy Systems ◽ 10.3233/jifs-200344 ◽ 2020 ◽ pp. 1-14

Author(s): Longjie Li ◽ Lu Wang ◽ Hongsheng Luo ◽ Xiaoyun Chen

Keyword(s): Link Prediction ◽ Structural Similarity ◽ Research Direction ◽ Structural Features ◽ Proposed Model ◽ Similarity Model ◽ Weight Calculation ◽ Stable Performance ◽ Grey Relation ◽ Important Research Direction

Link prediction is an important research direction in complex network analysis and has drawn increasing attention from researchers in various fields. So far, a plethora of structural similarity-based methods have been proposed to solve the link prediction problem. To achieve stable performance on different networks, this paper proposes a hybrid similarity model to conduct link prediction. In the proposed model, the Grey Relation Analysis (GRA) approach is employed to integrate four carefully selected similarity indexes, which are designed according to different structural features. In addition, to adaptively estimate the weight for each index based on the observed network structures, a new weight calculation method is presented that considers the distribution of similarity scores. Because it takes separate similarity indexes into account, the proposed method is applicable to multiple different types of network. Experimental results show that the proposed method outperforms other prediction methods in terms of accuracy and stability on 10 benchmark networks.
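
As a simplified picture of combining several structural similarity indexes, the sketch below scores unlinked node pairs with two standard indexes (common neighbours and the Jaccard coefficient) and merges them with a fixed weighted sum. This is a stand-in for illustration only: the paper's four indexes are not named in the abstract, and the weighted sum replaces its Grey Relation Analysis integration and adaptive weighting. The graph, weights, and function names are hypothetical.

```python
from itertools import combinations

def common_neighbors(adj, u, v):
    """Number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbours divided by all neighbours of either node."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def rank_candidate_links(adj, weights=(0.5, 0.5)):
    """Score every unlinked pair and return them best first."""
    candidates = [
        (u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]
    ]
    # Scale common-neighbour counts to [0, 1] so both indexes are comparable.
    max_cn = max((common_neighbors(adj, u, v) for u, v in candidates), default=0)
    scored = []
    for u, v in candidates:
        cn = common_neighbors(adj, u, v) / max_cn if max_cn else 0.0
        score = weights[0] * cn + weights[1] * jaccard(adj, u, v)
        scored.append(((u, v), score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Small undirected example graph as an adjacency mapping of sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranking = rank_candidate_links(graph)
```

Here the pair ("a", "d") ranks first because it shares two neighbours, which matches the intuition that nodes with many mutual neighbours are likely to become linked.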
