LYRUS: A Machine Learning Model for Predicting the Pathogenicity of Missense Variants

2021 ◽  
Author(s):  
Jiaying Lai ◽  
Jordan Yang ◽  
Ece D Uzun ◽  
Brenda Rubenstein ◽  
Indra Neil Sarkar

Single amino acid variations (SAVs) are a primary contributor to variations in the human genome. Identifying pathogenic SAVs can aid in the diagnosis and understanding of the genetic architecture of complex diseases, such as cancer. Most approaches for predicting the functional effects or pathogenicity of SAVs rely on either sequence or structural information. Nevertheless, previous analyses have shown that methods that depend on only sequence or structural information may have limited accuracy. Recently, researchers have attempted to increase the accuracy of their predictions by incorporating protein dynamics into pathogenicity predictions. This study presents Lai Yang Rubenstein Uzun Sarkar (LYRUS), a machine learning method that uses an XGBoost classifier selected by TPOT to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based features, six structure-based features, and four dynamics-based features. Uniquely, LYRUS includes a newly proposed sequence co-evolution feature called variation number. LYRUS's performance was evaluated using a dataset that contains 4,363 protein structures corresponding to 20,307 SAVs based on human genetic variant data from the ClinVar database. Based on our dataset, the LYRUS classifier has higher accuracy, specificity, F-measure, and Matthews correlation coefficient (MCC) than alternative methods, including PolyPhen2, PROVEAN, SIFT, Rhapsody, EVMutation, MutationAssessor, SuSPect, FATHMM, and MVP. Variation numbers used within LYRUS differ greatly between pathogenic and neutral SAVs and have a high feature weight in the XGBoost classifier employed by this method. Applications of the method to PTEN and TP53 further corroborate LYRUS's strong performance. LYRUS is freely available and the source code can be found at https://github.com/jiaying2508/LYRUS.
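The abstract does not include code, but the training setup it describes (an XGBoost classifier over 15 per-variant features) can be sketched roughly as follows. The file name, column layout, and hyperparameters are assumptions for illustration, not the LYRUS implementation, and TPOT's automated pipeline search is omitted.

```python
# Illustrative sketch (not the authors' code): train an XGBoost classifier on a
# 15-dimensional feature matrix of the kind LYRUS describes (5 sequence-,
# 6 structure-, and 4 dynamics-based features per SAV).
# The CSV layout and column names below are hypothetical.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef

df = pd.read_csv("sav_features.csv")            # hypothetical file: one row per SAV
X = df.drop(columns=["label"]).values            # 15 feature columns
y = df["label"].values                           # 1 = pathogenic, 0 = neutral

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("MCC:", matthews_corrcoef(y_te, pred))
print("feature importances:", clf.feature_importances_)   # e.g. weight of variation number
```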

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Truong Khanh Linh Dang ◽  
Thach Nguyen ◽  
Michael Habeck ◽  
Mehmet Gültas ◽  
Stephan Waack

Background: Conformational transitions are implicated in the biological function of many proteins. Structural changes in proteins can be described approximately as the relative movement of rigid domains against each other. Despite previous efforts, there is a need for new domain segmentation algorithms that can analyse the entire structure database efficiently and do not require protein-dependent tuning parameters such as the number of rigid domains.
Results: We develop a graph-based method for detecting rigid domains in proteins. Structural information from multiple conformational states is represented by a graph whose nodes correspond to amino acids. Graph clustering algorithms allow us to reduce the graph and run the Viterbi algorithm on the associated line graph to obtain a segmentation of the input structures into rigid domains. In contrast to many alternative methods, our approach does not require knowledge of the number of rigid domains. Moreover, we identified default values for the algorithmic parameters that are suitable for a large number of conformational ensembles. We test our algorithm on examples from the DynDom database and illustrate our method on various challenging systems whose structural transitions have been studied extensively.
Conclusions: The results strongly suggest that our graph-based algorithm forms a novel framework for characterizing structural transitions in proteins by detecting their rigid domains. The web server is available at http://azifi.tz.agrar.uni-goettingen.de/webservice/.
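As a rough illustration of the graph construction described above (nodes are residues, edges favour pairs whose distance is preserved across conformations), the sketch below builds such a graph with assumed cutoff and weighting choices and clusters it. The published method instead runs the Viterbi algorithm on the associated line graph, which is not reproduced here; a generic modularity clustering stands in.

```python
# Illustrative sketch (not the authors' algorithm): residue graph whose edge
# weights reward pairs of residues that keep a nearly constant distance across
# two conformations, followed by a stand-in graph clustering step.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def residue_graph(coords_a, coords_b, cutoff=10.0):
    """coords_a, coords_b: (N, 3) C-alpha coordinates of two conformations."""
    d_a = np.linalg.norm(coords_a[:, None] - coords_a[None, :], axis=-1)
    d_b = np.linalg.norm(coords_b[:, None] - coords_b[None, :], axis=-1)
    g = nx.Graph()
    n = len(coords_a)
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if min(d_a[i, j], d_b[i, j]) < cutoff:           # spatial neighbours only
                rigidity = 1.0 / (1.0 + abs(d_a[i, j] - d_b[i, j]))
                g.add_edge(i, j, weight=rigidity)             # high weight = moves together
    return g

# toy example: two synthetic "conformations" of a 50-residue chain
rng = np.random.default_rng(0)
conf_a = rng.normal(size=(50, 3)) * 5
conf_b = conf_a + rng.normal(scale=0.5, size=(50, 3))
g = residue_graph(conf_a, conf_b)
domains = greedy_modularity_communities(g, weight="weight")
print([sorted(d) for d in domains])                          # candidate rigid groups
```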


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10381
Author(s):  
Rohit Nandakumar ◽  
Valentin Dinu

Throughout the history of drug discovery, an enzymatic-based approach for identifying new drug molecules has been primarily utilized. Recently, researchers have identified protein–protein interfaces that can be disrupted by small molecules, making them viable targets for certain diseases, such as cancer and the human immunodeficiency virus. Existing studies computationally identify hotspots on these interfaces, with most models attaining accuracies of ~70%. However, many studies do not effectively integrate information about amino acid chains and other structural features of the complex. Herein, (1) a machine learning model has been created and (2) its ability to integrate multiple features, such as those associated with amino-acid chains, has been evaluated to enhance the prediction of protein–protein interface hotspots. Virtual drug screening analysis of a set of hotspots determined on the EphB2-ephrinB2 complex has also been performed. The model achieves an AUROC of 0.842, sensitivity/recall of 0.833, and specificity of 0.850. Virtual screening of a set of hotspots identified by the machine learning model developed in this study has identified potential medications to treat diseases caused by the overexpression of the EphB2-ephrinB2 complex, including prostate, gastric, colorectal, and melanoma cancers, which are linked to EphB2 mutations. The efficacy of this model is demonstrated by its ability to recover drug–disease associations previously reported in the literature, including cimetidine, idarubicin, and pralatrexate for these conditions. In addition, this study identified nadolol, a beta blocker, as binding to the EphB2-ephrinB2 complex; the possibility of this drug treating multiple cancers remains relatively unexplored.
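For reference, the evaluation metrics quoted above (AUROC, sensitivity/recall, specificity) for a binary hotspot classifier can be computed as follows; the labels and scores are synthetic placeholders rather than the study's predictions.

```python
# Illustrative sketch: metrics for a binary interface-hotspot classifier.
import numpy as np
from sklearn.metrics import roc_auc_score, recall_score, confusion_matrix

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])        # 1 = hotspot residue (synthetic)
y_score = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3, 0.85, 0.45])
y_pred  = (y_score >= 0.5).astype(int)

auroc = roc_auc_score(y_true, y_score)
sensitivity = recall_score(y_true, y_pred)                 # TP / (TP + FN)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                               # TN / (TN + FP)

print(f"AUROC={auroc:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```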


2021 ◽  
Vol 15 (8) ◽  
pp. 878-888
Author(s):  
Yang Liu ◽  
Xia-hui Ouyang ◽  
Zhi-Xiong Xiao ◽  
Le Zhang ◽  
Yang Cao

Background: T lymphocytes achieve an immune response by recognizing antigen peptides (also known as T cell epitopes) through major histocompatibility complex (MHC) molecules. The immunogenicity of T cell epitopes depends on their source and the stability of their binding to MHC molecules. Binding of the peptide to MHC is the most selective step, so predicting the binding affinity of the peptide to MHC is the principal step in predicting T cell epitopes. The identification of epitopes is of great significance for vaccine design and research on the T cell immune response.
Objective: The traditional way to identify epitopes is to synthesize peptides and test their binding activity experimentally, which is both time-consuming and expensive. In silico methods for predicting peptide-MHC binding have emerged to pre-select candidate peptides for experimental testing, which greatly saves time and cost. By summarizing and analyzing these methods, we hope to provide better insight and guidance for future directions.
Methods: To date, a number of methods have been developed to predict the binding ability of peptides to MHC based on various principles. Some employ matrix models or machine learning models based on sequence characteristics of the peptides or MHC to predict binding ability. Others utilize three-dimensional structural information of peptides or MHC, for example by extracting structural features to construct a feature matrix or machine learning model, or by directly using protein structure prediction and molecular docking to predict the binding mode of peptides and MHC.
Results: Although methods that predict peptide-MHC binding from a feature matrix or machine learning model allow high-throughput prediction, their accuracy depends heavily on the sequence characteristics of confirmed binding peptides, and they cannot provide insight into the mechanism of antigen specificity. Such methods therefore have certain limitations in practical applications. Methods based on structure prediction or molecular docking are computationally intensive compared with feature-matrix or machine learning approaches, and the challenge is how to obtain a reliable structural model.
Conclusion: This paper reviews the principles, advantages, and disadvantages of peptide-MHC binding prediction methods and discusses future directions toward more accurate predictions.
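As a minimal illustration of the matrix-model family mentioned in the Methods, the sketch below scores 9-mer peptides against a position-specific scoring matrix for a single hypothetical MHC allele; the matrix entries are random placeholders, not trained binding weights.

```python
# Illustrative sketch of a matrix model: score a 9-mer peptide against a
# position-specific scoring matrix (PSSM) for one hypothetical MHC allele.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_LEN = 9

rng = np.random.default_rng(42)
pssm = rng.normal(size=(PEPTIDE_LEN, len(AMINO_ACIDS)))    # rows: positions, cols: residues

def score_peptide(peptide, pssm):
    """Sum the per-position weights for each residue of the peptide."""
    assert len(peptide) == PEPTIDE_LEN
    return sum(pssm[pos, AMINO_ACIDS.index(aa)] for pos, aa in enumerate(peptide))

candidates = ["SIINFEKLA", "GILGFVFTL", "AAAAAAAAA"]
ranked = sorted(candidates, key=lambda p: score_peptide(p, pssm), reverse=True)
for p in ranked:
    print(p, round(score_peptide(p, pssm), 3))             # higher score = stronger predicted binder
```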


2021 ◽  
Author(s):  
Philip Gauglitz ◽  
David Geiger ◽  
Jan Ulffers ◽  
Evamaria Zauner

Considering climate change, it is essential to reduce CO₂ emissions. The provision of charging infrastructure in public spaces for electromobility, along with the substitution of conventional power generation with renewable energies, can contribute to the energy transition in the transport sector. Scenarios for the spatial distribution of this charging infrastructure can help to exemplify the need for charging points and their impact, for example, on power grids. We present an approach based both on the usage frequency of points of interest (POIs) and on the need for charging points in residential areas. This approach is validated in several steps and compared with alternative methods, such as a machine learning model trained with existing charging point utilization data.

Our approach uses two drivers to model the demand for public charging infrastructure. The first driver represents the demand for more charging stations to compensate for the lack of home charging stations and is derived from a previously developed and published model addressing electric-vehicle ownership (with and without home charging options) in households. The second driver represents the demand for public charging infrastructure at POIs. Their locations are derived from Open Street Map (OSM) data and weighted based on an evaluation of movement profiles from the Mobilität in Deutschland survey (MiD, German for "Mobility in Germany"). We combine those two drivers with the available parking spaces and generate distributions for possible future charging points. For computational efficiency and speed, we use a raster-based approach in which all vector data is rasterized and computations are performed on the full grid of a municipality. The presented application area is Wiesbaden, Germany, and the methodology is generally applicable to municipalities in Germany.

The method is compared and validated with alternative approaches on several levels. First, the allocation of parking space based on the raster calculation is validated against parking space numbers available in OSM. Second, the modeling of charging points supposed to compensate for the lack of home charging opportunities is contrasted with a simplified procedure by means of an analysis of multifamily housing density. In the third validation step, the method is compared to an existing machine learning model that estimates spatial suitability for charging stations. This model is trained with numerous input datasets such as population density and POIs on the one hand and utilization data of existing charging stations on the other hand. The objective of these comparisons is both to generally verify our model's validity and to investigate the relative influence of specific components of the model.

The identification of potential charging points in public spaces plays an important role in modeling the future energy system, especially the power grid, as the rapid adoption of electric vehicles will shift locations of demand for electricity. With our investigation, we want to present a new method to simulate future public charging point locations and show the influences of different modeling methods.
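A minimal sketch of the raster-combination step described above might look like the following; the grid size, driver weights, and demand rasters are synthetic assumptions, not the study's data or exact procedure.

```python
# Illustrative sketch: weight and sum two demand rasters (home-charging gap,
# POI-based demand), restrict to cells with parking, and pick the
# highest-scoring cells as candidate public charging points.
import numpy as np

rng = np.random.default_rng(1)
shape = (100, 100)                               # raster grid over a municipality

home_gap_demand = rng.random(shape)              # driver 1: missing home charging options
poi_demand      = rng.random(shape)              # driver 2: weighted POI usage frequency
parking_spaces  = rng.integers(0, 5, size=shape) # rasterized parking availability

w_home, w_poi = 0.6, 0.4                         # hypothetical driver weights
demand = w_home * home_gap_demand + w_poi * poi_demand
demand[parking_spaces == 0] = 0.0                # no parking, no charging point

n_points = 50                                    # charging points to place
flat_idx = np.argsort(demand, axis=None)[::-1][:n_points]
rows, cols = np.unravel_index(flat_idx, shape)
print(list(zip(rows[:5], cols[:5])))             # top 5 candidate cells
```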


RSC Advances ◽  
2020 ◽  
Vol 10 (28) ◽  
pp. 16607-16615
Author(s):  
Zhao Qin ◽  
Qingyi Yu ◽  
Markus J. Buehler

Natural vibrations and resonances are intrinsic features of protein structures and can be learnt from existing structures.


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Cody Kunka ◽  
Apaar Shanker ◽  
Elton Y. Chen ◽  
Surya R. Kalidindi ◽  
Rémi Dingreville

Diffraction techniques can powerfully and nondestructively probe materials while maintaining high resolution in both space and time. Unfortunately, these characterizations have been limited and sometimes even erroneous due to the difficulty of decoding the desired material information from features of the diffractograms. Currently, these features are identified non-comprehensively via human intuition, so the resulting models can only predict a subset of the available structural information. In the present work we show (i) how to compute machine-identified features that fully summarize a diffractogram and (ii) how to employ machine learning to reliably connect these features to an expanded set of structural statistics. To exemplify this framework, we assessed virtual electron diffractograms generated from atomistic simulations of irradiated copper. When based on machine-identified features rather than human-identified features, our machine-learning model not only predicted one-point statistics (i.e. density) but also a two-point statistic (i.e. spatial distribution) of the defect population. Hence, this work demonstrates that machine-learning models that input machine-identified features significantly advance the state of the art for accurately and robustly decoding diffractograms.
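As a loose illustration of the idea of machine-identified features, the sketch below compresses synthetic diffractograms with PCA and regresses a one-point statistic from the component scores; the actual feature set, model, and data of the study are not reproduced.

```python
# Illustrative sketch: summarize each diffractogram with principal-component
# scores (machine-identified features) and regress a structural statistic
# (e.g. defect density) from them. All data here are synthetic placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
diffractograms = rng.random((200, 64 * 64))      # 200 flattened 64x64 synthetic patterns
defect_density = diffractograms.mean(axis=1) + rng.normal(scale=0.01, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(diffractograms, defect_density,
                                          test_size=0.25, random_state=0)

pca = PCA(n_components=20).fit(X_tr)             # machine-identified features
reg = Ridge().fit(pca.transform(X_tr), y_tr)     # map features to the statistic

pred = reg.predict(pca.transform(X_te))
print("R^2 on held-out patterns:", round(r2_score(y_te, pred), 3))
```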


2018 ◽  
Vol 30 (06) ◽  
pp. 1850041
Author(s):  
Thakerng Wongsirichot ◽  
Anantaporn Hanskunatai

Sleep Stage Classification (SSC) is a standard process in Polysomnography (PSG) for studying sleep patterns and events. The SSC provides sleep stage information for a patient throughout an entire sleep test. A physician uses results from SSCs to diagnose sleep disorder symptoms. However, SSC data processing is time-consuming and requires trained sleep technicians to complete the task. Over the years, researchers have attempted to find alternative methods, known as Automatic Sleep Stage Classification (ASSC), to perform the task faster and more efficiently. Proposed ASSC techniques are usually derived from existing statistical methods and machine learning (ML) techniques. The objective of this study is to develop a new hybrid ASSC technique, the Multi-Layer Hybrid Machine Learning Model (MLHM), for classifying sleep stages. The MLHM blends two baseline ML techniques, Decision Tree (DT) and Support Vector Machine (SVM), and operates on a newly developed multi-layer architecture consisting of three layers that classify groups of sleep stages at different epoch lengths. Our experiment design compares the MLHM with baseline ML techniques and other research works. The dataset used in this study was derived from the ISRUC-Sleep database, comprising 100 subjects. The classification performances were thoroughly reviewed using the hold-out and 10-fold cross-validation methods in both subject-specific and subject-independent classifications. The MLHM achieved satisfactory classification results: an accuracy of 0.694 ± 0.22 in subject-specific classification and 0.942 ± 0.02 in subject-independent classification. The pros and cons of the MLHM with the multi-layer architecture are thoroughly discussed, along with the effect of class imbalance on the classification results.
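A hybrid of a Decision Tree and an SVM evaluated with 10-fold cross-validation, in the spirit of (but not identical to) the MLHM described above, could be sketched as follows; the per-epoch features and labels are synthetic, and the multi-layer architecture is not reproduced.

```python
# Illustrative sketch: stack a Decision Tree and an SVM over per-epoch features
# and score with 10-fold cross-validation, the evaluation protocol mentioned above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 12))                  # hypothetical per-epoch features
y = rng.integers(0, 5, size=1000)                # 5 synthetic sleep-stage labels

hybrid = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=8)),
                ("svm", SVC(kernel="rbf", probability=True))],
    final_estimator=LogisticRegression(max_iter=1000),
)

scores = cross_val_score(hybrid, X, y, cv=10)    # 10-fold cross-validation
print("mean accuracy:", round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```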

