scholarly journals A Random Forest Classifier for Protein-Protein Docking Models

2021 ◽  
Author(s):  
Didier Barradas-Bautista ◽  
Zhen Cao ◽  
Anna Vangone ◽  
Romina Oliva ◽  
Luigi Cavallo

Herein, we present the results of a machine learning approach we developed to single out correct 3D docking models of protein-protein complexes obtained by popular docking software. To this aim, we generated a set of ~7xE06 docking models with three different docking programs (HADDOCK, FTDock and ZDOCK) for the 230 complexes in the protein-protein interaction benchmark, version 5 (BM5). Three different machine-learning approaches (Random Forest, Supported Vector Machine and Perceptron) were used to train classifiers with 158 different scoring functions (features). The Random Forest algorithm outperformed the other two algorithms and was selected for further optimization. Using a features selection algorithm, and optimizing the random forest hyperparameters, allowed us to train and validate a random forest classifier, named CoDES (COnservation Driven Expert System). Testing of CoDES on independent datasets, as well as results of its comparative performance with machine-learning methods recently developed in the field for the scoring of docking decoys, confirm its state-of-the-art ability to discriminate correct from incorrect decoys both in terms of global parameters and in terms of decoys ranked at the top positions.

Author(s):  
Varsha D Badal ◽  
Petras J Kundrotas ◽  
Ilya A Vakser

Abstract Motivation Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, absence of post-processing of the spotted residues reduced usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins. Results We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing is performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to utilize computationally demanding DRNN approach, which is computationally expensive especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full text articles. Availability The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Singaravelan Shanmugasundaram ◽  
Parameswari M.

Utilizing machine learning approaches as non-obtrusive strategies is an elective technique in organizing perpetual liver infections for staying away from the downsides of biopsy. This chapter assesses diverse machine learning methods in expectation of cutting-edge fibrosis by joining the serum bio-markers and clinical data to build up the order models. An imminent accomplice of patients with incessant hepatitis C was separated into two sets—one classified as gentle to direct fibrosis (F0-F2) and the other ordered as cutting-edge fibrosis (F3-F4) as per METAVIR score. Grey wolf optimization, random forest classifier, and decision tree procedure models for cutting-edge fibrosis chance expectation were created. Recipient working trademark bend investigation was performed to assess the execution of the proposed models.


2020 ◽  
Vol 9 (9) ◽  
pp. 504
Author(s):  
Quy Truong ◽  
Guillaume Touya ◽  
Cyril Runz

Though Volunteered Geographic Information (VGI) has the advantage of providing free open spatial data, it is prone to vandalism, which may heavily decrease the quality of these data. Therefore, detecting vandalism in VGI may constitute a first way of assessing the data in order to improve their quality. This article explores the ability of supervised machine learning approaches to detect vandalism in OpenStreetMap (OSM) in an automated way. For this purpose, our work includes the construction of a corpus of vandalism data, given that no OSM vandalism corpus is available so far. Then, we investigate the ability of random forest methods to detect vandalism on the created corpus. Experimental results show that random forest classifiers perform well in detecting vandalism in the same geographical regions that were used for training the model and has more issues with vandalism detection in “unfamiliar regions”.


2021 ◽  
Author(s):  
Zuzana Jandova ◽  
Attilio Vittorio Vargiu ◽  
Alexandre M.J.J. Bonvin

Molecular docking excels at creating a plethora of potential models of protein-protein complexes. To correctly distinguish the favourable, native-like models from the remaining ones remains, however, a challenge. We assessed here if a protocol based on molecular dynamics (MD) simulations would allow to distinguish native from non-native models to complement scoring functions used in docking. To this end, first models for 25 protein-protein complexes were generated using HADDOCK. Next, MD simulations complemented with machine learning were used to discriminate between native and non-native complexes based on a combination of metrics reporting on the stability of the initial models. Native models showed higher stability in almost all measured properties, including the key ones used for scoring in the CAPRI competition, namely the positional root mean square deviations and fraction of native contacts from the initial docked model. A Random Forest classifier was trained, reaching 0.85 accuracy in correctly distinguishing native from non-native complexes. Reasonably modest simulation lengths in the order of 50 to 100 ns are already sufficient to reach this accuracy, which makes this approach applicable in practice.


Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relevant importance for each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance for each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept and, the resultant RF models were tested on CASF-2013.5 In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificial designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which share the same peak positions with GARF but have fixed peak heights. The results of accuracy comparison from RF models based on the scrambled, uniform, and original GARF potential clearly showed that the peak positions in the GARF potential are important while the well depths are not. <br>


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Elisabeth Sartoretti ◽  
Thomas Sartoretti ◽  
Michael Wyss ◽  
Carolin Reischauer ◽  
Luuk van Smoorenburg ◽  
...  

AbstractWe sought to evaluate the utility of radiomics for Amide Proton Transfer weighted (APTw) imaging by assessing its value in differentiating brain metastases from high- and low grade glial brain tumors. We retrospectively identified 48 treatment-naïve patients (10 WHO grade 2, 1 WHO grade 3, 10 WHO grade 4 primary glial brain tumors and 27 metastases) with either primary glial brain tumors or metastases who had undergone APTw MR imaging. After image analysis with radiomics feature extraction and post-processing, machine learning algorithms (multilayer perceptron machine learning algorithm; random forest classifier) with stratified tenfold cross validation were trained on features and were used to differentiate the brain neoplasms. The multilayer perceptron achieved an AUC of 0.836 (receiver operating characteristic curve) in differentiating primary glial brain tumors from metastases. The random forest classifier achieved an AUC of 0.868 in differentiating WHO grade 4 from WHO grade 2/3 primary glial brain tumors. For the differentiation of WHO grade 4 tumors from grade 2/3 tumors and metastases an average AUC of 0.797 was achieved. Our results indicate that the use of radiomics for APTw imaging is feasible and the differentiation of primary glial brain tumors from metastases is achievable with a high degree of accuracy.


2017 ◽  
Vol 25 (3) ◽  
pp. 811-827 ◽  
Author(s):  
Dimitris Spathis ◽  
Panayiotis Vlamos

This study examines the clinical decision support systems in healthcare, in particular about the prevention, diagnosis and treatment of respiratory diseases, such as Asthma and chronic obstructive pulmonary disease. The empirical pulmonology study of a representative sample (n = 132) attempts to identify the major factors that contribute to the diagnosis of these diseases. Machine learning results show that in chronic obstructive pulmonary disease’s case, Random Forest classifier outperforms other techniques with 97.7 per cent precision, while the most prominent attributes for diagnosis are smoking, forced expiratory volume 1, age and forced vital capacity. In asthma’s case, the best precision, 80.3 per cent, is achieved again with the Random Forest classifier, while the most prominent attribute is MEF2575.


In universities, student dropout is a major concern that reflects the university's quality. Some characteristics cause students to drop out of university. A high dropout rate of students affects the university's reputation and the student's careers in the future. Therefore, there's a requirement for student dropout analysis to enhance academic plan and management to scale back student's drop out from the university also on enhancing the standard of the upper education system. The machine learning technique provides powerful methods for the analysis and therefore the prediction of the dropout. This study uses a dataset from a university representative to develop a model for predicting student dropout. In this work, machine- learning models were used to detect dropout rates. Machine learning is being more widely used in the field of knowledge mining diagnostics. Following an examination of certain studies, we observed that dropout detection may be done using several methods. We've even used five dropout detection models. These models are Decision tree, Naïve bayes, Random Forest Classifier, SVM and KNN. We used machine-learning technology to analyze the data, and we discovered that the Random Forest classifier is highly promising for predicting dropout rates, with a training accuracy of 94% and a testing accuracy of 86%.


Sign in / Sign up

Export Citation Format

Share Document