Protein–protein interaction site prediction using random forest proximity distance

Author(s):  
Zhijun Qiu ◽  
Qingjie Liu

A front-end method based on random forest proximity distance (PD) is used to screen the test set to improve protein–protein interaction site (PPIS) prediction. The distance metrics are assessed under the assumption that a higher-quality distance definition leads to higher classification performance. On an independent test set, numerical analysis based on statistical inference shows that PD has an advantage over the Mahalanobis and cosine distances. Because the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize the proximity distance; it adjusts the tree composition of the random forest model by adjusting the size of the training set. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, 75PD yielded higher Matthews correlation coefficient and F1 score values than the PD produced by the original training set, and the differences were statistically significant. All numerical experiments show that the closer the test data are to the training data, the better the prediction results of the predictor. These results indicate that the iterative method can optimize the proximity distance definition and that the distance information provided by PD can be used to indicate the reliability of prediction results.
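The screening idea can be illustrated with a minimal sketch. It assumes the common definition of random forest proximity (the fraction of trees in which two samples land in the same leaf) and takes the distance to be 1 minus that proximity; the abstract does not state the authors' exact formula, so this is an assumption. The leaf indices, the function names, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def proximity(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Fraction of trees in which two samples fall into the same leaf."""
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("leaf vectors must be non-empty and of equal length")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def proximity_distance(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Assumed distance definition: 1 - proximity (0 means same leaf in every tree)."""
    return 1.0 - proximity(leaves_a, leaves_b)


def min_distance_to_training(
    test_leaves: Sequence[int], train_leaves: Sequence[Sequence[int]]
) -> float:
    """Distance from one test sample to its nearest training sample."""
    return min(proximity_distance(test_leaves, t) for t in train_leaves)


def screen(
    test_set: Sequence[Sequence[int]],
    train_leaves: Sequence[Sequence[int]],
    threshold: float,
) -> List[int]:
    """Indices of test samples whose nearest training sample is within threshold."""
    return [
        i
        for i, t in enumerate(test_set)
        if min_distance_to_training(t, train_leaves) <= threshold
    ]


# Hypothetical leaf indices for a 4-tree forest: one row per sample, one column per tree.
train_leaves = [[1, 3, 2, 5], [1, 4, 2, 6], [2, 3, 7, 5]]
test_leaves = [[1, 3, 2, 6], [9, 9, 9, 9]]

# Keep only test samples that sit close to the training data (threshold is illustrative).
kept = screen(test_leaves, train_leaves, threshold=0.5)
print(kept)
```

In practice the leaf indices would come from a fitted forest rather than being written by hand. Predictions for samples that fail the screen would be treated as less reliable, which matches the abstract's observation that prediction quality improves as test data get closer to the training data.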

SLEEP ◽  
2020 ◽  
Vol 43 (Supplement_1) ◽  
pp. A224-A225
Author(s):  
S Kim ◽  
K Yang

Abstract Introduction The aim of this study was to develop a prediction model for moderate-to-severe obstructive sleep apnea (OSA) using advanced tree models. Methods We retrospectively investigated the medical records of patients who had undergone overnight polysomnography (PSG) at our sleep disorders center. We randomly divided the data into a training set (70%) and a test set (30%). We built a random forest model and an XGBoost model to predict moderate-to-severe OSA (apnea hypopnea index [AHI] ≥ 15/h) using the training set, and then applied each model to the test set. To compare model performance, we used accuracy and the area under the curve (AUC). Results A total of 1,426 patients (AHI < 5 : AHI ≥ 15 = 464 : 962) were enrolled. The random forest model showed an accuracy of 0.79 and an AUC of 0.82. In the random forest model, the sleep apnea scale of the sleep disorders questionnaire (SA-SDQ), age, neck circumference, male sex, body mass index (BMI), hypertension, and hyperlipidemia ranked, in that order, by variable importance. The XGBoost model showed an accuracy of 0.75 and an AUC of 0.79. Conclusion The random forest model for predicting moderate-to-severe OSA performed better than the XGBoost model. Further study is required for validation. Support None
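The evaluation protocol described above (a random 70/30 split, then accuracy and AUC on the test set) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions, and the model-fitting step is omitted. AUC is computed through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(
    rows: Sequence, test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle the rows and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> float:
    """Share of cases whose thresholded score matches the true label."""
    correct = sum(
        1 for y, s in zip(y_true, y_score) if (s >= threshold) == bool(y)
    )
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: ten record IDs split 70/30.
train_ids, test_ids = train_test_split(range(10), test_fraction=0.3, seed=42)

# Hypothetical test-set labels (1 = AHI >= 15/h) and model scores.
y_true = [1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.2]
print(accuracy(y_true, y_score), roc_auc(y_true, y_score))
```

Accuracy depends on the chosen threshold while AUC does not, which is one reason to report both when comparing two models on the same test set.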


2022 ◽  
Vol 11 ◽  
Author(s):  
Huangqi Zhang ◽  
Binhao Zhang ◽  
Wenting Pan ◽  
Xue Dong ◽  
Xin Li ◽  
...  

Purpose This study aimed to develop a repeatable MRI-based machine learning model to differentiate between low-grade gliomas (LGGs) and glioblastoma (GBM) and provide more clinical information to improve treatment decision-making. Methods Preoperative MRIs of gliomas from The Cancer Imaging Archive (TCIA)–GBM/LGG database were selected. The tumor on contrast-enhanced MRI was segmented. Quantitative image features were extracted from the segmentations. A random forest classification algorithm was used to establish a model in the training set. In the test phase, a random forest model was tested using an external test set. Three radiologists reviewed the images for the external test set. The area under the receiver operating characteristic curve (AUC) was calculated. The AUCs of the radiomics model and radiologists were compared. Results The random forest model was fitted using a training set consisting of 142 patients [mean age, 52 years ± 16 (standard deviation); 78 men] comprising 88 cases of GBM. The external test set included 25 patients (14 with GBM). Random forest analysis yielded an AUC of 1.00 [95% confidence interval (CI): 0.86–1.00]. The AUCs for the three readers were 0.92 (95% CI 0.74–0.99), 0.70 (95% CI 0.49–0.87), and 0.59 (95% CI 0.38–0.78). Statistical differences were only found between AUC and Reader 1 (1.00 vs. 0.92, respectively; p = 0.16). Conclusion An MRI radiomics-based random forest model was proven useful in differentiating GBM from LGG and showed better diagnostic performance than that of two inexperienced radiologists.


2021 ◽  
Vol 49 (3) ◽  
pp. 030006052199398
Author(s):  
Jinwu Peng ◽  
Zhili Duan ◽  
Yamin Guo ◽  
Xiaona Li ◽  
Xiaoqin Luo ◽  
...  

Objectives Liver echinococcosis is a severe zoonotic disease caused by Echinococcus (tapeworm) infection, which is epidemic in the Qinghai region of China. Here, we aimed to explore biomarkers and establish a predictive model for the diagnosis of liver echinococcosis. Methods Microarray profiling followed by Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analysis was performed in liver tissue from patients with liver hydatid disease and from healthy controls from the Qinghai region of China. A protein–protein interaction (PPI) network and random forest model were established to identify potential biomarkers and predict the occurrence of liver echinococcosis, respectively. Results Microarray profiling identified 1152 differentially expressed genes (DEGs), including 936 upregulated genes and 216 downregulated genes. Several previously unreported biological processes and signaling pathways were identified. The FCGR2B and CTLA4 proteins were identified by the PPI networks and random forest model. The random forest model based on FCGR2B and CTLA4 reliably predicted the occurrence of liver hydatid disease, with an area under the receiver operating characteristic curve of 0.921. Conclusion Our findings give new insight into gene expression in patients with liver echinococcosis from the Qinghai region of China, improving our understanding of hepatic hydatid disease.


2018 ◽  
Vol 7 (2.21) ◽  
pp. 339 ◽  
Author(s):  
K Ulaga Priya ◽  
S Pushpa ◽  
K Kalaivani ◽  
A Sartiha

In the banking industry, identifying customers likely to default is a tedious part of loan processing. Manual prediction of default customers may result in bad loans in the future. Banks possess huge volumes of behavioral data but are unable to use them to judge which customers will default. Modern techniques such as machine learning support analytical processing through supervised and unsupervised learning. A data model for predicting default customers using the random forest technique is proposed. The model is evaluated on the training set, and, based on the performance parameters, the final prediction is made on the test set. The results provide evidence that the random forest technique can help banks predict loan defaulters with high accuracy.


2005 ◽  
Vol 102 (10) ◽  
pp. 3593-3598 ◽  
Author(s):  
E. H. Kong ◽  
N. Heldring ◽  
J.-A. Gustafsson ◽  
E. Treuter ◽  
R. E. Hubbard ◽  
...  

Author(s):  
Priyanka A. Agharkar ◽  
Manivannan Ethirajan ◽  
Jianqun Liao ◽  
Michael Yemma ◽  
Andrew Magis ◽  
...  

EP Europace ◽  
2019 ◽  
Vol 21 (9) ◽  
pp. 1307-1312 ◽  
Author(s):  
Wei-Syun Hu ◽  
Meng-Hsuen Hsieh ◽  
Cheng-Li Lin

Abstract Aims We aimed to construct a random forest model to predict atrial fibrillation (AF) in a Chinese population. Methods and results This study comprised 682 237 subjects with or without AF. Each subject had 19 features that included the subject's age, gender, underlying diseases, CHA2DS2-VASc score, and follow-up period. The data were split into train and test sets at an approximate 9:1 ratio: 614 013 data points were placed into the train set and 68 224 data points were placed into the test set. In this study, weighted average F1, precision, and recall values were used to measure prediction model performance. The F1, precision, and recall values were calculated across the train set, the test set, and all data. The area under the receiver operating characteristic (ROC) curve was also used to evaluate the performance of the prediction model. The prediction model achieved a k-fold cross-validation accuracy of 0.979 (k = 10). In the test set, the prediction model achieved an F1 value of 0.968, precision value of 0.958, and recall value of 0.979. The area under the ROC curve of the model was 0.948 (95% confidence interval 0.947–0.949). This model was validated with a separate dataset. Conclusions This study showed a novel AF risk prediction scheme for Chinese individuals using random forest methodology.
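The weighted average metrics mentioned above can be sketched as follows, assuming the usual definition in which per-class precision, recall, and F1 are averaged with weights proportional to each class's share of the true labels. The abstract does not spell out its formula, so that definition, the function names, and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def per_class_prf(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


def weighted_prf(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Support-weighted average of per-class precision, recall, and F1."""
    support = Counter(y_true)  # number of true cases per class
    total = len(y_true)
    sums = [0.0, 0.0, 0.0]
    for label, count in support.items():
        for i, value in enumerate(per_class_prf(y_true, y_pred, label)):
            sums[i] += value * count / total
    return sums[0], sums[1], sums[2]


# Toy imbalanced data: 2 positive cases and 8 negative cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
w_precision, w_recall, w_f1 = weighted_prf(y_true, y_pred)
print(w_precision, w_recall, w_f1)
```

With support weighting, the majority class dominates the average, so on imbalanced data a high weighted score can coexist with weak performance on the minority class. In the toy data the minority class scores 0.5 on all three metrics while the weighted values are higher.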


2016 ◽  
Vol 92 (1-2) ◽  
pp. 105-116 ◽  
Author(s):  
Hong Li ◽  
Shiping Yang ◽  
Chuan Wang ◽  
Yuan Zhou ◽  
Ziding Zhang
