Deep exploration of random forest model boosts the interpretability of machine learning studies of complicated immune responses and lung burden of nanoparticles

2021 ◽  
Vol 7 (22) ◽  
pp. eabf4130
Author(s):  
Fubo Yu ◽  
Changhong Wei ◽  
Peng Deng ◽  
Ting Peng ◽  
Xiangang Hu

The development of machine learning provides solutions for predicting the complicated immune responses and pharmacokinetics of nanoparticles (NPs) in vivo. However, highly heterogeneous data in NP studies remain challenging because of the low interpretability of machine learning. Here, we propose a tree-based random forest feature importance and feature interaction network analysis framework (TBRFA) and accurately predict the pulmonary immune responses and lung burden of NPs, with the correlation coefficient of all training sets >0.9 and half of the test sets >0.75. This framework overcomes the feature importance bias brought by small datasets through a multiway importance analysis. TBRFA also builds feature interaction networks, boosts model interpretability, and reveals hidden interactional factors (e.g., various NP properties and exposure conditions). TBRFA provides guidance for the design and application of ideal NPs and discovers the feature interaction networks that contribute to complex systems with small-size data in various fields.
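A minimal sketch of the two ingredients named above, impurity-based feature importance and a tree-derived feature interaction network, using scikit-learn on synthetic data (the pairwise co-occurrence count below is a simplified stand-in for TBRFA's interaction analysis, not the authors' implementation, and the data are not the NP dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for NP property / exposure-condition features.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one of several views a multiway analysis combines.
importances = rf.feature_importances_

# Crude interaction proxy: count how often two features are both used by the
# same tree; TBRFA builds its network from richer within-tree statistics.
pair_counts = {}
for est in rf.estimators_:
    used = sorted({f for f in est.tree_.feature if f >= 0})
    for i, a in enumerate(used):
        for b in used[i + 1:]:
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

top_pair = max(pair_counts, key=pair_counts.get)
print(np.round(importances, 3), top_pair)
```

The co-occurrence counts only show where an interaction network can come from; the paper's framework additionally corrects the importance bias that small datasets introduce.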

2020 ◽  
Author(s):  
Piyush Mathur ◽  
Tavpritesh Sethi ◽  
Anya Mathur ◽  
Kamal Maheshwari ◽  
Jacek Cywinski ◽  
...  

Introduction: The COVID-19 pandemic exhibits an uneven geographic spread, which leads to a locational mismatch of testing, mitigation measures, and allocation of healthcare resources (human, equipment, and infrastructure).(1) In the absence of effective treatment, understanding and predicting the spread of COVID-19 is unquestionably valuable for public health and hospital authorities planning for and managing the pandemic. While many models have been developed to predict mortality, the authors sought to develop a machine learning prediction model that provides an estimate of the relative association of socioeconomic, demographic, travel, and health care characteristics with COVID-19 mortality among states in the United States (US). Methods: State-wise data were collected for all the features predicting COVID-19 mortality and for deriving feature importance (eTable 1 in the Supplement).(2) Key feature categories include demographic characteristics of the population, pre-existing healthcare utilization, travel, weather, socioeconomic variables, racial distribution, and timing of disease mitigation measures (Figures 1 & 2). Two machine learning models, CatBoost regression and random forest, were trained independently to predict mortality in states on data partitioned into a training (80%) and test (20%) set.(3) Accuracy of the models was assessed by R2 score. Importance of the features for prediction of mortality was calculated via two machine learning methods: SHAP (SHapley Additive exPlanations), calculated on the CatBoost model, and Boruta, a random forest based method trained with 10,000 trees for calculating statistical significance (3-5). Results: Results are based on 60,604 total deaths in the US as of April 30, 2020. The actual number of deaths ranged widely, from 7 (Wyoming) to 18,909 (New York). The CatBoost regression model obtained an R2 score of 0.99 on the training data set and 0.50 on the test set.
The random forest model obtained an R2 score of 0.88 on the training data set and 0.39 on the test set. Nine out of twenty variables were significantly higher than the maximum variable importance achieved by the shadow dataset in Boruta regression (Figure 2). Both models showed high feature importance for pre-existing high healthcare utilization, reflected in nursing home beds per capita and doctors per 100,000 population. Overall population characteristics such as total population and population density also correlated positively with the number of deaths. Notably, both models revealed a high positive correlation of deaths with the percentage of African Americans. Direct flights from China, especially Wuhan, were also significant predictors of deaths in both models, reflecting early spread of the disease. Associations between deaths and weather patterns, hospital bed capacity, median age, and timing of administrative action to mitigate disease spread, such as the closure of educational institutions or stay-at-home orders, were not significant. The lack of some associations, e.g., with administrative action, may reflect delayed outcomes of interventions not yet reflected in the data. Discussion: COVID-19 has varied spread and mortality across communities in different states in the US. While our models show that high population density, pre-existing need for medical care, and foreign travel may increase transmission and thus COVID-19 mortality, the effect of geographic, climate, and racial disparities on COVID-19 related mortality is not clear. The purpose of our study was not accurate state-wise prediction of deaths in the US, which has already proven challenging.(6) Location-based understanding of the key determinants of COVID-19 mortality is critically needed for focused targeting of mitigation and control measures. Risk assessment-based understanding of the determinants affecting COVID-19 outcomes, using a dynamic and scalable machine learning model such as the two proposed, can help guide resource management and policy frameworks.
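The Boruta step described above can be illustrated with a simplified shadow-feature test in plain scikit-learn (synthetic data, not the state-level dataset; the real Boruta procedure repeats this comparison with 10,000 trees and formal significance testing):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Shadow features: shuffled copies of the real features, which by construction
# carry no signal. A real feature is tentatively "confirmed" only if its
# importance beats every shadow feature's importance.
rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)

X_shadow = rng.permuted(X, axis=0)        # shuffle each column independently
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(np.hstack([X, X_shadow]), y)

real, shadow = rf.feature_importances_[:8], rf.feature_importances_[8:]
confirmed = np.where(real > shadow.max())[0]
print(confirmed)
```

The "maximum variable importance achieved by the shadow dataset" in the abstract is exactly the `shadow.max()` threshold here.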


2019 ◽  
Vol 8 (4) ◽  
pp. 1477-1483

With rapid technological advancement, internet usage has increased in all fields. Money transactions for applications such as online shopping, banking, bill settlement, online ticket booking for travel and hotels, fee payments to educational organizations, payments for hospital treatment, and supermarket purchases use online credit card transactions. This opens the door to fraudulent use of other accounts and transactions, resulting in loss of service and profit to the institution. With this background, this paper focuses on predicting fraudulent credit card transactions. The Credit Card Transaction dataset from the Kaggle machine learning repository is used for the prediction analysis. The analysis of fraudulent credit card transactions is carried out in four steps. First, the relationships between the variables of the dataset are identified and represented graphically. Second, the feature importance of the dataset is identified using Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting, and Naive Bayes classifiers. Third, the extracted feature importance of the credit card transaction dataset is fitted to the Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting, and Naive Bayes classifiers. Fourth, a performance analysis is done using metrics such as Accuracy, F-Score, AUC Score, Precision, and Recall. The implementation is done in Python in the Anaconda Spyder Integrated Development Environment. Experimental results show that the Decision Tree classifier achieved the most effective prediction, with a precision of 1.0, recall of 1.0, F-Score of 1.0, AUC Score of 89.09, and accuracy of 99.92%.
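A hedged sketch of the fourth step, computing the reported metrics for one of the classifiers on an imbalanced synthetic dataset (a stand-in for the Kaggle data, which is not bundled here):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Heavily imbalanced synthetic stand-in for fraud vs. legitimate transactions.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

prec = precision_score(y_te, y_pred)
rec = recall_score(y_te, y_pred)
f1 = f1_score(y_te, y_pred)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
acc = accuracy_score(y_te, y_pred)
print(f"precision={prec:.2f} recall={rec:.2f} f1={f1:.2f} auc={auc:.2f} acc={acc:.2f}")
```

On data this imbalanced, accuracy sits near the majority-class rate by construction, which is why the paper also reports precision, recall, F-Score, and AUC.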


2021 ◽  
Author(s):  
Tarik Abdelfattah ◽  
Ehsaan Nasir ◽  
Junjie Yang ◽  
Jamar Bynum ◽  
Alexander Klebanov ◽  
...  

Abstract Unconventional reservoir development is a multidisciplinary challenge due to a complicated physical system, including but not limited to complicated flow mechanisms, multiple porosity systems, heterogeneous subsurface rock and minerals, well interference, and fluid-rock interaction. With enough well data, physics-based models can be supplemented with data driven methods to describe a reservoir system and accurately predict well performance. This study uses a data driven approach to tackle the field development problem in the Eagle Ford Shale. A large amount of data spanning major oil and gas disciplines was collected and interrogated from around 300 wells in the area of interest. The data driven workflow consists of: a descriptive model to regress on existing wells with the selected well features and provide insight on feature importance, a predictive model to forecast well performance, and a subject matter expert driven prescriptive model to optimize future well design for improved well economics. To evaluate initial well economics, 365 consecutive days of oil production per CAPEX dollar spent (bbl/$) was set up as the objective function. After careful model selection, Random Forest (RF) showed the best accuracy on the given dataset, and Differential Evolution (DE) was used for optimization. Using recursive feature elimination (RFE), the final master dataset was reduced to 50 parameters to feed into the machine learning model. After hyperparameter tuning, reasonable regression accuracy was achieved by the Random Forest algorithm: the correlation coefficient (R2) for the training and test dataset was 0.83, and the mean absolute error percentage (MAEP) was less than 20%. The model also reveals that well performance is highly dependent on a good combination of variables spanning geology, drilling, completions, production, and reservoir.
Completion year has one of the highest feature importances, reflecting improvements in operational and design efficiency and fluctuations in service cost. Moreover, lateral rate of penetration (ROP) was always among the top two most important parameters, most likely because it significantly impacts drilling cost. With subject matter experts’ (SME) input, optimization using the regression model was performed iteratively with the chosen parameters and reasonable upper and lower bounds. Compared to the best existing wells in the vicinity, the optimized well design shows a potential improvement in bbl/$ of approximately 38%. This paper introduces an integrated data driven solution for optimizing unconventional development strategy. Compared to conventional analytical and numerical methods, the machine learning model is able to handle large multidimensional datasets and provide actionable recommendations with a much faster turnaround. In the course of field development, the model accuracy can be dynamically improved by including more data collected from new wells.
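The RFE-then-regress-then-optimize loop described above can be sketched with scikit-learn and SciPy (synthetic features and bounds; the actual well parameters, dataset, and SME constraints are not reproduced here):

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Hypothetical stand-in for the well dataset; sizes and bounds are illustrative.
X, y = make_regression(n_samples=300, n_features=12, n_informative=5, random_state=0)

# 1. Recursive feature elimination shrinks the design space.
rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=5).fit(X, y)
X_sel = X[:, rfe.support_]

# 2. Regression model on the surviving features.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_sel, y)

# 3. Differential evolution searches feature space for the design that
#    maximizes the model's predicted objective (bbl/$ in the paper).
bounds = list(zip(X_sel.min(axis=0), X_sel.max(axis=0)))
result = differential_evolution(lambda v: -rf.predict(v.reshape(1, -1))[0],
                                bounds, seed=0, maxiter=15, polish=False)
print("best predicted objective:", round(-result.fun, 2))
```

In practice the search bounds and candidate designs would come from SME input rather than the training data's min/max, as the paper emphasizes.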


2017 ◽  
Author(s):  
Peter F. Neher ◽  
Marc-Alexandre Côté ◽  
Jean-Christophe Houde ◽  
Maxime Descoteaux ◽  
Klaus H. Maier-Hein

Abstract We present a fiber tractography approach based on a random forest classification and voting process, guiding each step of the streamline progression by directly processing raw diffusion-weighted signal intensities. For comparison to the state of the art, i.e., tractography pipelines that rely on mathematical modeling, we performed a quantitative and qualitative evaluation with multiple phantom and in vivo experiments, including a comparison to the 96 submissions of the ISMRM tractography challenge 2015. The results demonstrate the vast potential of machine learning for fiber tractography.
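The classification-and-voting idea can be illustrated in miniature: a random forest maps raw signal vectors to discretized step directions, and the per-class vote fractions guide the streamline (toy data only; the paper's features also include neighborhood signal and the previous direction):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: fake "signal" vectors weakly tied to one of 8 discretized directions.
rng = np.random.default_rng(0)
directions = rng.integers(0, 8, size=600)
signal = rng.normal(size=(600, 30)) + directions[:, None] * 0.5

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(signal, directions)

# Voting: tree-ensemble class probabilities pick the next streamline step.
probs = clf.predict_proba(signal[:1])[0]
step = int(np.argmax(probs))
print(step, round(float(probs.max()), 2))
```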


Medicines ◽  
2021 ◽  
Vol 8 (11) ◽  
pp. 66
Author(s):  
Charat Thongprayoon ◽  
Caroline C. Jadlowiec ◽  
Napat Leeaphorn ◽  
Jackrapong Bruminhent ◽  
Prakrati C. Acharya ◽  
...  

Background: Black kidney transplant recipients have worse allograft outcomes compared to White recipients. The feature importance and feature interaction network analysis framework of machine learning random forest (RF) analysis may provide an understanding of RF structures to design strategies to prevent acute rejection among Black recipients. Methods: We conducted a tree-based RF feature importance analysis of Black kidney transplant recipients in the United States from 2015 to 2019 in the UNOS database using the number of nodes, accuracy decrease, Gini decrease, times_a_root, p value, and mean minimal depth. Feature interaction analysis was also performed to evaluate the most frequent occurrences in the RF classification run between correlated and uncorrelated pairs. Results: A total of 22,687 Black kidney transplant recipients were eligible for analysis. Of these, 1330 (6%) had acute rejection within 1 year after kidney transplant. Important variables in the RF models for acute rejection among Black kidney transplant recipients included recipient age, ESKD etiology, PRA, cold ischemia time, donor age, HLA DR mismatch, BMI, serum albumin, degree of HLA mismatch, education level, and dialysis duration. The three most frequent interactions were pairs of numerical variables: recipient age:donor age, recipient age:serum albumin, and recipient age:BMI, respectively. Conclusions: The application of the tree-based RF feature importance and feature interaction network analysis framework identified recipient age, ESKD etiology, PRA, cold ischemia time, donor age, HLA DR mismatch, BMI, serum albumin, degree of HLA mismatch, education level, and dialysis duration as important variables in the RF models for acute rejection among Black kidney transplant recipients in the United States.
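One of the importance measures listed above, mean minimal depth, can be computed directly from scikit-learn tree structures; a sketch on synthetic data (not the UNOS cohort):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def minimal_depth(est, feature):
    """Depth of the shallowest split on `feature` in one tree (inf if unused)."""
    t = est.tree_
    best = np.inf
    stack = [(0, 0)]                      # (node id, depth)
    while stack:
        node, depth = stack.pop()
        if t.feature[node] < 0:           # leaf node
            continue
        if t.feature[node] == feature:
            best = min(best, depth)
        stack += [(t.children_left[node], depth + 1),
                  (t.children_right[node], depth + 1)]
    return best

# Average over trees: features split near the root (small depth) matter most.
mean_min_depth = [float(np.mean([minimal_depth(e, f) for e in rf.estimators_]))
                  for f in range(6)]
print(np.round(mean_min_depth, 2))
```

Measures such as accuracy decrease and times_a_root are derived from the same tree structures; established packages (e.g., randomForestExplainer in R) compute the full set.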


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ebraheem Alzahrani ◽  
Wajdi Alghamdi ◽  
Malik Zaka Ullah ◽  
Yaser Daanial Khan

Abstract Proteins are a vital component of cells that perform physiological functions to ensure the smooth operation of bodily processes. Identifying a protein's function involves a detailed understanding of protein structure. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are conserved across many eukaryotic and prokaryotic lineages and perform varied crucial functional activities inside a cell. The in vivo, ex vivo, and in vitro identification of stress proteins is a time-consuming and costly task. This study aims to identify stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest showed remarkable results, with 91.1% accuracy, while models based on a neural network and a support vector machine showed 87.7% and 47.0% accuracy, respectively. Based on the evaluation results, it was concluded that the random forest based classifier surpassed all other predictors and is suitable for practical use in identifying stress proteins. A live web server is available at http://biopred.org/stressprotiens, and the web server code is available at https://github.com/abdullah5naveed/SRP_WebServer.git
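A minimal sketch of sequence-based classification in this spirit: amino-acid composition features fed to a random forest (toy sequences and labels; the study's statistical-moment features are richer than plain composition):

```python
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return [counts.get(a, 0) / len(seq) for a in AA]

# Toy labelled sequences (1 = stress protein, 0 = other), for illustration only.
seqs = ["MKKLLPT", "GGGSGGS", "MKPLLST", "GGSSGGG", "MKKLLTT", "GGGSSGG"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([composition(s) for s in seqs], labels)

# Query sequence resembles the K/L-rich class-1 toys.
pred = int(clf.predict([composition("MKKLLPS")])[0])
print(pred)
```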


2021 ◽  
Author(s):  
Meghana Venkata Palukuri ◽  
Edward M Marcotte

Protein complexes can be computationally identified from protein-interaction networks with community detection methods, suggesting new multi-protein assemblies. Most community detection algorithms tend to be un- or semi-supervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and parallel algorithms, respectively. Here, we present Super.Complex, a distributed supervised machine learning pipeline for community detection in networks. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities with epsilon-greedy and pseudo-metropolis criteria, and an embarrassingly parallel implementation can be run on a computer cluster to scale to large networks. To evaluate Super.Complex, we propose three new measures for the still outstanding issue of comparing sets of learned and known communities. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, including 234 complexes linked to SARS-CoV-2 and 111 uncharacterized proteins present in 103 learned complexes. Super.Complex is generalizable and can be used in different applications of community detection, with the ability to improve results by incorporating domain-specific features. Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.
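The epsilon-greedy community growth at the heart of the pipeline can be sketched on a toy graph (the fitness function here is plain internal edge density, whereas Super.Complex learns its fitness function from known communities):

```python
import random

# Toy graph: two triangles joined by one edge; fitness is internal edge density.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def fitness(nodes):
    internal = sum(1 for u, v in edges if u in nodes and v in nodes)
    possible = len(nodes) * (len(nodes) - 1) / 2
    return internal / possible if possible else 0.0

def grow(seed, eps=0.1, rng=None):
    """Greedily add the best-scoring neighbor; explore randomly with prob. eps."""
    rng = rng or random.Random(0)
    community = {seed}
    while True:
        frontier = set().union(*(adj[n] for n in community)) - community
        if not frontier:
            break
        if rng.random() < eps:
            candidate = rng.choice(sorted(frontier))    # exploration step
        else:
            candidate = max(sorted(frontier), key=lambda n: fitness(community | {n}))
        if fitness(community | {candidate}) < fitness(community):
            break                          # no candidate improves the score; stop
        community.add(candidate)
    return community

result = grow(0)
print(result)
```

Starting from node 0, the search absorbs its triangle and then stops, because crossing the bridge edge would dilute the density; the pseudo-metropolis variant would sometimes accept such a worsening move.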


2017 ◽  
Author(s):  
Tom Hiscock

Abstract Biological systems rely on complex networks, such as transcriptional circuits and protein-protein interaction networks, to perform a variety of functions, e.g., responding to stimuli, directing cell fate, or patterning an embryo. Mathematical models are often used to ask: given some network, what function does it perform? However, we often want precisely the opposite, i.e., given some circuit – either observed in vivo, or desired for some engineering objective – what biological networks could execute this function? Here, we adapt optimization algorithms from machine learning to rapidly screen and design gene circuits capable of performing arbitrary functions. We demonstrate the power of this approach by designing circuits (1) that recapitulate important in vivo phenomena, such as oscillators, and (2) that perform complex tasks for synthetic biology, such as counting noisy biological events. Our method can be readily applied to biological networks of any type and size, and is provided as an open-source and easy-to-use python module, GeneNet.
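The optimization idea, treating circuit parameters as trainable variables and minimizing a loss against a desired behaviour, can be sketched with SciPy on a single Hill-repression dose-response (illustrative only; GeneNet itself handles full circuit topologies and dynamics):

```python
import numpy as np
from scipy.optimize import minimize

# Target behaviour: a Hill-repression dose-response with k=2, n=4.
inputs = np.linspace(0.1, 10, 50)
target = 1.0 / (1.0 + (inputs / 2.0) ** 4)

def circuit(params, x):
    k, n = np.abs(params)          # keep parameters positive during the search
    return 1.0 / (1.0 + (x / k) ** n)

def loss(params):
    # Mean squared error between the circuit's output and the desired function.
    return float(np.mean((circuit(params, inputs) - target) ** 2))

fit = minimize(loss, x0=[1.0, 1.0], method="Nelder-Mead")
print(np.abs(fit.x).round(2))      # fitted magnitudes, target (k=2, n=4)
```

Screening candidate network topologies then amounts to running this fit per topology and keeping those whose optimized loss is low enough.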


2020 ◽  
Vol 7 ◽  
Author(s):  
Holmes Yesid Ayala-Yaguara ◽  
Gina Maribel Valenzuela-Sabogal ◽  
Alexander Espinosa-García

This article describes the construction of a data-mining model for the problem of student dropout in the Systems Engineering program of the Universidad de Cundinamarca, Facatativá campus. The model was structured using the KDD (knowledge discovery in databases) data-mining methodology, with the Python programming language, the Pandas data-processing library, and the Sklearn machine learning library. The process accounted for problems beyond the mining step itself, such as high dimensionality, so the univariate statistical, feature importance, and SelectFromModel (Sklearn) feature selection methods were applied. Five data-mining techniques were selected for evaluation: K nearest neighbors (KNN), decision tree (DT), random forest (RF), logistic regression (LR), and support vector machines (SVM). For the selection of the final model, the results of each model were evaluated on accuracy, the confusion matrix, and additional confusion-matrix metrics. Finally, the parameters of the selected model were tuned and its generalization was evaluated by plotting its learning curve.
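The selection methods mentioned (univariate statistical tests, feature importance, and SelectFromModel) are available directly in scikit-learn; a sketch on synthetic data (not the university's enrollment records):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

# Synthetic stand-in: 20 candidate predictors, 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Univariate statistical selection (ANOVA F-test), keeping the 10 best features.
univariate = SelectKBest(f_classif, k=10).fit(X, y)

# Model-based selection: keep features whose random forest importance
# exceeds the mean importance (SelectFromModel's default threshold).
model_based = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0)).fit(X, y)

print(univariate.get_support().sum(), model_based.get_support().sum())
```

Each selector's `transform` then yields the reduced feature matrix that the five candidate models are trained on.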


2021 ◽  
Author(s):  
Yanrong Cai ◽  
Xiang Jiang ◽  
Weifan Dai ◽  
Qinyuan Yu

Abstract Background: Fractures of the pelvis and/or acetabulum are among the leading risks of death worldwide. However, the capability of conventional systems to predict in-hospital mortality is so far limited. Here, we hypothesize that machine learning (ML) algorithms could provide better prediction performance than the traditional Simplified Acute Physiology Score (SAPS) II for patients with pelvic and acetabular trauma in the intensive care unit (ICU). Methods: We developed customized mortality prediction models with ML techniques based on MIMIC-III, an open access de-identified database consisting of data from more than 25,000 patients who were admitted to the Beth Israel Deaconess Medical Center (BIDMC). 307 patients were enrolled with an ICD-9 diagnosis of pelvic, acetabular, or combined pelvic and acetabular fractures and an ICU stay of more than 72 hours. ML models including decision tree, logistic regression, and random forest were built using the SAPS II features from the first 72 hours after ICU admission, and the traditional first-24-hours features were used to build the respective control models. We evaluated and compared each model’s performance through the area under the receiver-operating characteristic curve (AUROC). Feature importance was used to visualize the top risk factors for mortality. Results: All the ML models outperformed the traditional scoring system SAPS II (AUROC = 0.73), among which the best fitted random forest model achieved the best performance (AUROC of 0.90). Using the evolution of physiological features over time rather than 24-hour snapshots, all the ML models performed better than their respective controls. Age remained the top feature importance for all classifiers. Age, BUN (minimum value on day 2), and BUN (maximum value on day 3) were the top 3 predictor variables in the optimal random forest model. In the best decision tree model, the top 3 risk factors, in decreasing order of contribution, were age, the lowest systolic blood pressure on day 1, and the same value on day 3. Conclusion: The results suggest that mortality modeling with ML techniques could improve prediction performance in the context of pelvic and acetabular trauma and potentially support decision-making for orthopedics and ICU practitioners.
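The 72-hour feature construction described above, per-day summaries of a physiological variable feeding a random forest, can be sketched on synthetic patients (MIMIC-III itself requires credentialed access):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic patients: one hourly vital over 72 hours that drifts with a latent risk.
rng = np.random.default_rng(0)
n = 400
risk = rng.normal(size=n)
vitals = rng.normal(size=(n, 72)) - risk[:, None] * np.linspace(0, 1, 72)
died = (risk + rng.normal(scale=0.5, size=n) > 0.8).astype(int)

# Per-day min/max summaries, mirroring features like "BUN minimum on day 2".
days = vitals.reshape(n, 3, 24)
X = np.hstack([days.min(axis=2), days.max(axis=2)])

X_tr, X_te, y_tr, y_te = train_test_split(X, died, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(float(auroc), 2))
```

Restricting `days` to the first 24-hour window reproduces the control setup; the study's gain came from letting the models see the day-2 and day-3 evolution.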

