Predictive Modeling of Surgical Site Infections Using Sparse Laboratory Data

2020 ◽  
pp. 410-423
Author(s):  
Prabhu RV Shankar ◽  
Anupama Kesari ◽  
Priya Shalini ◽  
N. Kamalashree ◽  
Charan Bharadwaj ◽  
...  

As part of a data mining competition, training and test sets of laboratory test data for patients with and without surgical site infection (SSI) were provided. The task was to develop predictive models with the training set and identify patients with SSI in the unlabeled test set. Lab test results are vital resources that guide healthcare providers in making decisions about all aspects of surgical patient management. Many machine learning models were developed after pre-processing and imputing the lab test data; only the top-performing methods are discussed. Overall, Random Forest (RF) algorithms performed better than Support Vector Machine and Logistic Regression. Using a set of 74 lab tests with RF, there were only 4 false positives in the training set, and the model identified 35 of 50 SSI patients in the test set (Accuracy 0.86, Sensitivity 0.68, and Specificity 0.91). Optimal ways to address healthcare data quality concerns and imputation methods, as well as newer generalizable algorithms, need to be explored further to decipher new associations and knowledge among laboratory biomarkers and SSI.
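The three reported figures (accuracy, sensitivity, specificity) all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the function name and the counts are hypothetical and chosen only for illustration, since the abstract does not give the full confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total      # share of all patients classified correctly
    sensitivity = tp / (tp + fn)      # share of true SSI cases that were detected
    specificity = tn / (tn + fp)      # share of non-SSI cases correctly cleared
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not taken from the study).
acc, sens, spec = classification_metrics(tp=30, fp=10, tn=90, fn=20)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With imbalanced classes, as is typical for SSI, accuracy can stay high while sensitivity remains modest, which is why the abstract reports all three.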


SPE Journal ◽  
2018 ◽  
Vol 23 (04) ◽  
pp. 1075-1089 ◽  
Author(s):  
Jared Schuetter ◽  
Srikanta Mishra ◽  
Ming Zhong ◽  
Randy LaFollette (ret.)

Summary Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is to provide some clarity to this issue, focusing on how to build robust predictive models and how to develop decision rules that help identify factors separating good wells from poor performers. The data for this study come from wells completed in the Wolfcamp Shale Formation in the Permian Basin. Data categories used in the study included well location and assorted metrics capturing various aspects of well architecture, well completion, stimulation, and production. Predictive models for the production metric of interest are built using simple regression and other advanced methods such as random forests (RFs), support-vector regression (SVR), gradient-boosting machine (GBM), and multidimensional Kriging. The data-fitting process involves splitting the data into a training set and a test set, building a regression model on the training set and validating it with the test set. Repeated application of a “cross-validation” procedure yields valuable information regarding the robustness of each regression-modeling approach. Furthermore, decision rules that can identify extreme behavior in production wells (i.e., top x% of the wells vs. bottom x%, as ranked by the production metric) are generated using the classification and regression-tree algorithm. The resulting decision tree (DT) provides useful insights regarding what variables (or combinations of variables) can drive production performance into such extreme categories. 
The main contributions of this paper are to provide guidelines on how to build robust predictive models, and to demonstrate the utility of DTs for identifying factors responsible for good vs. poor wells.
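The split, fit, and validate loop described above can be sketched as follows. This is a minimal illustration assuming repeated random holdout splits and a one-variable least-squares line; the paper's actual models (RF, SVR, GBM, Kriging) and its exact cross-validation scheme are not reproduced here, and all names and data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def repeated_holdout_rmse(xs, ys, test_fraction=0.25, repeats=20, seed=0):
    """Fit on a random training subset, score RMSE on the held-out rest, repeat."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, int(n * test_fraction))
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        sq_err = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return scores


# Synthetic data for illustration only: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
scores = repeated_holdout_rmse(xs, ys)
print(f"mean RMSE = {sum(scores) / len(scores):.3f}, "
      f"spread = {max(scores) - min(scores):.3f}")
```

The spread of the held-out scores across repeats is the robustness signal: a model whose test error varies widely from split to split is less trustworthy than one with a similar mean but a tighter spread.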


2011 ◽  
Vol 460-461 ◽  
pp. 667-672
Author(s):  
Yun Zhao ◽  
Xing Xu ◽  
Yong He

The main objective of this paper is to classify four kinds of automobile lubricant by near-infrared (NIR) spectral technology and to observe whether NIR spectroscopy could be used for predicting water content. Principal component analysis (PCA) was applied to reduce the information from the spectral data, and the first two PCs were used to cluster the samples. Partial least squares (PLS), least squares support vector machine (LS-SVM), and Gaussian process classification (GPC) were employed to develop prediction models. There were 120 samples for the training set and test set. Two LS-SVM models, with the first five PCs and the first six PCs respectively, were built, and the accuracy of the model with five PCs is adequate with less calculation. The results from the experiment indicate that the LS-SVM model outperforms the PLS model and that the GPC model outperforms the LS-SVM model.


2007 ◽  
Vol 06 (03) ◽  
pp. 495-509 ◽  
Author(s):  
GUI-KAI YAN ◽  
JUN-JIE LI ◽  
BING-RUI LI ◽  
JIA HU ◽  
WEN-PING GUO

Support vector machine (SVM) is used to predict the enthalpies of formation at 298 K [Formula: see text] for 261 molecules based on B3LYP/6-311g (3df,2p) results. With the data randomly separated into two parts, 195 for the training set and 66 for the test set, the resulting mean absolute deviation (MAD) and maximum deviation (MD) for the training set are 1.51 kcal/mol and 9.23 kcal/mol (correlation coefficient R = 0.9995), and for the test set they are 1.78 kcal/mol and 7.31 kcal/mol (R = 0.9990). The result is improved according to G2 method.
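The two error measures quoted above can be computed as in the sketch below. It assumes that MAD means the mean of the absolute differences between predicted and reference values and that MD is the largest such difference; the function name and sample values are hypothetical, not data from the paper.

```python
def deviation_summary(predicted, reference):
    """Return (mean absolute deviation, maximum deviation) between two value lists."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors), max(errors)


# Hypothetical enthalpies in kcal/mol, for illustration only.
predicted = [-17.5, 12.9, -57.0, 20.4]
reference = [-17.9, 12.5, -57.8, 19.8]
mad, md = deviation_summary(predicted, reference)
print(f"MAD = {mad:.2f} kcal/mol, MD = {md:.2f} kcal/mol")
```

MAD summarizes typical error while MD exposes the single worst case, so a model can have a small MAD and still carry a large MD, as the training-set figures above show.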


2020 ◽  
Author(s):  
Chunbo Kang ◽  
Xubin Li ◽  
Xiaoqian Chi ◽  
Yabin Yang ◽  
Haifeng Shan ◽  
...  

Abstract BACKGROUND Accurate preoperative prediction of complicated appendicitis (CA) could help in selecting optimal treatment and reducing the risk of postoperative complications. The study aimed to develop a machine learning model based on clinical symptoms and laboratory data for preoperatively predicting CA. METHODS 136 patients with a clinicopathological diagnosis of acute appendicitis were retrospectively included in the study. The dataset was randomly divided (94:42) into training and testing sets. Predictive models using individual and combined selected clinical and laboratory data features were built separately. Three combined models were constructed using logistic regression (LR), support vector machine (SVM) and random forest (RF) algorithms. The CA prediction performance was evaluated with Receiver Operating Characteristic (ROC) analysis, using the area under the curve (AUC), sensitivity, specificity and accuracy. RESULTS Abdominal pain duration, nausea and vomiting, the highest temperature, high-sensitivity CRP (hs-CRP) and procalcitonin (PCT) differed significantly in the CA prediction (P<0.001). The ability to predict CA from any individual feature was low (AUC<0.8). The prediction by combined features was significantly improved. The AUCs of the three models (LR, SVM and RF) were 0.805, 0.888 and 0.908 in the training set and 0.794, 0.895 and 0.761 in the testing set, respectively. The SVM-based model showed better performance for CA prediction. RF had a higher AUC in the training set, but its poor performance in the testing set indicated poor generalization ability. CONCLUSIONS The SVM machine learning model applying clinical and laboratory data can predict CA well preoperatively, which could assist diagnosis in resource-limited settings.
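For readers unfamiliar with AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The function name and the scores are hypothetical and are not the study's data or code.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical model scores: 1 = complicated appendicitis, 0 = uncomplicated.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_from_scores(labels, scores))
```

A gap between training and testing AUC computed this way, such as the one reported for RF above, is the usual sign of overfitting.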


Author(s):  
Ade Nurhopipah ◽  
Uswatun Hasanah

The performance of classification models in machine learning is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV) and Moralis Lima Martin Validation (MLMV). The comparison is done on face classification of CCTV images using a Convolutional Neural Network (CNN) algorithm and a Support Vector Machine (SVM) algorithm, and is applied to two image datasets. The results are reviewed using model accuracy on the training set, validation set and test set, as well as the bias and variance of the model. The experiment shows that the k-FCV technique has more stable performance and provides high accuracy on the training set as well as good generalization on the validation set and test set. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques, since it yields lower accuracy. This technique also shows higher bias and variance values and builds overfitted models, especially when it is applied to the validation set.
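As an illustration of the k-fold idea, the sketch below generates train and validation index lists so that every sample is used for validation exactly once. It is a minimal, unshuffled version with a hypothetical function name; the study's own implementation and fold count are not given in the abstract.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for fold, (train, validation) in enumerate(k_fold_indices(10, 3)):
    print(f"fold {fold}: train={train} validation={validation}")
```

In practice the indices are usually shuffled once before splitting, and for classification the folds are often stratified by class; both steps are omitted here for brevity.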


Author(s):  
Botao Jiang ◽  
Fuyu Zhao

Critical heat flux (CHF) is one of the most crucial design criteria in boiling systems such as evaporators, steam generators, fuel cooling systems, and boilers. This paper presents an alternative CHF prediction method named projection support vector regression (PSVR), which is a combination of the feature vector selection (FVS) method and support vector regression (SVR). In PSVR, the FVS method is first used to select a relevant subset (feature vectors, FVs) from the training data; then both the training data and the test data are projected into the subspace constructed by the FVs; finally, SVR is applied to estimate the projected data. An available CHF dataset taken from the literature is used in this paper. The CHF data are split into two subsets, the training set and the test set. The training set is used to train the PSVR model and the test set is then used to evaluate the trained model. The predicted results of PSVR are compared with those of artificial neural networks (ANNs). The parametric trends of CHF are also investigated using the PSVR model. It is found that the results of the proposed method not only fit the general understanding, but also agree well with the experimental data. Thus, PSVR can be used successfully for prediction of CHF in contrast to ANNs.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Eunyoung Emily Lee ◽  
Woochang Hwang ◽  
Kyoung-Ho Song ◽  
Jongtak Jung ◽  
Chang Kyung Kang ◽  
...  

Abstract The objective of the study was to develop and validate a prediction model that identifies COVID-19 patients at risk of requiring oxygen support based on five parameters: C-reactive protein (CRP), hypertension, age, and neutrophil and lymphocyte counts (CHANeL). This retrospective cohort study included 221 consecutive COVID-19 patients, who were randomly assigned to a training set and a test set in a ratio of 1:1. Logistic regression, logistic LASSO regression, Random Forest, Support Vector Machine, and XGBoost analyses were performed based on age, hypertension status, serial CRP, and neutrophil and lymphocyte counts during the first 3 days of hospitalization. The ability of the model to predict oxygen requirement during hospitalization was tested. During hospitalization, 45 (41.8%) patients in the training set (n = 110) and 41 (36.9%) in the test set (n = 111) required supplementary oxygen support. The logistic LASSO regression model exhibited the highest AUC for the test set, with a sensitivity of 0.927 and a specificity of 0.814. An online risk calculator for oxygen requirement using CHANeL predictors was developed. “CHANeL” prediction models based on serial CRP, neutrophil, and lymphocyte counts during the first 3 days of hospitalization, along with age and hypertension status, provide a reliable estimate of the risk of supplemental oxygen requirement among patients hospitalized with COVID-19.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Qing Ning ◽  
Dali Wang ◽  
Fei Cheng ◽  
Yuheng Zhong ◽  
Qi Ding ◽  
...  

Abstract Background Mutations in an enzyme target are one of the most common mechanisms whereby antibiotic resistance arises. Identification of the resistance mutations in bacteria is essential for understanding the structural basis of antibiotic resistance and for the design of new drugs. However, the traditional experimental approaches to identifying resistance mutations are usually labor-intensive and costly. Results We present a machine learning (ML)-based classifier for predicting rifampicin (Rif) resistance mutations in bacterial RNA Polymerase subunit β (RpoB). A total of 186 mutations were gathered from the literature for developing the classifier, using 80% of the data as the training set and the rest as the test set. The features of the mutated RpoB and their binding energies with Rif were calculated through computational methods and used as the mutation attributes for modeling. Classifiers based on five ML algorithms, i.e. decision tree, k nearest neighbors, naïve Bayes, probabilistic neural network and support vector machine, were first built, and a majority consensus (MC) approach was then used to obtain a new classifier based on the classifications of the five individual ML algorithms. The MC classifier comprehensively improved the predictive performance, with accuracy, F-measure and AUC of 0.78, 0.83 and 0.81 for the training set and 0.84, 0.87 and 0.83 for the test set, respectively. Conclusion The MC classifier provides an alternative methodology for rapid identification of resistance mutations in bacteria, which may help with early detection of antibiotic resistance and new drug discovery.
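The majority consensus step can be sketched as a simple vote across the five classifiers. The sketch assumes binary labels (1 = resistance mutation, 0 = not) and equal weight per classifier, which the abstract does not spell out; the function name and the per-model predictions are hypothetical.

```python
def majority_consensus(votes):
    """Return the label (0 or 1) predicted by more than half of the classifiers."""
    return 1 if sum(votes) > len(votes) / 2 else 0


# Hypothetical predictions for three mutations from the five model types
# named in the abstract; the values are made up for illustration.
predictions_by_model = {
    "decision_tree": [1, 0, 1],
    "k_nearest_neighbors": [1, 0, 0],
    "naive_bayes": [0, 0, 1],
    "probabilistic_neural_network": [1, 1, 0],
    "support_vector_machine": [1, 0, 0],
}
consensus = [majority_consensus(list(votes))
             for votes in zip(*predictions_by_model.values())]
print(consensus)
```

Using an odd number of classifiers means a binary vote can never tie, so no tie-breaking rule is needed.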


2020 ◽  
Vol 16 (5) ◽  
pp. 654-666 ◽  
Author(s):  
Yang Li ◽  
Yujia Tian ◽  
Yao Xi ◽  
Zijian Qin ◽  
Aixia Yan

Background: HIV-1 Integrase (IN) is an important target for the development of new anti-AIDS drugs. HIV-1 LEDGF/p75 inhibitors, which block the integrase and LEDGF/p75 interaction, have been validated for reduction in HIV-1 viral replicative capacity. Methods: In this work, computational Quantitative Structure-Activity Relationship (QSAR) models were developed for predicting the bioactivity of HIV-1 integrase LEDGF/p75 inhibitors. We collected 190 inhibitors and their bioactivities and divided the inhibitors into nine scaffolds by the method of T-distributed Stochastic Neighbor Embedding (TSNE). These 190 inhibitors were split into a training set and a test set either according to the result of a Kohonen’s self-organizing map (SOM) or randomly. Multiple Linear Regression (MLR) models, support vector machine (SVM) models and two consensus models were built on the training sets using 20 selected CORINA Symphony descriptors. Results: All the models showed a good prediction of pIC50. The correlation coefficients of all the models were more than 0.7 on the test set. For consensus Model C1, which performed better than the other models, the correlation coefficient (r) reached 0.909 on the training set and 0.804 on the test set. Conclusion: The selected molecular descriptors show that hydrogen bond acceptor, atom charges and electronegativities (especially π atom) were important in predicting the activity of HIV-1 integrase LEDGF/p75-IN inhibitors.
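The correlation coefficient used to score these models can be computed as below. The sketch assumes r is the Pearson correlation between predicted and experimental pIC50 values, which is the usual convention in QSAR work but is not stated in the abstract; the function name and values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical experimental vs. predicted pIC50 values, for illustration only.
experimental = [5.1, 6.3, 4.8, 7.0, 5.9]
predicted = [5.4, 6.0, 5.0, 6.6, 6.1]
print(f"r = {pearson_r(experimental, predicted):.3f}")
```

Comparing r on the training set against r on the test set, as the abstract does, gives a quick check on how well a model generalizes.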

