Machine learning imputation of metastatic status from open claims in melanoma patients.

2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e21540-e21540
Author(s):  
Vivek Prabhakar Vaidya ◽  
Rambaksh Prajapati ◽  
Sai Vinod Manirevu ◽  
Rohini George ◽  
Smita Agrawal ◽  
...  

e21540 Background: Metastatic status is a crucial variable in most oncology studies but is not available in claims data. The objective of this study is to develop a machine learning model for imputation of metastatic status from claims data, with ground truth derived from highly curated electronic medical record (EMR) data. Methods: We used a set of 11389 melanoma patients from the ConcertAI real world database of intersecting claims and EMR data that includes data from CancerLinQ Discovery. Using features from claims and our gold standard labels from EMR, we built an ML model using extreme gradient boosting (XGBoost), an algorithm that iteratively combines a set of decision trees into a single model. We used 60% of the data for training, 20% for hyper-parameter tuning, and 20% for holdout testing. The model was built using 55 features. Results: The table below summarizes results. Metrics are on the final holdout set, which was unseen by the model and entirely composed of highly curated EMR data. Conclusions: We are able to build a high precision model for the imputation of metastatic melanoma status using claims data. This could enable significantly better use of claims data, stemming from the ability to find a metastatic cohort with very few false positives, and provide more precise cohort identification for comparative effectiveness studies. We found features such as secondary neoplasm diagnosis, anti-neoplastic meds, and radiation ranking highly in our analysis of model feature importances. Using techniques to analyze non-linear feature interactions in our AI model, we found an interaction relationship between long term anti-neoplastic therapy, reported pain and metastatic status, which we plan to study further. This work is preliminary and we are working to further improve model performance. [Table: see text]
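
The 60/20/20 partition described in the Methods can be illustrated with a minimal sketch. The function name `split_60_20_20`, the random shuffle, and the seed are assumptions made for illustration; the abstract does not say how patients were assigned to each partition.

```python
import random

def split_60_20_20(patient_ids, seed=0):
    """Shuffle IDs and split them into train / tuning / holdout partitions."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)       # reproducible shuffle (assumed)
    n_train = int(0.6 * len(ids))          # 60% for training
    n_tune = int(0.2 * len(ids))           # 20% for hyper-parameter tuning
    train = ids[:n_train]
    tune = ids[n_train:n_train + n_tune]
    holdout = ids[n_train + n_tune:]       # remainder (about 20%) for holdout
    return train, tune, holdout

# Integer IDs stand in for the 11389 patients mentioned in the abstract.
train, tune, holdout = split_60_20_20(range(11389))
print(len(train), len(tune), len(holdout))
```

Keeping the holdout partition untouched until the end is what allows the reported metrics to be described as coming from data unseen by the model.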

Author(s):  
Kerstin Bach ◽  
Atle Kongsvold ◽  
Hilde Bårdstu ◽  
Ellen Marie Bardal ◽  
Håkon S. Kjærnli ◽  
...  

Introduction: Accelerometer-based measurements of physical activity types are commonly used to replace self-reports. To advance the field, it is desirable that such measurements allow accurate detection of key daily physical activity types. This study aimed to evaluate the performance of a machine learning classifier for detecting sitting, standing, lying, walking, running, and cycling based on dual versus single accelerometer setups during free-living. Methods: Twenty-two adults (mean age [SD, range] 38.7 [14.4, 25–68] years) wore two Axivity AX3 accelerometers positioned on the low back and thigh, along with a GoPro camera positioned on the chest to record lower body movements during free-living. The labeled videos were used as ground truth for training an eXtreme Gradient Boosting classifier using window lengths of 1, 3, and 5 s. Performance of the classifier was evaluated using leave-one-out cross-validation. Results: Total recording time was ∼38 hr. Based on 5-s windowing, the overall accuracy was 96% for the dual accelerometer setup and 93% and 84% for the single thigh and back accelerometer setups, respectively. The decreased accuracy for the single accelerometer setups was due to poor precision in detecting lying based on the thigh accelerometer recording (77%) and standing based on the back accelerometer recording (64%). Conclusion: Key daily physical activity types can be accurately detected during free-living based on dual accelerometer recording, using an eXtreme Gradient Boosting classifier. The overall accuracy decreases marginally when predictions are based on single thigh accelerometer recording, but detection of lying is poor.
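
The windowing step (1, 3, and 5 s) can be sketched as follows. The 50 Hz sampling rate, the non-overlapping windows, and the helper name `make_windows` are assumptions for illustration only; the abstract does not state the sampling rate or whether windows overlap.

```python
def make_windows(samples, sample_rate_hz, window_s):
    """Cut a signal into consecutive, non-overlapping windows of window_s seconds."""
    size = int(sample_rate_hz * window_s)
    # Incomplete trailing windows are dropped.
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

# 20 s of dummy samples at a hypothetical 50 Hz sampling rate.
signal = list(range(1000))
for window_s in (1, 3, 5):
    windows = make_windows(signal, 50, window_s)
    print(window_s, "s windows:", len(windows))
```

Each window would then be turned into features and given the activity label from the video annotation before training the classifier.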


2021 ◽  
Author(s):  
Freddy J. Marquez

Abstract Machine learning is a subfield of artificial intelligence that automatically and quickly performs mathematical calculations on data in order to build models used to make predictions. Technical papers on applications of machine learning algorithms have been increasingly published in many oil and gas disciplines over the last five years, revolutionizing the way engineers approach their work and sharing innovative solutions that contribute to increased efficiency. In this paper, machine learning models are built to predict inverse rate of penetration (ROPI) and surface torque for a well located in Gulf of Mexico shallow waters. Three types of analysis were performed. Pre-drill analysis predicted the parameters without any data from the target well in the database. Drilling analysis ran the model every sixty meters, updating the database with information from the target well and predicting the parameters ahead of the bit. Sensitivity parameter optimization analysis iterated weight on bit and rotary speed values as model inputs in order to identify the optimum combination to deliver the best drilling performance under the given conditions. The Extreme Gradient Boosting (XGBoost) library in the Python programming language was used to build the models. Model performance was satisfactory, overcoming the challenge of using drilling parameters input manually by drilling bit engineers. The database was built with data from different fields and wells. Two databases were created to build the models; one of the models did not consider logging while drilling (LWD) data, in order to determine its importance to the predictions. Pre-drill surface torque prediction showed better performance than ROPI. Predictions ahead of the bit performed well for both torque and ROPI. Sensitivity parameter optimization showed better resolution with the database that includes LWD data.
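
The sensitivity analysis, iterating weight on bit and rotary speed as model inputs, amounts to a grid search over a trained model's predictions. The sketch below is an assumption-laden illustration: `toy_ropi_model` is a hypothetical stand-in for the trained XGBoost model, the grid values and units are invented, and "best performance" is assumed to mean the lowest predicted ROPI.

```python
from itertools import product

def toy_ropi_model(wob, rpm):
    """Hypothetical stand-in for a trained model that predicts ROPI."""
    return (wob - 12.0) ** 2 / 10.0 + (rpm - 140.0) ** 2 / 500.0 + 2.0

def best_setting(predict, wob_values, rpm_values):
    """Return the (wob, rpm) pair with the lowest predicted ROPI."""
    return min(product(wob_values, rpm_values), key=lambda pair: predict(*pair))

wob_grid = [8, 10, 12, 14, 16]      # illustrative weight-on-bit values
rpm_grid = [100, 120, 140, 160]     # illustrative rotary speeds
best_wob, best_rpm = best_setting(toy_ropi_model, wob_grid, rpm_grid)
print(best_wob, best_rpm)
```

In practice the remaining model inputs (formation, depth, LWD readings where available) would be held fixed at the conditions of interest while the two controllable parameters are varied.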


2021 ◽  
Vol 13 (12) ◽  
pp. 2242
Author(s):  
Jianzhao Liu ◽  
Yunjiang Zuo ◽  
Nannan Wang ◽  
Fenghui Yuan ◽  
Xinhao Zhu ◽  
...  

The net ecosystem CO2 exchange (NEE) is a critical parameter for quantifying terrestrial ecosystems and their contributions to ongoing climate change. The accumulation of ecological data is calling for more advanced quantitative approaches for assisting NEE prediction. In this study, we applied two widely used machine learning algorithms, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), to build models for simulating NEE in major biomes based on the FLUXNET dataset. Both models accurately predicted NEE in all biomes, while XGBoost had higher computational efficiency (6~62 times faster than RF). Among environmental variables, net solar radiation, soil water content, and soil temperature are the most important variables, while precipitation and wind speed are less important variables in simulating temporal variations of site-level NEE, as shown by both models. Both models perform consistently well for extreme climate conditions. Extreme heat and dryness led to much worse model performance in grassland (extreme heat: R2 = 0.66~0.71, normal: R2 = 0.78~0.81; extreme dryness: R2 = 0.14~0.30, normal: R2 = 0.54~0.55), but the impact on forest is smaller (extreme heat: R2 = 0.50~0.78, normal: R2 = 0.59~0.87; extreme dryness: R2 = 0.86~0.90, normal: R2 = 0.81~0.85). Extreme wet conditions did not change model performance in forest ecosystems (with R2 changing −0.03~0.03 compared with normal) but led to a substantial reduction in model performance in cropland (with R2 decreasing 0.20~0.27 compared with normal). Extreme cold conditions did not lead to much change in model performance in forest and woody savannas (with R2 decreasing 0.01~0.08 and 0.09 compared with normal, respectively). Our study showed that both models need training samples at daily timesteps of >2.5 years to reach a good model performance and >5.4 years of daily samples to reach an optimal model performance. In summary, both RF and XGBoost are applicable machine learning algorithms for predicting ecosystem NEE, and the XGBoost algorithm is more feasible than RF in terms of accuracy and efficiency.
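
The R2 values quoted above compare predicted and observed NEE. A minimal sketch of the coefficient of determination follows; the abstract does not say which R2 definition was used (this form, or the squared correlation), so the formula below is an assumption, and the sample values are made up.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Made-up daily NEE values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 3))
```

Under this definition a model that always predicts the observed mean scores 0, and a perfect model scores 1.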


2021 ◽  
Author(s):  
Hung Vo-Thanh ◽  
Kang-Kun Lee

Abstract Carbon dioxide (CO2) storage in saline formations has been identified as a practical approach to reducing CO2 levels in the atmosphere. Residual and solubility trapping of CO2 in deep saline aquifers are essential mechanisms for enhancing the security of CO2 storage. In this research, CO2 residual and solubility trapping in saline formations were predicted by adapting three machine learning models: Random Forest (RF), extreme gradient boosting (XGBoost), and Support Vector Regression (SVR). To this end, a diverse field-scale simulation database of 1509 data samples retrieved from reliable studies was used to train and test the proposed models. Graphical and statistical indicators were used to evaluate and compare the predictive performance of the ML models. The results indicated that the proposed ML models rank from high to low as follows: XGBoost > RF > SVR. Additionally, the performance analyses revealed that the XGBoost model demonstrates higher accuracy in predicting CO2 trapping efficiency in saline formations than previous ML models. The XGBoost model yields very low root mean square error (RMSE) and R2 for both residual and solubility trapping efficiency. Finally, the applicability domain of the XGBoost model was validated, and only 24 suspected data points were identified in the entire databank.


Author(s):  
Tianhang Chen ◽  
Xiangeng Wang ◽  
Yanyi Chu ◽  
Dong-Qing Wei ◽  
Yi Xiong

Abstract Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and the existing computational tools based on machine learning techniques have some obvious limitations, such as the lack of interpretability in the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors using optimal features derived from protein sequences. After trying 20 different types of features, the best performance was achieved when all features were fed into XGBoost, as assessed by 5-fold cross validation in comparison with other machine learning methods. Then, the ReliefF algorithm was adopted to get the optimal feature set on our dataset, which further improved the model performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to improved understanding of multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies to construct prediction methods on related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
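
The 5-fold cross validation mentioned above can be sketched as an index generator. This is a generic illustration, not the T4SE-XGB code: the name `k_fold_indices` is hypothetical, and the sketch uses contiguous folds without the shuffling or class stratification a real pipeline would likely apply.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_indices(23, k=5):
    print(len(train_idx), len(test_idx))
```

The model is trained on each `train_idx` and scored on the matching `test_idx`, and the five scores are averaged.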


2021 ◽  
Vol 11 (11) ◽  
pp. 1055
Author(s):  
Pei-Chen Lin ◽  
Kuo-Tai Chen ◽  
Huan-Chieh Chen ◽  
Md. Mohaimenul Islam ◽  
Ming-Chin Lin

Accurate stratification of sepsis can effectively guide the triage of patient care and shared decision making in the emergency department (ED). However, previous research on sepsis identification models focused mainly on ICU patients, and discrepancies in model performance between the development and external validation datasets are rarely evaluated. The aim of our study was to develop and externally validate a machine learning model to stratify sepsis patients in the ED. We retrospectively collected clinical data from two geographically separate institutes that provided different levels of care at different time periods. The Sepsis-3 criteria were used as the reference standard in both datasets for identifying true sepsis cases. An eXtreme Gradient Boosting (XGBoost) algorithm was developed to stratify sepsis patients, and the performance of the model was compared with traditional clinical sepsis tools: quick Sequential Organ Failure Assessment (qSOFA) and Systemic Inflammatory Response Syndrome (SIRS). There were 8296 patients (1752 (21%) being septic) in the development dataset and 1744 patients (506 (29%) being septic) in the external validation dataset. The mortality of septic patients in the development and validation datasets was 13.5% and 17%, respectively. In the internal validation, XGBoost achieved an area under the receiver operating characteristic curve (AUROC) of 0.86, exceeding SIRS (0.68) and qSOFA (0.56). The performance of XGBoost deteriorated in the external validation (the AUROC of XGBoost, SIRS and qSOFA was 0.75, 0.57 and 0.66, respectively). Heterogeneity in patient characteristics, such as sepsis prevalence, severity, age, comorbidity and infection focus, could reduce model performance. Our model showed good discriminative capabilities for the identification of sepsis patients and outperformed the existing sepsis identification tools. Implementation of the ML model in the ED can facilitate timely sepsis identification and treatment. However, dataset discrepancies should be carefully evaluated before implementing the ML approach in clinical practice. This finding reinforces the necessity for future studies to perform external validation to ensure the generalisability of any developed ML approaches.
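
The AUROC figures above can be read as the probability that a randomly chosen septic patient receives a higher score than a randomly chosen non-septic patient. A minimal pairwise sketch follows; the function name and the toy labels and scores are illustrative and not taken from the study.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = septic, 0 = not septic (made-up scores).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

Because the same function can be applied unchanged to the development and external validation sets, it makes the kind of drop reported above (0.86 to 0.75) straightforward to measure.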


Data ◽  
2021 ◽  
Vol 6 (8) ◽  
pp. 80
Author(s):  
O. V. Mythreyi ◽  
M. Rohith Srinivaas ◽  
Tigga Amit Kumar ◽  
R. Jayaganthan

This research work focuses on machine-learning-assisted prediction of the corrosion behavior of laser-powder-bed-fused (LPBF) and postprocessed Inconel 718. Corrosion testing data of these specimens were collected and fitted with the following machine learning algorithms: polynomial regression, support vector regression, decision tree, and extreme gradient boosting. The model performance, after hyperparameter optimization, was evaluated using a set of established metrics: R2, mean absolute error, and root mean square error. Among the algorithms, the extreme gradient boosting algorithm performed best in predicting the corrosion behavior, closely followed by the other algorithms. Feature importance analysis was carried out to determine the postprocessing parameters that most influenced the corrosion behavior of Inconel 718 manufactured by LPBF.
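
Two of the evaluation metrics named above, mean absolute error and root mean square error, can be written in a few lines. The sketch below uses made-up numbers rather than the corrosion data, and the function names are illustrative.

```python
import math

def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def root_mean_square_error(observed, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]   # one large miss
print(mean_absolute_error(observed, predicted))     # 1.0
print(root_mean_square_error(observed, predicted))  # 2.0
```

The single large miss in this example raises RMSE more than MAE, which is why the two metrics are often reported together.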


2021 ◽  
Vol 9 (2) ◽  
pp. 156
Author(s):  
Jian He ◽  
Yong Hao ◽  
Xiaoqiong Wang

Sound decisions on ship detention play a vital role in flag state control (FSC). Machine learning algorithms can be applied as aids for identifying ship detention. In this study, we propose a novel interpretable ship detention decision-making model based on machine learning, termed the SMOTE-XGBoost-Ship detention model (SMO-XGB-SD), which uses the extreme gradient boosting (XGBoost) algorithm and the synthetic minority oversampling technique (SMOTE) algorithm to identify whether a ship should be detained. Our verification results show that the SMO-XGB-SD algorithm outperforms the random forest (RF), support vector machine (SVM), and logistic regression (LR) algorithms. In addition, the new algorithm also provides a reasonable interpretation of model performance and highlights the most important features for identifying ship detention using the Shapley additive explanations (SHAP) algorithm. The SMO-XGB-SD model provides an effective basis for aiding decisions on ship detention by inland flag state control officers (FSCOs) and the ship safety management of ship operating companies, as well as training services for new FSCOs in maritime organizations.
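
The idea behind SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified pure-Python sketch of that idea, not the implementation used in the paper; the function name, the choice of `k`, and the toy points are assumptions.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Toy minority class (for example, detained ships) with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=4)
print(len(new_points))
```

Oversampling of this kind is normally applied to the training data only, so that evaluation still reflects the true class balance.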


2019 ◽  
Author(s):  
Kasper Van Mens ◽  
Joran Lokkerbol ◽  
Richard Janssen ◽  
Robert de Lange ◽  
Bea Tiemens

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study we compare machine learning algorithms to predict, during treatment, which patients will not benefit from brief mental health treatment, and present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not significantly differ on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (ppv) of 0.71 (0.61 – 0.77) with a sensitivity of 0.35 (0.29 – 0.41) and area under the curve of 0.78. A trade-off can be made between ppv and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the ppv increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off at 0.38, the ppv decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data. This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.
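
The trade-off between ppv and sensitivity at different cut-offs can be reproduced on toy data. In the sketch below the labels and probabilities are made up, and treating a probability equal to the cut-off as a positive prediction is an assumption.

```python
def ppv_and_sensitivity(labels, probabilities, cutoff=0.5):
    """Return (ppv, sensitivity) when predicting positive at p >= cutoff."""
    pairs = list(zip(labels, probabilities))
    tp = sum(1 for y, p in pairs if p >= cutoff and y == 1)
    fp = sum(1 for y, p in pairs if p >= cutoff and y == 0)
    fn = sum(1 for y, p in pairs if p < cutoff and y == 1)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return ppv, sensitivity

# Made-up example: 1 = patient does not benefit from treatment.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.55, 0.3, 0.6, 0.4, 0.2, 0.1]
for cutoff in (0.25, 0.5, 0.65):
    print(cutoff, ppv_and_sensitivity(labels, probs, cutoff))
```

Raising the cut-off makes the classifier more selective (higher ppv, lower sensitivity), and lowering it does the opposite, which mirrors the pattern reported in the Results.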


2021 ◽  
Vol 13 (5) ◽  
pp. 1021
Author(s):  
Hu Ding ◽  
Jiaming Na ◽  
Shangjing Jiang ◽  
Jie Zhu ◽  
Kai Liu ◽  
...  

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. Because contextual information is important for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies is widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three commonly used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.

