Automated Screening of Emergency Department Notes for Drug-Associated Bleeding Adverse Events Occurring in Older Adults

2017 ◽  
Vol 08 (04) ◽  
pp. 1022-1030 ◽  
Author(s):  
Richard Boyce ◽  
Jeremy Jao ◽  
Taylor Miller ◽  
Sandra Kane-Gill

Objective: To demonstrate the value of text mining for automatically identifying suspected bleeding adverse drug events (ADEs) in the emergency department (ED).
Methods: A corpus of ED admission notes was manually annotated for bleeding ADEs. The notes were drawn from patients ≥ 65 years of age who had an ICD-9 code for bleeding, a hemoglobin value ≤ 8 g/dL, or a transfusion of > 2 units of packed red blood cells. This training corpus was used to develop bleeding-ADE classification algorithms using Random Forest and Classification and Regression Tree (CART) methods. A completely separate set of notes was annotated and used to test the classification performance of the final models using the area under the ROC curve (AUROC).
Results: The best-performing CART achieved an AUROC of 0.882 on the training set and 0.827 on the test set. At a sensitivity of 0.679, the model had a specificity of 0.908 and a positive predictive value (PPV) of 0.814. Its structure was relatively simple and intuitive, consisting of 13 decision nodes and 14 leaf nodes, with decision-path probabilities ranging from 0.041 to 1.0. The best-performing Random Forest achieved an AUROC of 0.917 on the training set and 0.859 on the test set. At a sensitivity of 0.274, the model had a specificity of 0.986 and a PPV of 0.92.
Conclusion: Both models accurately identify bleeding ADEs from the presence or absence of certain clinical concepts in ED admission notes for older adult patients. The CART model is particularly noteworthy because it requires little technical overhead to implement. Future work should seek to replicate these results on a larger test set drawn from another institution.
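The CART step described above can be sketched with scikit-learn's DecisionTreeClassifier. The paper does not publish its features or hyperparameters, so the binary concept-presence matrix, the simulated labels, and the 14-leaf cap below are illustrative assumptions:

```python
# Sketch of a CART bleeding-ADE classifier: each row is a note, each column a
# clinical concept flagged present (1) or absent (0). All data are simulated.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_notes, n_concepts = 500, 40
X = rng.integers(0, 2, size=(n_notes, n_concepts))  # concept present/absent
# simulate labels that depend on a few concepts, as a bleeding-ADE proxy
y = ((X[:, 0] + X[:, 1] + X[:, 2]) >= 2).astype(int)
y ^= rng.random(n_notes) < 0.1  # add 10% label noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# cap the tree at 14 leaves, matching the size the paper reports
cart = DecisionTreeClassifier(max_leaf_nodes=14, random_state=0)
cart.fit(X_train, y_train)
auroc = roc_auc_score(y_test, cart.predict_proba(X_test)[:, 1])
print(f"test AUROC: {auroc:.3f}")
```

Evaluating with `predict_proba` rather than hard predictions is what makes an AUROC (rather than a single sensitivity/specificity point) available.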

SPE Journal ◽  
2018 ◽  
Vol 23 (04) ◽  
pp. 1075-1089 ◽  
Author(s):  
Jared Schuetter ◽  
Srikanta Mishra ◽  
Ming Zhong ◽  
Randy LaFollette (ret.)

Summary: Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is to provide some clarity to this issue, focusing on how to build robust predictive models and how to develop decision rules that help identify factors separating good wells from poor performers. The data for this study come from wells completed in the Wolfcamp Shale Formation in the Permian Basin. Data categories used in the study included well location and assorted metrics capturing various aspects of well architecture, well completion, stimulation, and production. Predictive models for the production metric of interest are built using simple regression and other advanced methods such as random forests (RFs), support-vector regression (SVR), gradient-boosting machines (GBMs), and multidimensional kriging. The data-fitting process involves splitting the data into a training set and a test set, building a regression model on the training set, and validating it with the test set. Repeated application of a cross-validation procedure yields valuable information regarding the robustness of each regression-modeling approach. Furthermore, decision rules that can identify extreme behavior in production wells (i.e., the top x% of wells vs. the bottom x%, as ranked by the production metric) are generated using the classification and regression-tree algorithm. The resulting decision tree (DT) provides useful insights into which variables (or combinations of variables) can drive production performance into such extreme categories. The main contributions of this paper are to provide guidelines on how to build robust predictive models and to demonstrate the utility of DTs for identifying factors responsible for good vs. poor wells.
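The split-fit-validate workflow with repeated cross-validation can be sketched as follows. The Wolfcamp dataset is not public here, so the synthetic features are stand-ins, and the model settings are untuned defaults rather than the paper's choices:

```python
# Sketch: compare several regressors for a well-production metric using a
# train/test split plus 5-fold cross-validation on the training set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(1)
n_wells = 300
X = rng.normal(size=(n_wells, 4))  # standardized completion/location metrics
# simulated production metric with one linear and one nonlinear driver
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n_wells)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
models = {
    "RF":  RandomForestRegressor(n_estimators=200, random_state=1),
    "GBM": GradientBoostingRegressor(random_state=1),
    "SVR": SVR(C=10.0),
}
results = {}
for name, model in models.items():
    cv_r2 = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2")
    results[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: CV R2 = {cv_r2.mean():.2f} +/- {cv_r2.std():.2f}, "
          f"test R2 = {results[name]:.2f}")
```

The spread of the cross-validation scores (the `std` term) is what the paper uses to judge robustness, not just the mean fit quality.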


2019 ◽  
Vol 11 (8) ◽  
pp. 2306 ◽  
Author(s):  
Jian Han ◽  
Miaodan Fang ◽  
Shenglu Ye ◽  
Chuansheng Chen ◽  
Qun Wan ◽  
...  

Response rate has long been a major concern in survey research, which is commonly used in fields such as marketing, psychology, sociology, and public policy. Based on 244 published survey studies on consumer satisfaction, loyalty, and trust, this study aimed to identify factors that predict response rates. Results showed that response rates were associated with the mode of data collection (face-to-face > mail/telephone > online), type of survey sponsor (government agencies > universities/research institutions > commercial entities), confidentiality (confidential > non-confidential), direct invitation (yes > no), and cultural orientation (individualism > collectivism). A decision tree regression analysis (using the Classification and Regression Tree (C&RT) algorithm, with 80% of the studies as the training set and 20% as the test set) revealed that a model with all the above-mentioned factors attained a linear correlation coefficient of 0.578 between predicted and actual values, higher than the corresponding coefficient of the traditional linear regression model (0.423). A decision tree analysis (using the C5.0 algorithm with the same 80/20 split) revealed that a model with all the above-mentioned factors attained an overall accuracy of 78.26% in predicting whether a survey had a high (>50%) or low (<50%) response rate. Direct invitation was the most important factor in all three models and showed a consistent trend in predicting response rate.
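The high/low classification step could be sketched as below. Scikit-learn ships no C5.0 implementation, so a plain DecisionTreeClassifier stands in, and the encoded predictors and labels are simulated to echo the paper's five factors (with direct invitation weighted most heavily, mirroring its finding):

```python
# Sketch: classify surveys as high (>50%) vs. low response rate from five
# ordinally encoded study-design factors. All data are simulated.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n = 244
# encodings: mode (0=online, 1=mail/telephone, 2=face-to-face), sponsor
# (0=commercial, 1=university, 2=government), confidential, direct_invitation,
# individualist (the last three are 0/1)
X = np.column_stack([
    rng.integers(0, 3, n), rng.integers(0, 3, n),
    rng.integers(0, 2, n), rng.integers(0, 2, n), rng.integers(0, 2, n),
])
# direct invitation (column 3) dominates the simulated response-rate score
score = 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 2] + 1.0 * X[:, 3] + 0.2 * X[:, 4]
y = (score + rng.normal(scale=0.4, size=n) > score.mean()).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)
clf = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"test accuracy: {acc:.2%}")
```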


2018 ◽  
Vol 7 (2.21) ◽  
pp. 339 ◽  
Author(s):  
K Ulaga Priya ◽  
S Pushpa ◽  
K Kalaivani ◽  
A Sartiha

In the banking industry, loan processing involves the tedious task of identifying customers likely to default; manually screening applicants risks approving loans that turn bad in the future. Banks possess huge volumes of behavioral data but struggle to turn them into judgments about likely loan defaulters. Modern machine-learning techniques can support this analysis through supervised and unsupervised learning. This paper proposes a data model for predicting defaulting customers using the Random Forest technique. The model is evaluated on a training set and, based on the resulting performance parameters, the final prediction is made on the test set. The results indicate that the Random Forest technique can help banks predict loan defaulters with high accuracy.
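The proposed pipeline (train a Random Forest on behavioral features, then evaluate on a held-out test set) might look roughly like this. The feature names and data are illustrative assumptions, since the paper's dataset is not published:

```python
# Sketch: Random Forest loan-default prediction on simulated behavioral data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(3)
n = 1000
income = rng.normal(50, 15, n)          # hypothetical features
debt_ratio = rng.uniform(0, 1, n)
late_payments = rng.poisson(1.0, n)
X = np.column_stack([income, debt_ratio, late_payments])
# simulated rule: defaults driven by high debt ratio plus repeated late payments
y = (((debt_ratio > 0.6) & (late_payments >= 2))
     | (rng.random(n) < 0.05)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3)
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te),
                            target_names=["repaid", "default"]))
```

Stratifying the split preserves the default rate in both sets, which matters because defaulters are typically a small minority of borrowers.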


2019 ◽  
Vol 11 (5) ◽  
pp. 1327 ◽  
Author(s):  
Bei Zhou ◽  
Zongzhi Li ◽  
Shengrui Zhang ◽  
Xinfen Zhang ◽  
Xin Liu ◽  
...  

Hit-and-run (HR) crashes are crashes in which the driver of the offending vehicle flees the incident scene without aiding possible victims or informing authorities for emergency medical services. This paper aims at identifying significant predictors of HR and non-hit-and-run (NHR) outcomes in vehicle-bicycle crashes based on the classification and regression tree (CART) method. An oversampling technique is applied to deal with the data imbalance problem, in which the number of minority instances (HR crashes) is much lower than that of majority instances (NHR crashes). Police-reported data from the City of Chicago covering September 2017 to August 2018 were collected. The G-mean (geometric mean) is used to evaluate classification performance. Results indicate that, compared with the original CART model, the G-mean of the CART model incorporating data imbalance treatment increases from 23% to 61%, a relative improvement of 171%. The decision tree reveals that five variables play the most important roles in classifying HR and NHR vehicle-bicycle crashes: driver age, bicyclist safety equipment, driver action, trafficway type, and driver gender. Several countermeasures are recommended accordingly. The current study demonstrates that, by incorporating data imbalance treatment, the CART method can provide much more robust classification results.
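The imbalance treatment can be sketched as random oversampling of the minority (HR) class in the training set before fitting CART, scored with the G-mean of per-class recalls. The specific oversampling method and the simulated crash features below are assumptions, not the paper's exact procedure:

```python
# Sketch: oversample the rare HR class to parity, fit CART, score by G-mean.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

def g_mean(y_true, y_pred):
    sens = recall_score(y_true, y_pred, pos_label=1)  # recall on HR class
    spec = recall_score(y_true, y_pred, pos_label=0)  # recall on NHR class
    return float(np.sqrt(sens * spec))

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 5))  # stand-ins for driver age, equipment, action, ...
p_hr = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] - 2.5)))  # rare positive class
y = (rng.random(n) < p_hr).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=4)
# replicate random minority-class rows until both classes are equal in size
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

cart = DecisionTreeClassifier(max_depth=5, random_state=4).fit(X_bal, y_bal)
gm = g_mean(y_te, cart.predict(X_te))
print(f"G-mean with oversampling: {gm:.2f}")
```

The G-mean penalizes a classifier that achieves high accuracy simply by predicting the majority (NHR) class, which is why it suits imbalanced crash data.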


Author(s):  
Wang Zongbao

Distributed power generation in Gansu Province is dominated by wind and photovoltaic power. Most of these distributed power plants are located in underdeveloped areas; because local consumption capacity is weak, the distributed electricity is mainly transmitted for consumption elsewhere. A key indicator affecting ultra-long-distance power transmission is line loss. It is an important indicator of the economic operation of the power system, and it also comprehensively reflects the planning, design, production, and operation level of power companies. However, most current research on line loss focuses on ultra-high voltage (≥ 110 kV), with little attention to distributed generation lines below 110 kV. In this study, 35 kV and 110 kV lines are taken as examples. First, combining existing weather, equipment, operation, and power-outage data, we compile an analysis table of line-loss impact factors. Second, we analyze the factors affecting line loss from the perspectives of feature relevance and feature importance, and retain the factors ranking highest on both; these are taken as the final line-loss influence factors in the experiments. Then, based on these factors, an optimized random forest regression algorithm is used to construct the line-loss prediction model. Verification shows a training-set error of 0.021 and a test-set error of 0.026, a gap of only 0.005. The experimental results show that the optimized random forest algorithm can analyze the line loss of 35 kV and 110 kV lines well, and can also reasonably explain the performance of 110-EaR1120.
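The factor-ranking and prediction step can be sketched as below, using random-forest feature importances to rank candidate factors before reporting train/test error. The factor names and grid data are invented for illustration; the paper's dataset and its "optimized" forest settings are not given here:

```python
# Sketch: rank line-loss influence factors by feature importance, then report
# training and test error of a random-forest regressor. All data are simulated.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(5)
n = 800
features = ["load_current", "line_length_km", "temperature",
            "outage_hours", "voltage_kv"]  # hypothetical factor names
X = rng.normal(size=(n, len(features)))
# simulated line loss, driven mainly by load current (I^2-like term)
y = 0.02 * X[:, 0] ** 2 + 0.01 * X[:, 1] + rng.normal(scale=0.005, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=5)
rf = RandomForestRegressor(n_estimators=300, random_state=5).fit(X_tr, y_tr)
ranked = sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1])
train_err = mean_absolute_error(y_tr, rf.predict(X_tr))
test_err = mean_absolute_error(y_te, rf.predict(X_te))
print(f"top factor: {ranked[0][0]}; "
      f"train MAE {train_err:.3f}, test MAE {test_err:.3f}")
```

A small gap between the training and test errors, as the paper reports (0.021 vs. 0.026), is the usual sign that the forest is not badly overfitting.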

