Empirical Measurement of Performance Maintenance of Gradient Boosted Decision Tree Models for Malware Detection

Author(s):  
Colin Galen ◽  
Robert Steele
2020 ◽  
Vol 34 (04) ◽  
pp. 5478-5486 ◽  
Author(s):  
Francesco Ranzato ◽  
Marco Zanella

We study the problem of formally and automatically verifying robustness properties of decision tree ensemble classifiers such as random forests and gradient boosted decision tree models. A recent stream of works showed how abstract interpretation, which is ubiquitously used in static program analysis, can be successfully deployed to formally verify (deep) neural networks. In this work we push forward this line of research by designing a general and principled abstract interpretation-based framework for the formal verification of robustness and stability properties of decision tree ensemble models. Our abstract interpretation-based method may induce complete robustness checks against standard adversarial perturbations and output concrete adversarial attacks. We implemented our abstract verification technique in a tool called silva, which leverages an abstract domain of not necessarily closed real hyperrectangles and is instantiated to verify random forests and gradient boosted decision trees. Our experimental evaluation on the MNIST dataset shows that silva provides a precise and efficient tool which advances the current state of the art in tree ensemble verification.
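To make the hyperrectangle idea concrete, here is a minimal Python sketch of interval propagation through a single decision tree for an L-infinity robustness check. The node layout, names, and single-tree scope are assumptions for illustration; silva itself verifies whole ensembles over not necessarily closed hyperrectangles.

```python
# Minimal sketch of hyperrectangle propagation through one decision tree,
# illustrating the interval-abstraction idea behind tools like silva.
# Node layout and the single-tree scope are simplifying assumptions.

from dataclasses import dataclass

@dataclass
class Node:
    feature: int = -1      # -1 marks a leaf
    threshold: float = 0.0
    left: "Node" = None    # taken when x[feature] <= threshold
    right: "Node" = None
    label: int = -1        # class label at a leaf

def reachable_labels(node, box):
    """Collect every leaf label reachable from an input hyperrectangle.

    box is a list of (lo, hi) intervals, one per feature.
    """
    if node.feature == -1:
        return {node.label}
    lo, hi = box[node.feature]
    labels = set()
    if lo <= node.threshold:   # left branch is feasible for some point in the box
        labels |= reachable_labels(node.left, box)
    if hi > node.threshold:    # right branch is feasible as well
        labels |= reachable_labels(node.right, box)
    return labels

def is_robust(tree, x, eps, true_label):
    """Sound (and, for a single tree, complete) L-infinity robustness check."""
    box = [(xi - eps, xi + eps) for xi in x]
    return reachable_labels(tree, box) == {true_label}

# Usage: a stump splitting on feature 0 at 0.5.
stump = Node(feature=0, threshold=0.5,
             left=Node(label=0), right=Node(label=1))
print(is_robust(stump, x=[0.3], eps=0.1, true_label=0))   # True
print(is_robust(stump, x=[0.45], eps=0.1, true_label=0))  # False: box crosses the split
```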


We put forward a tree regularization framework that enables many tree models to perform feature selection efficiently. The key idea of the regularization scheme is to penalize the selection of a new feature for a split when its gain is similar to that of features used in previous splits. This paper used a standard data set as the original discrete test data, and the entropy and information gain of each attribute were computed to carry out the classification of the data. Boosted decision trees are among the most popular learning methods in use today. In addition, this paper arrived at an optimized decision tree structure, streamlined to improve the efficiency of the algorithm while guaranteeing a low error rate at a level similar to that of other classification algorithms.
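As a concrete illustration of the entropy and information-gain computations described above, here is a short Python sketch; the helper names, the toy data, and the penalty value are assumptions, not the paper's implementation.

```python
# Sketch of the entropy / information-gain computations referenced above,
# plus the tree-regularization idea of discounting unused features.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) = -sum p_i * log2(p_i) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Gain(Y, A) = H(Y) - sum_v (|Y_v|/|Y|) * H(Y_v) for a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(row[feature] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[feature] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def penalized_gain(rows, labels, feature, used_features, penalty=0.1):
    """Tree-regularization idea from the paragraph above: discount the gain
    of a feature not used in earlier splits (penalty value is an assumption)."""
    g = information_gain(rows, labels, feature)
    return g if feature in used_features else g - penalty

# Toy data: feature 0 perfectly predicts the label, feature 1 is noise.
rows   = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
print(information_gain(rows, labels, 0))  # 1.0
print(information_gain(rows, labels, 1))  # 0.0
```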




2013 ◽  
Vol 31 (15_suppl) ◽  
pp. 6553-6553 ◽  
Author(s):  
Jeffrey A. Scott ◽  
Scott Milligan ◽  
Winston Wong ◽  
Daniel Winn ◽  
Joseph Cooper ◽  
...  

6553 Background: Oncology clinical pathways have been suggested as a way to decrease cancer treatment variation and costs. CareFirst BlueCross BlueShield (CFBCBS) partnered with Cardinal Health Specialty Solutions to launch the first cancer clinical pathway in the US in Aug 2008. Savings from that program were reported by Scott et al, ASCO 2010. The purpose of this study was to obtain third-party validation of the observed savings of this pathways program. Methods: We used CFBCBS claims data from Jan 2007 to Dec 2010 to identify patients (pts) with breast, colon, or lung cancer who were treated by physicians participating in the pathways program. We used Truven Health’s MarketScan database to retrospectively identify a control group treated by non-institutional physicians in a similar geographic region outside the CFBCBS network. We further balanced the groups using propensity score weighting to align primary diagnosis and demographics. The primary outcome was the sum of allowed cancer costs for 270 days after a patient’s first chemotherapy treatment. A secondary outcome was the probability of an inpatient (inpt) admission over the same time period. Multiple generalized linear models were fit for sensitivity testing, and boosted decision tree models were also used to capture nonlinearities and interactions. Both types of models used the propensity score weights. All savings estimates were based on comparing trends between cohorts. Results: A total of 2424 CFBCBS pts were included in the analysis. The aligned control group consisted of 1490 pts. The treatment coefficient from the linear model for the primary outcome was -0.16 with a z-value of -3, which translated to a savings estimate of 15% for the program. The treatment coefficient from the logistic model for the secondary outcome of inpt admission reduction was -0.29 with a z-value of -2.5, which translated to a 7% reduction (from 50% to 43%) in hospital admissions. The boosted decision tree models confirmed the results, though at a more moderate magnitude. Conclusions: We conclude that the CFBCBS pathways program saved upwards of 15% on cancer-related claims costs with a 7% reduction in the probability of an inpt admission. These findings are consistent with those previously presented and peer reviewed.
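For readers unfamiliar with the weighting step, the following is a hedged Python sketch of inverse-probability-of-treatment weighting with a logistic propensity model, followed by a weighted outcome regression; the synthetic data, covariates, and effect size are illustrative assumptions, not the study's specification.

```python
# Illustrative sketch of propensity-score weighting as described above
# (IPTW with a logistic model); all data here are synthetic assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                              # diagnosis / demographics
treated = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # pathway participation

# 1. Fit a propensity model P(treated | covariates).
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. Inverse-probability weights align the two cohorts on covariates.
w = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# 3. Weighted outcome model, e.g. log-cost on treatment (weighted least squares);
#    the study additionally fit logistic and boosted-tree models with these weights.
log_cost = 10 - 0.16 * treated + rng.normal(scale=1.0, size=n)
Xd = np.column_stack([np.ones(n), treated])
beta = np.linalg.lstsq(Xd * w[:, None] ** 0.5, log_cost * w ** 0.5, rcond=None)[0]
print(f"treatment coefficient: {beta[1]:.2f}")  # around -0.16, i.e. roughly 15% savings
```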


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Nicholas Garside ◽  
Hamed Zaribafzadeh ◽  
Ricardo Henao ◽  
Royce Chung ◽  
Daniel Buckland

Methods used to predict surgical case time often rely upon the current procedural terminology (CPT) code as a nominal variable to train machine-learned models; however, this limits the ability of the model to incorporate new procedures and adds complexity as the number of unique procedures increases. The relative value unit (RVU, a consensus-derived billing indicator) can serve as a proxy for procedure workload and could replace the CPT code as a primary feature for models that predict surgical case length. Using 11,696 surgical cases from Duke University Health System electronic health records data, we compared boosted decision tree models that predict individual case length, changing the method by which the model encoded procedure type: CPT, RVU, or CPT and RVU combined. Performance of each model was assessed by inference time, mean absolute error (MAE), and root mean squared error (RMSE) against the actual case length on a test set. Models were compared to each other and to the manual scheduler method currently in use. RMSE for the RVU model (60.8 min) was similar to the CPT model (61.9 min), both of which were lower than the scheduler (90.2 min). 65.2% of our RVU model’s predictions (compared to 43.2% from the current human scheduler method) fell within 20% of actual case time. Using RVUs reduced model prediction time ninefold and reduced the number of training features from 485 to 44. Replacing pre-operative CPT codes with RVUs maintains model performance while decreasing overall model complexity in the prediction of surgical case length.
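The trade-off the abstract describes, one numeric RVU feature versus hundreds of one-hot CPT columns, can be sketched as follows; the synthetic data and modeling choices are assumptions, not the Duke pipeline.

```python
# Hedged sketch of the RVU-vs-CPT comparison described above: a boosted tree
# fed one continuous workload feature versus many one-hot procedure columns.
# All data below are synthetic assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(1)
n, n_cpt = 5000, 200
cpt = rng.integers(0, n_cpt, size=n)                  # nominal procedure code
rvu = 5 + 0.2 * cpt + rng.normal(scale=1.0, size=n)   # workload proxy per code
minutes = 30 + 8 * rvu + rng.normal(scale=20.0, size=n)

X_cpt = np.eye(n_cpt)[cpt]      # one one-hot column per code: feature count grows
X_rvu = rvu.reshape(-1, 1)      # a single continuous feature: new codes fit naturally

tr, te = slice(0, 4000), slice(4000, n)
for name, X in [("CPT", X_cpt), ("RVU", X_rvu)]:
    model = GradientBoostingRegressor().fit(X[tr], minutes[tr])
    pred = model.predict(X[te])
    rmse = mean_squared_error(minutes[te], pred) ** 0.5
    print(f"{name}: MAE={mean_absolute_error(minutes[te], pred):.1f} min, "
          f"RMSE={rmse:.1f} min, {X.shape[1]} features")
```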


2021 ◽  
Vol 54 (1) ◽  
pp. 1-38
Author(s):  
Víctor Adrián Sosa Hernández ◽  
Raúl Monroy ◽  
Miguel Angel Medina-Pérez ◽  
Octavio Loyola-González ◽  
Francisco Herrera

Experts from different domains have resorted to machine learning techniques to produce explainable models that support decision-making. Among existing techniques, decision trees have been useful for classification in many application domains. Decision trees can make decisions in a language that is close to that of the experts. Many researchers have attempted to create better decision tree models by improving the components of the induction algorithm. One of the main components that has been studied and improved is the evaluation measure for candidate splits. In this article, we introduce a tutorial that explains decision tree induction. Then, we present an experimental framework to assess the performance of 21 evaluation measures that produce different C4.5 variants, considering 110 databases, two performance measures, and 10×10-fold cross-validation. Furthermore, we compare and rank the evaluation measures by using a Bayesian statistical analysis. From our experimental results, we present the first two performance rankings of C4.5 variants in the literature. Moreover, we organize the evaluation measures into two groups according to their performance. Finally, we introduce meta-models that automatically determine the group of evaluation measures to use when producing a C4.5 variant for a new database, and we discuss further opportunities for decision tree models.
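To make "evaluation measures for candidate splits" concrete, here is a small sketch of two such measures, information gain and C4.5's gain ratio, computed from class counts; the interface and example counts are assumptions.

```python
# Two of the many candidate-split evaluation measures the article benchmarks,
# written out for class-count lists; a compact sketch under assumed interfaces.
import math

def _entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def information_gain(parent, children):
    """parent/children are class-count lists, e.g. parent=[6,4], children=[[4,1],[2,3]]."""
    n = sum(parent)
    remainder = sum(sum(ch) / n * _entropy(ch) for ch in children)
    return _entropy(parent) - remainder

def gain_ratio(parent, children):
    """C4.5's measure: information gain normalised by the split's own entropy,
    which discounts splits into many small branches."""
    split_info = _entropy([sum(ch) for ch in children])
    g = information_gain(parent, children)
    return g / split_info if split_info else 0.0

parent = [6, 4]
children = [[4, 1], [2, 3]]
print(information_gain(parent, children))  # ~0.125 bits
print(gain_ratio(parent, children))        # same here, since split_info = 1.0
```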


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2849
Author(s):  
Sungbum Jun

Due to the recent advances in the industrial Internet of Things (IoT) in manufacturing, the vast amount of data from sensors has triggered the need to leverage such big data for fault detection. In particular, interpretable machine learning techniques, such as tree-based algorithms, have drawn attention as a means to implement reliable manufacturing systems and to identify the root causes of faults. However, despite the high interpretability of single decision trees, tree-based models face a trade-off between accuracy and interpretability. In order to improve the tree’s performance while maintaining its interpretability, an evolutionary algorithm for the discretization of multiple attributes, called Decision tree Improved by Multiple sPLits with Evolutionary algorithm for Discretization (DIMPLED), is proposed. Experimental results on two real-world sensor datasets showed that the decision tree improved by DIMPLED outperformed the single-decision-tree models (C4.5 and CART) that are widely used in practice, and proved competitive with ensemble methods that combine multiple decision trees. Even though the ensemble methods could produce slightly better performance, the tree produced by DIMPLED has a more interpretable structure while maintaining an appropriate performance level.
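A toy version of evolutionary threshold search conveys the flavor of DIMPLED, though the real algorithm discretizes with multiple splits per attribute; the population size, mutation scale, and fitness function below are assumptions for illustration.

```python
# Toy evolutionary search over discretization thresholds, in the spirit of
# (but far simpler than) DIMPLED; all design choices here are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))                          # synthetic sensor readings
y = ((X[:, 0] > 0.3) & (X[:, 1] < -0.2)).astype(int)   # synthetic fault flag

def discretize(X, thresholds):
    """Bin each attribute by its candidate cut point (one per attribute here)."""
    return (X > thresholds).astype(int)

def fitness(thresholds):
    """Score a discretization by the cross-validated accuracy of a small tree."""
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    return cross_val_score(tree, discretize(X, thresholds), y, cv=5).mean()

# (mu + lambda)-style loop: mutate the best individual each generation.
best = rng.normal(size=X.shape[1])
for gen in range(30):
    candidates = [best + rng.normal(scale=0.2, size=best.shape) for _ in range(10)]
    best = max(candidates + [best], key=fitness)

print("best thresholds:", np.round(best, 2), "CV accuracy:", round(fitness(best), 3))
```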


2021 ◽  
Author(s):  
Thomas Weripuo Gyeera

The National Institute of Standards and Technology defines the fundamental characteristics of cloud computing as: on-demand computing, offered via the network, using pooled resources, with rapid elastic scaling and metered charging. The rapid dynamic allocation and release of resources on demand to meet heterogeneous computing needs is particularly challenging for data centres, which process a huge amount of data characterised by its high volume, velocity, variety and veracity (the 4Vs model). Data centres seek to regulate this by monitoring and adaptation, typically reacting to service failures after the fact. We present a real cloud test bed with the capability to proactively monitor and gather cloud resource information for making predictions and forecasts. This contrasts with the state-of-the-art reactive monitoring of cloud data centres. We argue that the behavioural patterns and Key Performance Indicators (KPIs) characterising virtualized servers, networks, and database applications are best studied and analysed with predictive models. Specifically, we applied the boosted decision tree machine learning algorithm to make future predictions on the KPIs of a cloud server and virtual infrastructure network, yielding an R-squared of 0.9991 at a 0.2 learning rate. This predictive framework is beneficial for making short- and long-term predictions for cloud resources.
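As a rough illustration of the kind of model described, here is a hedged sketch of a boosted-tree KPI forecaster on lagged samples, using the paper's 0.2 learning rate and an R-squared score; the synthetic KPI series, lag window, and estimator choice are assumptions.

```python
# Hedged sketch of a boosted-tree KPI forecaster: lagged KPI samples as
# features, learning_rate=0.2 as in the paper, R-squared as the score.
# The synthetic series and lag window below are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
t = np.arange(2000)
kpi = 50 + 10 * np.sin(t / 50) + rng.normal(scale=0.5, size=t.size)  # e.g. CPU %

lags = 12  # predict the next sample from the previous 12 monitoring intervals
X = np.stack([kpi[i:i + lags] for i in range(len(kpi) - lags)])
y = kpi[lags:]

split = int(0.8 * len(y))   # train on the past, evaluate on the future
model = GradientBoostingRegressor(learning_rate=0.2, n_estimators=200)
model.fit(X[:split], y[:split])
print("R-squared:", round(r2_score(y[split:], model.predict(X[split:])), 4))
```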

