Interpretable decision-tree induction in a big data parallel framework

Abstract: When running data-mining algorithms on big data platforms, a parallel, distributed framework, such as MapReduce, may be used. However, in a parallel framework, each individual model fits the data allocated to its own computing node without necessarily fitting the entire dataset. In order to induce a single consistent model, ensemble algorithms, such as majority voting, aggregate the local models rather than analyzing the entire dataset directly. Our goal is to develop an efficient algorithm for choosing one representative model from multiple, locally induced decision-tree models. The proposed SySM (syntactic similarity method) algorithm computes the similarity between the models produced by parallel nodes and chooses the model that is most similar to the others as the best representative of the entire dataset. In 18.75% of 48 experiments on four big datasets, SySM accuracy is significantly higher than that of the ensemble; in 43.75% of the experiments, SySM accuracy is significantly lower; in one case, the results are identical; and in the remaining 35.41% of cases the difference is not statistically significant. Compared with ensemble methods, the representative tree models selected by the proposed methodology are more compact and interpretable, their induction consumes less memory, and, as confirmed by the empirical results, they allow faster classification of new records.
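
The selection step described in this abstract (pick the local model that is most similar to the others) can be sketched in Python as follows. This is only an illustration under stated assumptions: each tree is flattened to a set of rule strings, and Jaccard similarity stands in for the paper's syntactic similarity measure, which the abstract does not define. The names `jaccard`, `select_representative`, and `local_trees` are hypothetical.

```python
from typing import FrozenSet, Sequence


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two rule sets (a stand-in, not the SySM measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_representative(models: Sequence[FrozenSet[str]]) -> int:
    """Return the index of the model with the highest mean similarity to the others."""
    if not models:
        raise ValueError("need at least one model")
    if len(models) == 1:
        return 0
    best_index, best_score = 0, -1.0
    for i, model in enumerate(models):
        sims = [jaccard(model, other) for j, other in enumerate(models) if j != i]
        score = sum(sims) / len(sims)
        if score > best_score:  # ties keep the earliest model
            best_index, best_score = i, score
    return best_index


# Hypothetical local trees, one per computing node, flattened to rule sets.
local_trees = [
    frozenset({"age<=30 -> no", "income>50k -> yes", "owner -> yes"}),
    frozenset({"age<=30 -> no", "income>50k -> yes", "student -> no"}),
    frozenset({"age<=30 -> no", "student -> no", "urban -> yes"}),
]
representative = select_representative(local_trees)
print(representative)  # index of the chosen tree
```

Only the selected tree is then used for classification, which is why the result stays as compact as a single decision tree, unlike a voting ensemble that must keep every local model.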

Evolutionary Algorithm for Improving Decision Tree with Global Discretization in Manufacturing

Sensors ◽ 10.3390/s21082849 ◽ 2021 ◽ Vol 21 (8) ◽ pp. 2849

Author(s): Sungbum Jun

Keyword(s): Decision Tree ◽ Evolutionary Algorithm ◽ Decision Trees ◽ Manufacturing Systems ◽ Ensemble Methods ◽ Machine Learning Techniques ◽ Learning Techniques ◽ Industrial Internet ◽ Tree Models ◽ Real World Datasets

Due to recent advances in the industrial Internet of Things (IoT) in manufacturing, the vast amount of data from sensors has triggered the need to leverage such big data for fault detection. In particular, interpretable machine learning techniques, such as tree-based algorithms, have drawn attention owing to the need to implement reliable manufacturing systems and to identify the root causes of faults. However, despite the high interpretability of decision trees, tree-based models make a trade-off between accuracy and interpretability. In order to improve the tree's performance while maintaining its interpretability, an evolutionary algorithm for discretization of multiple attributes, called Decision tree Improved by Multiple sPLits with Evolutionary algorithm for Discretization (DIMPLED), is proposed. The experimental results with two real-world datasets from sensors showed that the decision tree improved by DIMPLED outperformed the single-decision-tree models (C4.5 and CART) that are widely used in practice, and it proved competitive with the ensemble methods, which have multiple decision trees. Even though the ensemble methods could produce slightly better performance, the proposed DIMPLED has a more interpretable structure while maintaining an appropriate performance level.

Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification

Journal Of Big Data ◽ 10.1186/s40537-019-0186-3 ◽ 2019 ◽ Vol 6 (1) ◽ Cited By ~ 4

Author(s): Abraham Itzhak Weinberg ◽ Mark Last

Keyword(s): Big Data ◽ Decision Tree ◽ Data Classification ◽ Tree Models ◽ Big Data Classification

Cost Effectiveness of Caplacizumab in Acquired Thrombotic Thrombocytopenic Purpura

Blood ◽ 10.1182/blood-2020-138515 ◽ 2020 ◽ Vol 136 (Supplement 1) ◽ pp. 18-19

Author(s): George Goshua ◽ Pranay Sinha ◽ Jeanne E. Hendrickson ◽ Christopher A. Tormey ◽ Pavan Bendapudi ◽ ...

Keyword(s): Cost Effectiveness ◽ Decision Tree ◽ Markov Model ◽ Hospital Length ◽ Hospital Length Of Stay ◽ Tree Models ◽ The Difference ◽ The Cost ◽ Titan Trial

Introduction: Acquired thrombotic thrombocytopenic purpura (TTP) is a life-threatening disease characterized by thrombotic microangiopathy leading to end-organ damage. The standard of care (SOC) treatment is therapeutic plasma exchange (TPE) alongside immunomodulation with steroids, with increasing use of rituximab +/- other immunomodulatory agents. The addition of caplacizumab, a nanobody targeting von Willebrand factor, was shown to accelerate platelet count recovery, reduce TPE treatments and hospital length of stay, decrease exacerbations and increase relapses in TTP patients treated in the TITAN and HERCULES trials. Given the efficacy of caplacizumab in the TITAN and HERCULES trials, we conducted a cost effectiveness analysis (CEA) of caplacizumab in acquired TTP, representing the first-ever CEA in TTP. Methods: We built decision tree models to evaluate the cost effectiveness of SOC plus caplacizumab versus SOC in acquired TTP based on the results of each of the phase II TITAN trial at 12-month follow-up and the phase III HERCULES trial at 1-month follow-up. Costs were assessed from the health system perspective. For each trial, the SOC cost was calculated as the sum of TPE sessions, hospital length-of-stay (LOS), intensive care unit (ICU) stay, and rituximab use, while the cost of the SOC plus caplacizumab arm included the SOC cost plus the list price of caplacizumab (USD $270,000 per TTP episode). Effectiveness was calculated in quality-adjusted life years (QALY). Cost effectiveness of each treatment arm was calculated as the ratio of cost divided by QALYs. The incremental cost effectiveness ratio (ICER) of adding caplacizumab to SOC was calculated as the difference between the costs of the two treatment arms divided by the difference in QALYs; the ICER was then compared against the 2019 US willingness-to-pay (WTP) threshold of $195,300 USD as a measure of overall cost effectiveness. 
To avoid potential confounding factors that might inadvertently bias our analysis against the addition of caplacizumab, all values used in our models were selected to maximize cost in the SOC arm and minimize cost in the SOC plus caplacizumab arm in the two clinical trials. We also created a Markov model comparing cost effectiveness of SOC plus caplacizumab versus SOC in acquired TTP with a 5-year time horizon. We performed one-way sensitivity analyses for all models varying parameters including LOS, ICU stay, number of TPE sessions, rituximab use, utilities of the well and disease states, and caplacizumab cost. Results: In the decision tree models, caplacizumab use yielded a higher cost of treatment compared to SOC alone in both trials (TITAN: $325,647 for caplacizumab plus SOC, versus $89,750 for SOC; HERCULES: $323,547 for caplacizumab plus SOC, versus $83,634 for SOC). An improvement in QALYs with the addition of caplacizumab was noted as compared to SOC in both trials (0.07 in TITAN and 0.26 in HERCULES). The ICER for adding caplacizumab to SOC versus SOC alone was $3.7 million in the TITAN trial and $0.9 million in the HERCULES trial, well above the US WTP threshold. The 5-year horizon Markov model yielded higher cost of caplacizumab treatment compared to SOC alone ($551,878 versus $151,947) and an improvement in QALYs (3.19 versus 2.92). The ICER for adding caplacizumab to SOC was $1.5 million (95% confidence interval $1.25-$1.72 million) with SOC favored in 100% of 10,000 Monte Carlo simulations in a probabilistic sensitivity analysis. Among all parameters, decreasing the cost of caplacizumab had the greatest impact on decreasing the ICERs in all models. The price of caplacizumab treatment for one TTP episode to meet the 2019 US WTP would have to be $46,424 and $80,848 in the TITAN and HERCULES decision tree models, respectively, and $65,106 in a Markov model with a 5-year horizon.
Conclusion: The addition of caplacizumab to SOC treatment is not cost effective at its current drug pricing. As our models are designed to maximize the cost effectiveness of caplacizumab, it is very likely that the actual costs incurred by this medication will be much higher than what we report here. Compared to CEA studies of other orphan drugs that, unlike caplacizumab, alter long-term disease course, the costs incurred by caplacizumab treatment in acquired TTP are at the higher end of the spectrum. Additional studies utilizing longer-term follow-up data are warranted to assess the full impact of caplacizumab on the cost of treating TTP. Disclosures: No relevant conflicts of interest to declare.
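
As a worked illustration of the ICER definition in the Methods (difference in cost divided by difference in QALYs), the short Python sketch below plugs in the 5-year Markov model figures quoted in the Results. The function name `icer` and the variable names are hypothetical. The decision-tree ICERs may not be reproduced exactly from the rounded QALY gains given in the abstract, so only the Markov figures are used here.

```python
def icer(cost_new: float, cost_ref: float, qaly_new: float, qaly_ref: float) -> float:
    """Incremental cost effectiveness ratio: extra cost per extra QALY."""
    delta_qaly = qaly_new - qaly_ref
    if delta_qaly == 0:
        raise ZeroDivisionError("no QALY difference; the ICER is undefined")
    return (cost_new - cost_ref) / delta_qaly


# Figures quoted in the abstract for the 5-year Markov model.
markov_icer = icer(cost_new=551_878, cost_ref=151_947, qaly_new=3.19, qaly_ref=2.92)

# 2019 US willingness-to-pay threshold quoted in the abstract (USD per QALY).
WTP_THRESHOLD = 195_300
is_cost_effective = markov_icer <= WTP_THRESHOLD

print(f"ICER: ${markov_icer:,.0f} per QALY; cost effective: {is_cost_effective}")
```

The result is close to the $1.5 million reported for the Markov model and far above the willingness-to-pay threshold, consistent with the abstract's conclusion.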

Fighting Under-price DoS Attack in Ethereum with Machine Learning Techniques

ACM SIGMETRICS Performance Evaluation Review ◽ 10.1145/3466826.3466835 ◽ 2021 ◽ Vol 48 (4) ◽ pp. 24-27

Author(s): Jose Eduardo A. Sousa ◽ Vinicius C. Oliveira ◽ Julia Almeida Valadares ◽ Alex Borges Vieira ◽ Heder S. Bernardino ◽ ...

Keyword(s): Machine Learning ◽ Decision Tree ◽ Denial Of Service ◽ Ensemble Methods ◽ Machine Learning Techniques ◽ Security Threats ◽ Network Behavior ◽ Dos Attack ◽ Learning Techniques ◽ Tree Models

Ethereum is currently one of the most popular cryptocurrencies, and it has been facing security threats and attacks. As a consequence, Ethereum users may experience long waits for transactions to be validated. Despite maintenance of the Ethereum mechanisms, there are still indications that it remains susceptible to certain kinds of attacks. In this work, we analyze the Ethereum network behavior during an under-priced DoS attack, where malicious users try to perform denial-of-service attacks that exploit flaws in the fee mechanism of this cryptocurrency. We propose the application of machine learning techniques and ensemble methods to detect this attack, using the available transaction attributes. The proposed approaches perform well; for example, the Decision Tree models achieve AUC-ROC, F-score, and recall larger than 0.94, 0.82, and 0.98, respectively.

A decision tree classifier for credit assessment problems in big data environments

Information Systems and e-Business Management ◽ 10.1007/s10257-021-00511-w ◽ 2021

Author(s): Ching-Chin Chern ◽ Weng-U Lei ◽ Kwei-Long Huang ◽ Shu-Yi Chen

Keyword(s): Big Data ◽ Decision Tree ◽ Decision Tree Classifier ◽ Tree Classifier ◽ Assessment Problems

Data mining of coronavirus: SARS-CoV-2, SARS-CoV and MERS-CoV

BMC Research Notes ◽ 10.1186/s13104-021-05561-4 ◽ 2021 ◽ Vol 14 (1)

Author(s): Jung Eun Huh ◽ Seunghee Han ◽ Taeseon Yoon

Keyword(s): Machine Learning ◽ Amino Acid ◽ Amino Acid Sequence ◽ Decision Tree ◽ Machine Learning Algorithms ◽ High Similarity ◽ Incubation Periods ◽ Initial Question ◽ The Difference ◽ Blast Program

Objective: In this study, we compare the amino acid and codon sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using different statistics programs to understand their characteristics. Specifically, we are interested in how differences in the amino acid and codon sequences can lead to different incubation periods and outbreak periods. Our initial question was to compare SARS-CoV-2 to different viruses in the coronavirus family using the NCBI BLAST program and machine learning algorithms. Results: Experiments using BLAST, Apriori and Decision Tree showed that SARS-CoV-2 has high similarity with SARS-CoV and comparatively low similarity with MERS-CoV. We decided to compare the codons of SARS-CoV-2 and MERS-CoV to see the difference. Though the viruses are very alike according to the BLAST and Apriori experiments, SVM showed that they can be effectively classified using non-linear kernels. The Decision Tree experiment revealed several remarkable properties of the SARS-CoV-2 amino acid sequence that cannot be found in the MERS-CoV amino acid sequence. The ultimate purpose of this paper is to minimize the damage to humanity from SARS-CoV-2. Hence, further studies can focus on comparing SARS-CoV-2 with other viruses that can also be transmitted during latent periods.

A Practical Tutorial for Decision Tree Induction

ACM Computing Surveys ◽ 10.1145/3429739 ◽ 2021 ◽ Vol 54 (1) ◽ pp. 1-38

Author(s): Víctor Adrián Sosa Hernández ◽ Raúl Monroy ◽ Miguel Angel Medina-Pérez ◽ Octavio Loyola-González ◽ Francisco Herrera

Keyword(s): Decision Tree ◽ Decision Trees ◽ Machine Learning Techniques ◽ Evaluation Measures ◽ Decision Tree Induction ◽ Learning Techniques ◽ Tree Models ◽ Evaluation Measure ◽ Main Components ◽ Support Decision Making

Experts from different domains have resorted to machine learning techniques to produce explainable models that support decision-making. Among existing techniques, decision trees have been useful in many application domains for classification. Decision trees can make decisions in a language that is closer to that of the experts. Many researchers have attempted to create better decision tree models by improving the components of the induction algorithm. One of the main components that have been studied and improved is the evaluation measure for candidate splits. In this article, we introduce a tutorial that explains decision tree induction. Then, we present an experimental framework to assess the performance of 21 evaluation measures that produce different C4.5 variants considering 110 databases, two performance measures, and 10× 10-fold cross-validation. Furthermore, we compare and rank the evaluation measures by using a Bayesian statistical analysis. From our experimental results, we present the first two performance rankings in the literature of C4.5 variants. Moreover, we organize the evaluation measures into two groups according to their performance. Finally, we introduce meta-models that automatically determine the group of evaluation measures to produce a C4.5 variant for a new database and some further opportunities for decision tree models.
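
To make the notion of an evaluation measure for candidate splits concrete, here is a minimal Python sketch of gain ratio, the measure used by standard C4.5. It is not taken from the tutorial and does not cover the 21 measures it compares; the function names and the toy data are hypothetical, and only categorical attributes are handled.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(items: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Gain ratio of splitting `labels` by a categorical attribute `values`."""
    total = len(labels)
    partitions: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes attributes with many distinct values
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: `outlook` separates the classes perfectly, `windy` not at all.
labels = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["a", "b", "a", "b"]
print(gain_ratio(outlook, labels), gain_ratio(windy, labels))
```

An induction algorithm evaluates every candidate attribute this way at each node and splits on the highest-scoring one; swapping in a different measure at this point is what yields a different C4.5 variant.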

Financial Supervision Innovation Based on Decision Tree Classification Algorithm in Big Data Era

10.1145/3482632.3482722 ◽ 2021

Author(s): Yunhong Li ◽ Lingling Zhang

Keyword(s): Big Data ◽ Decision Tree ◽ Classification Algorithm ◽ Financial Supervision ◽ Decision Tree Classification

Differential Diagnosis Model of Hypocellular Myelodysplastic Syndrome and Aplastic Anemia Based on the Medical Big Data Platform

Complexity ◽ 10.1155/2018/4824350 ◽ 2018 ◽ Vol 2018 ◽ pp. 1-12 ◽ Cited By ~ 9

Author(s): Jianhui Wu ◽ Lu Zhang ◽ Sufeng Yin ◽ Haidong Wang ◽ Guoli Wang ◽ ...

Keyword(s): Neural Network ◽ Logistic Regression ◽ Big Data ◽ Differential Diagnosis ◽ Decision Tree ◽ Cell Morphology ◽ Bp Neural Network ◽ Data Platform ◽ New Ideas ◽ Medical Big Data

The arrival of the era of big data has brought new ideas for solving problems in all walks of life. In the medical field, clinical data are collected and stored using the medical big data platform. Based on medical big data, new ideas and methods for the differential diagnosis of hypo-MDS and AA are studied. Basic information, peripheral blood classification counts, peripheral blood cell morphology, bone marrow cell morphology, and other information were collected from patients diagnosed with hypo-MDS or AA at first diagnosis. First, statistical analysis was performed. Then, the logistic regression model, decision tree model, BP neural network model, and support vector machine (SVM) model of hypo-MDS and AA were established. The sensitivity, specificity, Youden index, positive likelihood ratio (+LR), negative likelihood ratio (−LR), area under curve (AUC), accuracy, Kappa value, positive predictive value (+PV), and negative predictive value (−PV) of the four models on the training set and test set were compared. Finally, with the support of medical big data, among the four classification algorithms (logistic regression, decision tree, BP neural network, and SVM), the decision tree algorithm was found to be optimal for the classification of hypo-MDS and AA, and the characteristics of the data misjudged by the optimal model were analyzed.
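
Several of the comparison measures listed in this abstract follow directly from a binary confusion matrix. The Python sketch below computes them using their standard definitions; the counts are invented for illustration and are not data from the study, and the names `ConfusionCounts` and `diagnostic_metrics` are hypothetical. AUC and the Kappa value are omitted, and the sketch assumes both classes and both predicted labels occur at least once.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # positives predicted positive
    fp: int  # negatives predicted positive
    tn: int  # negatives predicted negative
    fn: int  # positives predicted negative


def diagnostic_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Standard diagnostic measures derived from a 2x2 confusion matrix."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_index": sensitivity + specificity - 1,
        "positive_lr": sensitivity / (1 - specificity) if specificity < 1 else float("inf"),
        "negative_lr": (1 - sensitivity) / specificity if specificity > 0 else float("inf"),
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "positive_pv": c.tp / (c.tp + c.fp),
        "negative_pv": c.tn / (c.tn + c.fn),
    }


# Invented counts, for illustration only.
counts = ConfusionCounts(tp=40, fp=5, tn=45, fn=10)
metrics = diagnostic_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computing the same dictionary for each model on the training and test sets gives the kind of side-by-side comparison the abstract describes.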

Cost and Hazard Decision Tree Models

Reliability Models of Complex Systems for Robots and Automation ◽ 10.1201/b22491-3 ◽ 2017 ◽ pp. 23-31

Author(s): Hamed Fazlollahtabar ◽ Seyed Taghi Akhavan Niaki

Keyword(s): Decision Tree ◽ Tree Models
