A Random Forest with Minority Condensation and Decision Trees for Class Imbalanced Problems

2021 ◽  
Vol 16 ◽  
pp. 502-507
Author(s):  
Suvaporn Homjandee ◽  
Krung Sinapiromsaran

Building an effective classifier that can identify the target class of instances in a dataset from historical data has played an important role in machine learning for decades. Standard classification algorithms have difficulty generating an appropriate classifier when faced with an imbalanced dataset. In 2019, an efficient splitting measure, minority condensation entropy (MCE) [1], was proposed for building a decision tree that classifies minority instances. The aim of this research is to extend the random forest concept to use both decision trees and minority condensation trees. The algorithm builds a minority condensation tree from a bootstrapped dataset that retains all minority instances, while it builds a decision tree from a balanced bootstrapped dataset. Experimental results on synthetic datasets confirm that the proposed algorithm is better suited to the binary-class imbalanced problem than the standard random forest. Furthermore, experiments on real-world datasets from the UCI repository show that the proposed algorithm constructs a random forest that outperforms other existing random forest algorithms in terms of recall, precision, F-measure, and geometric mean.
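As a rough illustration of the ensemble scheme described above, the sketch below alternates between bootstraps that retain every minority instance and balanced bootstraps. Since MCE-based splitting is not available in standard libraries, an ordinary scikit-learn decision tree stands in for the minority condensation tree; the 50/50 split between the two tree types and the 0/1 label encoding are assumptions, not the paper's specification.

```python
# Sketch of the hybrid bootstrapping scheme. A plain DecisionTreeClassifier
# stands in for the MCE-based minority condensation tree, and labels are
# assumed to be 0 (majority) / 1 (minority).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def hybrid_forest(X, y, n_trees=100, minority_label=1, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    trees = []
    for i in range(n_trees):
        if i % 2 == 0:
            # "Minority condensation" bootstrap: keep ALL minority
            # instances, bootstrap the majority class.
            maj = rng.choice(majority, size=majority.size, replace=True)
            idx = np.concatenate([minority, maj])
        else:
            # Balanced bootstrap: sample both classes to the minority size.
            maj = rng.choice(majority, size=minority.size, replace=True)
            mino = rng.choice(minority, size=minority.size, replace=True)
            idx = np.concatenate([mino, maj])
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    # Majority vote across the ensemble (0/1 labels assumed).
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```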

Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2849
Author(s):  
Sungbum Jun

Due to recent advances in the industrial Internet of Things (IoT) in manufacturing, the vast amount of data from sensors has triggered the need to leverage such big data for fault detection. In particular, interpretable machine learning techniques, such as tree-based algorithms, have drawn attention as a way to implement reliable manufacturing systems and identify the root causes of faults. However, despite the high interpretability of decision trees, tree-based models face a trade-off between accuracy and interpretability. To improve the tree’s performance while maintaining its interpretability, an evolutionary algorithm for the discretization of multiple attributes, called Decision tree Improved by Multiple sPLits with Evolutionary algorithm for Discretization (DIMPLED), is proposed. Experimental results on two real-world sensor datasets showed that the decision tree improved by DIMPLED outperformed the single-decision-tree models (C4.5 and CART) that are widely used in practice, and it proved competitive with ensemble methods that combine multiple decision trees. Even though the ensemble methods could produce slightly better performance, the proposed DIMPLED has a more interpretable structure while maintaining an appropriate performance level.
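The abstract does not spell out DIMPLED's encoding or operators, so the following is only a generic evolutionary-discretization sketch in its spirit: each candidate is a set of cut points per attribute, mutated with Gaussian noise and scored by the cross-validated accuracy of a decision tree on the binned data. Population size, mutation scale, and truncation selection are assumptions.

```python
# Minimal evolutionary discretization: individuals are per-attribute cut
# points; fitness is the CV accuracy of a tree trained on the binned data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def discretize(X, cuts):
    # Bin each column with its own (sorted) cut points.
    return np.column_stack(
        [np.digitize(X[:, j], np.sort(c)) for j, c in enumerate(cuts)])

def fitness(X, y, cuts):
    return cross_val_score(DecisionTreeClassifier(),
                           discretize(X, cuts), y, cv=5).mean()

def evolve(X, y, n_cuts=3, pop=20, gens=30, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Initial population: random cut points inside each attribute's range.
    population = [[rng.uniform(lo[j], hi[j], n_cuts)
                   for j in range(X.shape[1])] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=lambda c: fitness(X, y, c),
                        reverse=True)
        parents = scored[:pop // 2]                 # truncation selection
        children = [[c + rng.normal(0, 0.1 * (hi[j] - lo[j]), n_cuts)
                     for j, c in enumerate(p)]
                    for p in parents]               # Gaussian mutation
        population = parents + children
    return max(population, key=lambda c: fitness(X, y, c))
```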


2020 ◽  
Vol 36 (2) ◽  
pp. 173-185
Author(s):  
Hoang Ngoc Thanh ◽  
Tran Van Lang

The UNSW-NB15 dataset was created by the Australian Cyber Security Centre in 2015 using the IXIA tool to capture normal behaviors and modern attacks; it includes normal data and nine types of attacks described by 49 features. Previous research has shown that the detection of Fuzzers attacks in this dataset yields the lowest classification quality. This paper analyzes and evaluates the performance of well-known ensemble techniques, such as Bagging, AdaBoost, Stacking, Decorate, Random Forest, and Voting, for detecting Fuzzers attacks on the UNSW-NB15 dataset. The experimental results show that AdaBoost with decision-tree component classifiers achieves the best classification quality, with an F-measure of 96.76%, compared with 94.16%, the best result obtained by single classifiers, and 96.36% obtained by the Random Forest technique.
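A minimal sketch of the best-performing configuration reported above, AdaBoost with decision-tree component classifiers evaluated by F-measure. The `load_unsw_nb15_fuzzers` loader is hypothetical, and the tree depth and number of estimators are assumed rather than taken from the paper.

```python
# AdaBoost with decision-tree component classifiers, scored by F-measure.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical loader: features vs. a binary "Fuzzers" label.
X, y = load_unsw_nb15_fuzzers()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# `estimator=` is the scikit-learn >= 1.2 name (formerly `base_estimator=`).
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=8),
                         n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("F-measure:", f1_score(y_test, clf.predict(X_test)))
```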


2013 ◽  
Vol 347-350 ◽  
pp. 1903-1906
Author(s):  
Jian Wang ◽  
Wei Qing Ma ◽  
Xiao Long Zhao ◽  
Fei Fei Liang

Aiming at the collapse, disconnection, and ice flashover problems caused by icing on power grid transmission lines, this paper proposes a method for predicting transmission line icing. Based on historical data and decision tree theory, the information entropy of each factor is calculated; taking into account each influencing factor's attributes and attribute values, as well as the connections between adjacent layers, the best test factor is selected as the root node to complete the model. The model is easy to follow, and it achieves high prediction accuracy and efficiency, providing a strong guarantee for the safe operation of power grid transmission lines.
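The root-selection step described above can be sketched as follows: compute the information gain of each candidate factor and choose the factor with the highest gain as the root node. The factor names and data layout are illustrative, not taken from the paper.

```python
# Entropy-based root selection: the factor with the highest information
# gain over the icing labels becomes the root node of the tree.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    gain = entropy(labels)
    for value, count in zip(*np.unique(feature, return_counts=True)):
        gain -= count / len(labels) * entropy(labels[feature == value])
    return gain

# factors: dict mapping factor name (e.g. "temperature", "humidity",
# "wind_speed" -- illustrative names) to an array of attribute values.
def pick_root(factors, icing_labels):
    return max(factors,
               key=lambda name: information_gain(factors[name],
                                                 icing_labels))
```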


Author(s):  
Gábor Szucs

The objective of this chapter is to present a brief literature review and new research results in privacy-preserving data mining, an important privacy issue in the e-business area. The chapter focuses on classification problems in business analytics, where enterprises can gain large profits from the predicted results of classification. The decision tree is a well-known classification technique, and its modification with the Randomized Response technique for privacy-preserving data mining is described. This algorithm is developed for all types of attributes. The largest contribution of this chapter is a new method, the so-called Random Response Forest, consisting of many decision trees and a randomization technique. Random Response Forest is similar to Random Forest, but it is able to solve privacy problems. It consists of many shallow trees, where a shallow tree is a special decision tree built with the Randomized Response technique, and the precision of Random Response Forest is better than that of a single tree.
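A minimal sketch of the Randomized Response perturbation underlying the method, assuming binary 0/1 attributes and a keep-probability theta (the chapter covers all attribute types, which this sketch does not):

```python
# Randomized Response: each binary attribute value is kept with
# probability theta and flipped otherwise, so the miner never sees
# the true values. theta = 0.8 is an illustrative choice.
import numpy as np

def randomized_response(X, theta=0.8, seed=0):
    """Flip each 0/1 entry of X independently with probability 1 - theta."""
    rng = np.random.default_rng(seed)
    flip = rng.random(X.shape) > theta
    return np.where(flip, 1 - X, X)
```

The shallow trees of a Random Response Forest would then be trained on such perturbed data, with the known theta used to de-bias the class frequencies estimated at each node.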


2017 ◽  
Vol 1 (2) ◽  
pp. 180
Author(s):  
Saifullah Saifullah ◽  
Muhammad Zarlis ◽  
Zakaria Zakaria ◽  
Rahmat Widia Sembiring

Data preprocessing requires suitable methods to obtain better results. This research processes an employee dataset as preprocessing input. Decision algorithms, namely random tree and random forest, are then applied; the decision trees are used to create a model of the rules selected in the decision process. The results of the preprocessing approach and the model rules obtained can serve as a reference for decision makers in deciding which variables should be considered to support improvements in employee performance.
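A minimal pipeline sketch under assumed column names: impute and encode the employee attributes, then fit a random forest. The preprocessing choices are illustrative, not the study's.

```python
# Preprocessing + random forest pipeline; column names are assumptions
# and X_train is expected to be a DataFrame containing those columns.
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

numeric = ["age", "tenure_years"]            # assumed numeric columns
categorical = ["department", "education"]    # assumed categorical columns

pre = make_column_transformer(
    (SimpleImputer(strategy="median"), numeric),
    (OneHotEncoder(handle_unknown="ignore"), categorical))
model = make_pipeline(pre, RandomForestClassifier(random_state=0))
# model.fit(X_train, y_train)  # fit on the preprocessed employee dataset
```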


1986 ◽  
Vol 25 (04) ◽  
pp. 207-214
Author(s):  
P. Glasziou

Summary
The development of investigative strategies by decision analysis has been achieved by explicitly drawing the decision tree, either by hand or on computer. This paper discusses the feasibility of automatically generating and analysing decision trees from a description of the investigations and the treatment problem. The investigation of cholestatic jaundice is used to illustrate the technique. Methods to decrease the number of calculations required are presented. It is shown that this method makes practical the simultaneous study of at least half a dozen investigations. However, some new problems arise due to the possible complexity of the resulting optimal strategy. If protocol errors and delays due to testing are considered, simpler strategies become desirable. Generation and assessment of these simpler strategies are discussed with examples.
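To make the analysis step concrete, here is a toy expected-utility rollback of a decision tree: chance nodes average utilities over outcome probabilities, and decision nodes take the best branch. The node representation and the test-versus-treat numbers are invented for illustration and are not the paper's.

```python
# Rollback (backward induction) over a nested-dict decision tree.
def rollback(node):
    if "utility" in node:                       # leaf node
        return node["utility"]
    if node["type"] == "chance":
        # Expected utility over the outcome probabilities.
        return sum(p * rollback(child) for p, child in node["branches"])
    # Decision node: choose the branch with the highest expected utility.
    return max(rollback(child) for _, child in node["branches"])

# Toy test-vs-treat problem with invented probabilities and utilities.
tree = {"type": "decision", "branches": [
    ("treat now", {"type": "chance", "branches": [
        (0.3, {"utility": 0.9}), (0.7, {"utility": 0.6})]}),
    ("do test", {"type": "chance", "branches": [
        (0.3, {"utility": 0.85}), (0.7, {"utility": 0.8})]})]}
print(rollback(tree))  # expected utility of the optimal strategy: 0.815
```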


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
João Lobo ◽  
Rui Henriques ◽  
Sara C. Madeira

Abstract
Background: Three-way data have gained popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with the state of the art is paramount. These comparisons are usually performed on real data without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (the triclustering solution) as output.
Results: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.
Conclusions: Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics to compare solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
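As a minimal illustration of what the generator automates, the sketch below plants a single constant-pattern tricluster in a numeric three-way array and keeps the subspace indices as ground truth. Sizes, the background distribution, and the pattern are assumptions; G-Tric itself supports far richer patterns, structures, and quality controls.

```python
# Plant one tricluster in a numeric 3-way dataset
# (observations x features x contexts) and record the ground truth.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 30, 8))             # background distribution

rows = rng.choice(100, size=10, replace=False)   # tricluster subspace
cols = rng.choice(30, size=5, replace=False)
ctxs = rng.choice(8, size=3, replace=False)

# Constant-pattern tricluster plus small noise on the chosen subspace.
data[np.ix_(rows, cols, ctxs)] = 3.0 + rng.normal(0, 0.1, (10, 5, 3))

# Ground truth (the "triclustering solution") for extrinsic evaluation.
ground_truth = {"rows": rows, "cols": cols, "contexts": ctxs}
```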


2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 268-269
Author(s):  
Jaime Speiser ◽  
Kathryn Callahan ◽  
Jason Fanning ◽  
Thomas Gill ◽  
Anne Newman ◽  
...  

Abstract
Advances in computational algorithms and the availability of large datasets with clinically relevant characteristics provide an opportunity to develop machine learning prediction models to aid in the diagnosis, prognosis, and treatment of older adults. Some studies have employed machine learning methods for prediction modeling, but skepticism of these methods remains due to lack of reproducibility and difficulty understanding the complex algorithms behind the models. We aim to provide an overview of two common machine learning methods: decision tree and random forest. We focus on these methods because they provide a high degree of interpretability. We discuss the underlying algorithms of decision tree and random forest methods and present a tutorial for developing prediction models for serious fall injury using data from the Lifestyle Interventions and Independence for Elders (LIFE) study. A decision tree is a machine learning method that produces a model resembling a flow chart. A random forest consists of a collection of many decision trees whose results are aggregated. In the tutorial example, we discuss evaluation metrics and interpretation for these models. Illustrated on data from the LIFE study, prediction models for serious fall injury were moderate at best (area under the receiver operating characteristic curve of 0.54 for the decision tree and 0.66 for the random forest). Machine learning methods may offer improved performance compared to traditional models for modeling outcomes in aging, but their use should be justified and their output carefully described. Models should be assessed by clinical experts to ensure compatibility with clinical practice.
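A compact sketch of the tutorial's workflow: fit a decision tree and a random forest to a binary serious-fall-injury outcome and compare areas under the curve. The `load_life_study` loader is hypothetical since the LIFE data are not public here, and the hyperparameters are assumed.

```python
# Decision tree vs. random forest on a binary outcome, compared by AUC.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_life_study()  # hypothetical loader for the LIFE variables
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (DecisionTreeClassifier(max_depth=4, random_state=0),
              RandomForestClassifier(n_estimators=500, random_state=0)):
    auc = roc_auc_score(y_te,
                        model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(type(model).__name__, "AUC:", round(auc, 2))
```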

