Instance Reduction for Avoiding Overfitting in Decision Trees

Decision tree learning is one of the most practical classification methods in machine learning and is used for approximating discrete-valued target functions. However, decision trees may overfit the training data, which limits their ability to generalize to unseen instances. In this study, we investigated the use of instance reduction techniques to smooth the decision boundaries before training the decision trees. Noise filters such as ENN, RENN, and ALLKNN remove noisy instances, while DROP3 and DROP5 may remove genuine instances. Extensive empirical experiments were conducted on 13 benchmark datasets from the UCI Machine Learning Repository, with and without intentionally introduced noise. Empirical results show that eliminating border instances improves the classification accuracy of decision trees and reduces the tree size, which in turn reduces training and classification times. In datasets without intentionally added noise, applying noise filters without the built-in Reduced Error Pruning gave the best classification accuracy. ENN, RENN, and ALLKNN outperformed decision tree learning without pruning in 9, 9, and 8 out of 13 datasets, respectively. The datasets reduced using ENN and RENN without built-in pruning were more effective when noise was intentionally introduced in different ratios.
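As a rough illustration of the noise-filtering step, the sketch below implements the basic Edited Nearest Neighbours (ENN) rule in plain Python: an instance is kept only if its label agrees with the majority label of its k nearest neighbours. The toy data, the choice k=3, and the function name `enn_filter` are assumptions made for this example and are not taken from the study; the reduced set would then be handed to whichever decision tree learner is in use.

```python
from collections import Counter
import math


def enn_filter(X, y, k=3):
    """Keep an instance only if the majority label of its k nearest
    neighbours (excluding itself) matches its own label."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from instance i to every other instance, nearest first.
        neighbours = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        votes = Counter(y[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): two clusters, plus one point labelled "b"
# that sits in the middle of the "a" cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

X_reduced, y_reduced = enn_filter(X, y, k=3)
```

With an odd k and two classes the vote cannot tie; for more classes a tie-breaking rule would have to be chosen explicitly.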


Application of multi-sensor unmanned aerial system for identification of hydrothermal alteration zones

10.5194/egusphere-egu2020-12546 ◽

2020 ◽

Author(s):

Yosoon Choi ◽

Jieun Baek ◽

Jangwon Suh ◽

Sung-Min Kim

Keyword(s):

Machine Learning ◽

Classification Accuracy ◽

Training Data ◽

Sensor Data ◽

Machine Learning Techniques ◽

Integrated Analysis ◽

Unmanned Aerial System ◽

Data Sets ◽

Learning Techniques ◽

Hydrothermal Alteration Zones

In this study, we proposed a method to utilize a multi-sensor Unmanned Aerial System (UAS) for exploration of hydrothermal alteration zones. The study area (10 m × 20 m) is composed mainly of andesite and located on the coast, with wide outcrops and well-developed structural and mineralization elements. Multi-sensor (visible, multispectral, thermal, magnetic) data were acquired in the study area using the UAS and analyzed using machine learning techniques. To apply the machine learning techniques, we used stratified random sampling to draw 1000 training samples from the hydrothermal zone and 1000 from the non-hydrothermal zone, as identified through the field survey. The resulting 2000 samples for supervised learning were first split into 1500 for training and 500 for testing. The 1500 training samples were then split into 1200 for training and 300 for validation. The training and validation data were generated in five sets to enable cross-validation. Five machine learning techniques were applied to the training data sets: k-Nearest Neighbors (k-NN), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Deep Neural Network (DNN). In the integrated analysis of multi-sensor data with these five techniques, RF and SVM showed high classification accuracy of about 90%. Moreover, integrated analysis of multi-sensor data yielded relatively higher classification accuracy for all five techniques than analysis of magnetic sensing data or single optical sensing data alone.
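The sample counts in the abstract can be reproduced with a small sketch. It assumes that the 500 test samples are drawn evenly from both zones and that the "five sets" are the five folds of a 5-fold partition of the 1500 remaining samples (1500 / 5 = 300 validation samples per set); the abstract does not spell out either detail, and the helper name and label strings are placeholders.

```python
import random


def stratified_split(indices_by_class, n_per_class, rng):
    """Draw n_per_class indices from each class; return (drawn, remaining)."""
    drawn, remaining = [], []
    for idx in indices_by_class.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        drawn.extend(shuffled[:n_per_class])
        remaining.extend(shuffled[n_per_class:])
    return drawn, remaining


rng = random.Random(0)

# 1000 samples per zone, as in the abstract; the label strings are placeholders.
labels = ["hydrothermal"] * 1000 + ["non_hydrothermal"] * 1000
by_class = {}
for i, lab in enumerate(labels):
    by_class.setdefault(lab, []).append(i)

# 2000 -> 500 for testing (250 per class, assumed) and 1500 remaining.
test_idx, train_pool = stratified_split(by_class, 250, rng)

# Group the remaining 1500 indices by class (they are already shuffled).
pool_by_class = {}
for i in train_pool:
    pool_by_class.setdefault(labels[i], []).append(i)

# Five training/validation sets: 300 for validation, 1200 for training each.
folds = []
for fold in range(5):
    val_idx = []
    for idx in pool_by_class.values():
        size = len(idx) // 5  # 150 per class
        val_idx.extend(idx[fold * size:(fold + 1) * size])
    val_set = set(val_idx)
    train_idx = [i for i in train_pool if i not in val_set]
    folds.append((train_idx, val_idx))
```

Each pair in `folds` holds the indices for one training/validation set, and the test indices stay untouched until the final evaluation.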


Development and Optimization of VGF-GaAs Crystal Growth Process Using Data Mining and Machine Learning Techniques

Crystals ◽

10.3390/cryst11101218 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1218

Author(s):

Natasha Dropka ◽

Klaus Böttcher ◽

Martin Holena

Keyword(s):

Machine Learning ◽

Data Mining ◽

Crystal Growth ◽

Decision Trees ◽

Growth Process ◽

Training Data ◽

Machine Learning Techniques ◽

Interface Position ◽

Crystal Growth Process ◽

Learning Techniques

The aim of this study was to assess the ability of various data mining and supervised machine learning techniques (correlation analysis, k-means clustering, principal component analysis, and decision trees for regression and classification) to derive, optimize and understand the factors influencing VGF-GaAs growth. Training data were generated by Computational Fluid Dynamics (CFD) simulations and consisted of 130 datasets with 6 inputs (growth rate and power of 5 heaters) and 5 outputs (interface position and deflection, and temperatures at various positions in GaAs). Data mining results confirmed a good dispersion of the training data without the feasibility of a dimensionality reduction. Data clustering was observed in relation to the position of the crystallization front relative to the side heaters. Based on the statistical performance criteria and training results, decision trees identified the most decisive inputs and their ranges for a favorable interface shape and to keep GaAs temperature beyond limits for heavy arsenic evaporation. Decision trees are a recommendable machine learning technique with short training times and acceptable predictive accuracy based on a small volume of CFD training data, capable of providing guidelines for understanding the crystal growth process, which is a prerequisite for the growth of low-cost, high-quality bulk crystals.


Using Machine Learning Methods to Develop a Short Tree-Based Adaptive Classification Test: Case Study With a High-Dimensional Item Pool and Imbalanced Data

Applied Psychological Measurement ◽

10.1177/0146621620931198 ◽

2020 ◽

Vol 44 (7-8) ◽

pp. 499-514

Author(s):

Yi Zheng ◽

Hyunjung Cheon ◽

Charles M. Katz

Keyword(s):

Machine Learning ◽

Classification Accuracy ◽

Imbalanced Data ◽

Training Data ◽

Adaptive Tests ◽

Promising Alternative ◽

Adaptive Classification ◽

Short Tree ◽

Classification Test

This study explores advanced techniques in machine learning to develop a short tree-based adaptive classification test based on an existing lengthy instrument. A case study was carried out for an assessment of risk for juvenile delinquency. Two unique facts of this case are that (a) the items in the original instrument measure a large number of distinctive constructs and (b) the target outcomes are of low prevalence, which leads to imbalanced training data. Due to the high dimensionality of the items, traditional item response theory (IRT)-based adaptive testing approaches may not work well, whereas decision trees, which were developed in the machine learning discipline, present a promising alternative solution for adaptive tests. A cross-validation study was carried out to compare eight tree-based adaptive test constructions with five benchmark methods using data from a sample of 3,975 subjects. The findings reveal that the best-performing tree-based adaptive tests yielded better classification accuracy than the benchmark method IRT scoring with optimal cutpoints, and yielded comparable or better classification accuracy than the best benchmark method, random forest with balanced sampling. The competitive classification accuracy of the tree-based adaptive tests also comes with an over 30-fold reduction in the length of the instrument, administering only between 3 and 6 items to any individual. This study suggests that tree-based adaptive tests have enormous potential when used to shorten instruments that measure a large variety of constructs.
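To make the idea of a tree-based adaptive test concrete, here is a minimal sketch of how such a test could be administered: each internal node holds one item, the answer decides which branch is followed, and only the items on the respondent's path are ever asked. The item names, the tree shape, and the risk labels are invented for illustration and do not come from the instrument in the study.

```python
def administer(tree, answer_item):
    """Walk the tree, asking only the items on the respondent's path.

    answer_item is a callable that takes an item name and returns True/False.
    Returns the leaf classification and the list of items that were asked.
    """
    node, asked = tree, []
    while "risk" not in node:
        item = node["item"]
        asked.append(item)
        node = node["yes"] if answer_item(item) else node["no"]
    return node["risk"], asked


# Hypothetical three-level tree; item names are placeholders.
tree = {
    "item": "item_17",
    "yes": {
        "item": "item_04",
        "yes": {"risk": "high"},
        "no": {
            "item": "item_29",
            "yes": {"risk": "high"},
            "no": {"risk": "low"},
        },
    },
    "no": {
        "item": "item_08",
        "yes": {"risk": "low"},
        "no": {"risk": "low"},
    },
}

# A respondent's answers to the full (hypothetical) item pool.
responses = {"item_17": True, "item_04": False, "item_29": True, "item_08": False}
risk, asked = administer(tree, lambda item: responses[item])
```

The test length for a respondent equals the depth of the leaf they reach, which is why a shallow tree can replace a much longer fixed-length instrument.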


Development and Validation of a Data-Driven Fault Detection and Diagnosis System for Chillers Using Machine Learning Algorithms

Energies ◽

10.3390/en14071945 ◽

2021 ◽

Vol 14 (7) ◽

pp. 1945

Author(s):

Icksung Kim ◽

Woohyun Kim

Keyword(s):

Fault Detection ◽

Classification Accuracy ◽

Energy Savings ◽

Cost Savings ◽

Fault Detection And Diagnosis ◽

Machine Learning Algorithms ◽

Training Data ◽

Data Driven ◽

Classification Methods ◽

Detection And Diagnosis

Fault detection and diagnosis (FDD) systems enable high cost savings and energy savings that could have economic and environmental impact. This study aims to develop and validate a data-driven FDD system for a chiller. The system uses historical operation data to capture quantitative correlations among system variables. This study evaluated the effectiveness and robustness of eight FDD classification methods based on the experimental data of the chiller (the ASHRAE 1043-RP project). The training data used for the FDD system are classified into four cases. Moreover, true and false positive rates are used to characterize the performance of the classification methods. The results show that classification of local faults is not significantly sensitive to the training data and achieves high accuracy in all cases. For system faults, the amount of data and the severity levels have a significant effect on the classification accuracy.
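The true and false positive rates mentioned above follow the usual definitions, TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A small sketch with made-up labels (not data from the ASHRAE 1043-RP project) shows the calculation:

```python
def true_false_positive_rates(y_true, y_pred, positive):
    """Return (TPR, FPR) for one class treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical ground truth and classifier output for eight samples.
y_true = ["fault", "fault", "fault", "fault",
          "normal", "normal", "normal", "normal"]
y_pred = ["fault", "fault", "fault", "normal",
          "fault", "normal", "normal", "normal"]

tpr, fpr = true_false_positive_rates(y_true, y_pred, positive="fault")
# Here TP=3, FN=1, FP=1, TN=3, so tpr == 0.75 and fpr == 0.25.
```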


Learning Optimal Decision Trees with SAT

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/189 ◽

2018 ◽

Cited By ~ 8

Author(s):

Nina Narodytska ◽

Alexey Ignatiev ◽

Filipe Pereira ◽

Joao Marques-Silva

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Decision Trees ◽

Practical Interest ◽

Training Data ◽

Fundamental Importance ◽

Optimal Decision ◽

Past Work ◽

Computational Problem ◽

Natural Mapping

Explanations of machine learning (ML) predictions are of fundamental importance in different settings. Moreover, explanations should be succinct, to enable easy understanding by humans. Decision trees represent an often used approach for developing explainable ML models, motivated by the natural mapping between decision tree paths and rules. Clearly, smaller trees correlate well with smaller rules, and so one challenge is to devise solutions for computing smallest size decision trees given training data. Although simple to formulate, the computation of smallest size decision trees turns out to be an extremely challenging computational problem, for which no practical solutions are known. This paper develops a SAT-based model for computing smallest-size decision trees given training data. In sharp contrast with past work, the proposed SAT model is shown to scale for publicly available datasets of practical interest.
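The natural mapping between decision tree paths and rules can be sketched in a few lines: every root-to-leaf path becomes one rule whose conditions are the tests along that path, so a tree with fewer leaves and shallower paths gives fewer and shorter rules. The example assumes binary features and a nested-dictionary tree with hypothetical feature names; it illustrates only the path-to-rule mapping, not the SAT model proposed in the paper.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:
        return [(list(conditions), node["label"])]
    feature = node["feature"]
    rules = []
    # Branch where the binary feature is false, then where it is true.
    rules += tree_to_rules(node["no"], conditions + (f"not {feature}",))
    rules += tree_to_rules(node["yes"], conditions + (feature,))
    return rules


# Hypothetical tree over two binary features.
tree = {
    "feature": "outlook_sunny",
    "yes": {
        "feature": "humidity_high",
        "yes": {"label": "stay_in"},
        "no": {"label": "play"},
    },
    "no": {"label": "play"},
}

rules = tree_to_rules(tree)
for conditions, label in rules:
    print("IF " + " AND ".join(conditions) + " THEN " + label)
```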


Hierarchy-guided Neural Networks for Species Classification

10.1101/2021.01.17.427006 ◽

2021 ◽

Cited By ~ 1

Author(s):

Mohannad Elhamod ◽

Kelly M. Diamond ◽

A. Murat Maga ◽

Yasin Bakis ◽

Henry L. Bart ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Neural Networks ◽

Classification Accuracy ◽

Training Data ◽

List Type ◽

Species Classification ◽

Phylogenetic Information ◽

Proposed Model ◽

Fish Distributions

Fish species classification is an important task that is the foundation of many industrial, commercial, ecological, and scientific applications involving the study of fish distributions, dynamics, and evolution. While conventional approaches for this task use off-the-shelf machine learning (ML) methods such as existing Convolutional Neural Network (ConvNet) architectures, there is an opportunity to inform the ConvNet architecture using our knowledge of biological hierarchies among taxonomic classes. In this work, we propose infusing phylogenetic information into the model’s training to guide its structure and relationships among the extracted features. In our extensive experimental analyses, the proposed model, named Hierarchy-Guided Neural Network (HGNN), outperforms conventional ConvNet models in terms of classification accuracy under scarce training data conditions. We also observe that HGNN shows better resilience to adversarial occlusions, when some of the most informative patch regions of the image are intentionally blocked and their effect on classification accuracy is studied.


Hardware Acceleration of Sparse Oblique Decision Trees for Edge Computing

Elektronika ir Elektrotechnika ◽

10.5755/j01.eie.25.5.24351 ◽

2019 ◽

Vol 25 (5) ◽

pp. 18-24

Author(s):

Predrag Teodorovic ◽

Rastislav Struharik

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Hardware Acceleration ◽

Hardware Accelerator ◽

Significant Interest ◽

Benchmark Datasets ◽

Edge Based ◽

Specialized Hardware ◽

Memory Resources ◽

Novel Algorithm

This paper presents a hardware accelerator for sparse decision trees intended for FPGA applications. To the best of the authors’ knowledge, this is the first accelerator of this type. Besides the hardware accelerator itself, a novel algorithm for induction of sparse decision trees is also presented. Sparse decision trees can be attractive because they require fewer memory resources and can be processed more efficiently using specialized hardware than traditional oblique decision trees. This can be of significant interest, particularly in edge-based applications, where memory and compute resources as well as power consumption are severely constrained. The performance of the proposed sparse decision tree induction algorithm and of the developed hardware accelerator is studied using standard benchmark datasets obtained from the UCI Machine Learning Repository database. The results of the experimental study indicate that the proposed algorithm and hardware accelerator compare very favourably with some of the existing solutions.
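For readers unfamiliar with oblique trees, the sketch below shows one plausible software view of a sparse oblique node: instead of comparing a single feature with a threshold, the node compares a weighted sum of features with a threshold, and sparsity means only the few nonzero weights are stored and multiplied. The dictionary layout, weights, thresholds, and class labels are assumptions for illustration; this is neither the induction algorithm nor the hardware design from the paper.

```python
def classify(tree, x):
    """Traverse a sparse oblique decision tree for feature vector x."""
    node = tree
    while "label" not in node:
        # Sparse oblique test: sum over the stored nonzero coefficients only.
        score = sum(w * x[i] for i, w in node["weights"].items())
        node = node["left"] if score <= node["threshold"] else node["right"]
    return node["label"]


# Hypothetical tree over 4 features; each node stores only nonzero weights,
# keyed by feature index.
tree = {
    "weights": {0: 1.0, 3: -2.0},
    "threshold": 0.5,
    "left": {"label": "A"},
    "right": {
        "weights": {2: 1.0},
        "threshold": 1.0,
        "left": {"label": "B"},
        "right": {"label": "C"},
    },
}

label_a = classify(tree, [0.0, 9.0, 0.0, 1.0])  # 0.0 - 2.0 = -2.0 <= 0.5
label_b = classify(tree, [2.0, 9.0, 0.5, 0.0])  # 2.0 > 0.5, then 0.5 <= 1.0
label_c = classify(tree, [2.0, 0.0, 3.0, 0.0])  # 2.0 > 0.5, then 3.0 > 1.0
```

Storing weights as index-value pairs keeps both memory use and the number of multiplications per node proportional to the number of nonzero coefficients rather than to the full feature count.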


Real Time Efficient Accident Predictor System using Machine Learning Techniques (kNN, RF, LR, DT)

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.d6910.1210220 ◽

2020 ◽

Vol 10 (2) ◽

pp. 108-111

Keyword(s):

Machine Learning ◽

Random Forest ◽

Real Time ◽

Classification Accuracy ◽

Nearest Neighbors ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Classification Methods ◽

K Nearest Neighbors ◽

Learning Techniques

A real-time crash predictor system determines the frequency and severity of crashes. Machine learning based methods are now commonly used to predict the total number of crashes. In this project, the prediction accuracy of machine learning algorithms such as Decision Tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), and Logistic Regression (LR) is evaluated. The performance of these classification methods is compared in terms of accuracy. The dataset used for this project was obtained from 49 states of the US and 27 states of India and contains 2.25 million US crash records and 1.16 million Indian crash records, respectively. The results indicate that Random Forest (RF) reaches a classification accuracy of 96%, which compares favorably with the other classification methods.


Maschinelles Lernen zur Prognose von Auftragskennzahlen/Machine learning for the forecasting of key figures of customer orders

wt Werkstattstechnik online ◽

10.37544/1436-4980-2021-03-32 ◽

2021 ◽

Vol 111 (03) ◽

pp. 124-129

Author(s):

Markus Böhm ◽

Klaus Erlach ◽

Thomas Bauernhansl

Keyword(s):

Machine Learning ◽

Complex Systems ◽

Decision Trees ◽

Real Time ◽

Personal Experience ◽

Shop Floor ◽

Maschinelles Lernen ◽

Classification Methods ◽

Digital Models ◽

High Expenditure

Forecasts often form the basis for decisions on the shop floor. Today, such forecasts are mostly derived from personal experience or digital models. For complex systems, this approach reaches the limits of reliability or requires a high expenditure of time. Classification methods from machine learning promise solutions to this problem. Automatically generated decision trees can be one way to produce near-real-time forecasts of key figures in production.


Targeting Prospective Customers: Robustness of Machine-Learning Methods to Typical Data Challenges

Management Science ◽

10.1287/mnsc.2019.3308 ◽

2020 ◽

Vol 66 (6) ◽

pp. 2495-2522 ◽

Cited By ~ 2

Author(s):

Duncan Simester ◽

Artem Timoshenko ◽

Spyros I. Zoumpoulis

Keyword(s):

Machine Learning ◽

Field Experiment ◽

Large Scale ◽

Field Experiments ◽

Training Data ◽

Classification Methods ◽

Learning Methods ◽

Model Driven ◽

Machine Learning Methods

We investigate how firms can use the results of field experiments to optimize the targeting of promotions when prospecting for new customers. We evaluate seven widely used machine-learning methods using a series of two large-scale field experiments. The first field experiment generates a common pool of training data for each of the seven methods. We then validate the seven optimized policies provided by each method together with uniform benchmark policies in a second field experiment. The findings not only compare the performance of the targeting methods, but also demonstrate how well the methods address common data challenges. Our results reveal that when the training data are ideal, model-driven methods perform better than distance-driven methods and classification methods. However, the performance advantage vanishes in the presence of challenges that affect the quality of the training data, including the extent to which the training data captures details of the implementation setting. The challenges we study are covariate shift, concept shift, information loss through aggregation, and imbalanced data. Intuitively, the model-driven methods make better use of the information available in the training data, but the performance of these methods is more sensitive to deterioration in the quality of this information. The classification methods we tested performed relatively poorly. We explain the poor performance of the classification methods in our setting and describe how the performance of these methods could be improved. This paper was accepted by Matthew Shum, marketing.
