Instance Reduction for Avoiding Overfitting in Decision Trees

2021 ◽  
Vol 30 (1) ◽  
pp. 438-459
Author(s):  
Asma’ Amro ◽  
Mousa Al-Akhras ◽  
Khalil El Hindi ◽  
Mohamed Habib ◽  
Bayan Abu Shawar

Decision tree learning is one of the most practical classification methods in machine learning and is used for approximating discrete-valued target functions. However, decision trees may overfit the training data, which limits their ability to generalize to unseen instances. In this study, we investigated the use of instance reduction techniques to smooth the decision boundaries before training the decision trees. Noise filters such as ENN, RENN, and ALLKNN remove noisy instances, while DROP3 and DROP5 may remove genuine instances. Extensive empirical experiments were conducted on 13 benchmark datasets from the UCI Machine Learning Repository, with and without intentionally introduced noise. Empirical results show that eliminating border instances improves the classification accuracy of decision trees and reduces the tree size, which reduces the training and classification times. In datasets without intentionally added noise, applying noise filters without the use of the built-in Reduced Error Pruning gave the best classification accuracy. ENN, RENN, and ALLKNN outperformed decision tree learning without pruning in 9, 9, and 8 out of 13 datasets, respectively. The datasets reduced using ENN and RENN without built-in pruning were more effective when noise was intentionally introduced in different ratios.
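The abstract does not spell out how the noise filters work. As a rough illustration only, the sketch below follows the usual description of ENN (Wilson's Edited Nearest Neighbour rule): an instance is dropped when its label disagrees with the majority label of its k nearest neighbours. The function name `enn_filter`, the choice k=3, the Euclidean distance, and the toy data are all assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter

def enn_filter(points, labels, k=3):
    """Return the indices kept by an ENN-style filter (illustrative sketch)."""
    kept = []
    for i, p in enumerate(points):
        # Distance from instance i to every other instance.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbour_labels = [labels[j] for _, j in others[:k]]
        majority = Counter(neighbour_labels).most_common(1)[0][0]
        # Keep the instance only if its neighbours agree with its label.
        if majority == labels[i]:
            kept.append(i)
    return kept

# Hypothetical toy data: index 3 is a "b" point sitting inside the "a" cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
labels = ["a", "a", "a", "b", "b", "b", "b"]
kept = enn_filter(points, labels, k=3)
```

In this toy case the mislabelled-looking point at index 3 is removed and the rest are kept; a decision tree would then be trained on the reduced set.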

2020 ◽  
Author(s):  
Yosoon Choi ◽  
Jieun Baek ◽  
Jangwon Suh ◽  
Sung-Min Kim

In this study, we proposed a method to utilize a multi-sensor Unmanned Aerial System (UAS) for exploration of hydrothermal alteration zones. The selected study area (10 m × 20 m) is composed mainly of andesite and located on the coast, with wide outcrops and well-developed structural and mineralization elements. Multi-sensor (visible, multispectral, thermal, magnetic) data were acquired in the study area using the UAS and were analyzed using machine learning techniques. To apply the machine learning techniques, we used stratified random sampling to draw 1000 training samples in the hydrothermal zone and 1000 training samples in the non-hydrothermal zone identified through the field survey. The 2000 samples created for supervised learning were first split into 1500 for training and 500 for testing. Then, the 1500 training samples were split into 1200 for training and 300 for validation. The training and validation data were generated in five sets to enable cross-validation. Five types of machine learning techniques were applied to the training data sets: k-Nearest Neighbors (k-NN), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Deep Neural Network (DNN). In the integrated analysis of multi-sensor data using the five techniques, RF and SVM showed high classification accuracy of about 90%. Moreover, integrated analysis using multi-sensor data showed relatively higher classification accuracy for all five machine learning techniques than analyzing magnetic sensing data or single optical sensing data only.
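The split arithmetic above (2000 samples, 500 held out for testing, 1500 divided into 1200 for training and 300 for validation, in five sets) can be sketched as follows. This is a minimal sketch under assumptions: the abstract does not say how the five sets were formed, so a plain five-fold rotation over the 1500 samples is assumed here, and the function name, sample IDs, and seed are hypothetical.

```python
import random

def split_samples(n_per_class=1000, n_test=500, n_folds=5, seed=0):
    """Split labelled sample IDs into a test set and train/validation folds."""
    rng = random.Random(seed)
    # Hypothetical sample IDs: (zone label, index within that zone).
    samples = [(zone, i)
               for zone in ("hydrothermal", "non_hydrothermal")
               for i in range(n_per_class)]
    rng.shuffle(samples)

    test, pool = samples[:n_test], samples[n_test:]
    fold_size = len(pool) // n_folds  # 1500 // 5 = 300

    folds = []
    for f in range(n_folds):
        start, stop = f * fold_size, (f + 1) * fold_size
        validation = pool[start:stop]          # 300 samples
        training = pool[:start] + pool[stop:]  # remaining 1200 samples
        folds.append((training, validation))
    return test, folds

test, folds = split_samples()
```

Each of the five folds uses a different 300-sample slice for validation, which matches the 1200/300 figures reported in the abstract.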


Crystals ◽  
2021 ◽  
Vol 11 (10) ◽  
pp. 1218
Author(s):  
Natasha Dropka ◽  
Klaus Böttcher ◽  
Martin Holena

The aim of this study was to assess the ability of various data mining and supervised machine learning techniques (correlation analysis, k-means clustering, principal component analysis, and decision trees for regression and classification) to derive, optimize, and understand the factors influencing VGF-GaAs growth. Training data were generated by Computational Fluid Dynamics (CFD) simulations and consisted of 130 datasets with 6 inputs (growth rate and power of 5 heaters) and 5 outputs (interface position and deflection, and temperatures at various positions in GaAs). Data mining results confirmed a good dispersion of the training data without the feasibility of a dimensionality reduction. Data clustering was observed in relation to the position of the crystallization front relative to the side heaters. Based on the statistical performance criteria and training results, decision trees identified the most decisive inputs and their ranges for a favorable interface shape and to keep GaAs temperature beyond limits for heavy arsenic evaporation. Decision trees are a recommendable machine learning technique with short training times and acceptable predictive accuracy based on a small volume of CFD training data, capable of providing guidelines for understanding the crystal growth process, which is a prerequisite for the growth of low-cost, high-quality bulk crystals.


2020 ◽  
Vol 44 (7-8) ◽  
pp. 499-514
Author(s):  
Yi Zheng ◽  
Hyunjung Cheon ◽  
Charles M. Katz

This study explores advanced techniques in machine learning to develop a short tree-based adaptive classification test based on an existing lengthy instrument. A case study was carried out for an assessment of risk for juvenile delinquency. Two unique facts of this case are (a) the items in the original instrument measure a large number of distinctive constructs; (b) the target outcomes are of low prevalence, which renders the training data imbalanced. Due to the high dimensionality of the items, traditional item response theory (IRT)-based adaptive testing approaches may not work well, whereas decision trees, which are developed in the machine learning discipline, present a promising alternative solution for adaptive tests. A cross-validation study was carried out to compare eight tree-based adaptive test constructions with five benchmark methods using data from a sample of 3,975 subjects. The findings reveal that the best-performing tree-based adaptive tests yielded better classification accuracy than the benchmark method IRT scoring with optimal cutpoints, and yielded comparable or better classification accuracy than the best benchmark method, random forest with balanced sampling. The competitive classification accuracy of the tree-based adaptive tests also comes with an over 30-fold reduction in the length of the instrument, administering only between 3 and 6 items to any individual. This study suggests that tree-based adaptive tests have enormous potential when used to shorten instruments that measure a large variety of constructs.


Energies ◽  
2021 ◽  
Vol 14 (7) ◽  
pp. 1945
Author(s):  
Icksung Kim ◽  
Woohyun Kim

Fault detection and diagnosis (FDD) systems enable substantial cost and energy savings, with potential economic and environmental impact. This study aims to develop and validate a data-driven FDD system for a chiller. The system uses historical operation data to capture quantitative correlations among system variables. This study evaluated the effectiveness and robustness of eight FDD classification methods based on the experimental data of the chiller (the ASHRAE 1043-RP project). The training data used for the FDD system are classified into four cases. Moreover, true and false positive rates are used to characterize the performance of the classification methods. The results show that local faults are not significantly sensitive to the training data and show high classification accuracy for all cases. For system faults, the amount of data and the severity levels have a significant effect on the classification accuracy.
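For readers unfamiliar with the two metrics named above, the following sketch computes the true positive rate (detected faults over all actual faults) and the false positive rate (false alarms over all actual normal cases) from paired labels. The function name, the label strings, and the example data are hypothetical and are not taken from the ASHRAE 1043-RP dataset.

```python
def positive_rates(actual, predicted, positive="fault"):
    """Return (true positive rate, false positive rate) for one positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # fault correctly flagged
        elif a == positive:
            fn += 1  # fault missed
        elif p == positive:
            fp += 1  # false alarm on normal operation
        else:
            tn += 1  # normal operation correctly passed
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Hypothetical labels: 4 faulty and 4 normal operating records.
actual = ["fault"] * 4 + ["normal"] * 4
predicted = ["fault", "fault", "fault", "normal",
             "fault", "normal", "normal", "normal"]
tpr, fpr = positive_rates(actual, predicted)
```

Here three of four faults are detected and one of four normal records raises a false alarm, so the rates are 0.75 and 0.25.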


Author(s):  
Nina Narodytska ◽  
Alexey Ignatiev ◽  
Filipe Pereira ◽  
Joao Marques-Silva

Explanations of machine learning (ML) predictions are of fundamental importance in different settings. Moreover, explanations should be succinct, to enable easy understanding by humans. Decision trees represent a frequently used approach for developing explainable ML models, motivated by the natural mapping between decision tree paths and rules. Clearly, smaller trees correlate well with smaller rules, and so one challenge is to devise solutions for computing smallest-size decision trees given training data. Although simple to formulate, the computation of smallest-size decision trees turns out to be an extremely challenging computational problem, for which no practical solutions are known. This paper develops a SAT-based model for computing smallest-size decision trees given training data. In sharp contrast with past work, the proposed SAT model is shown to scale for publicly available datasets of practical interest.
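The mapping between decision tree paths and rules mentioned above can be illustrated with a small sketch: every root-to-leaf path becomes one rule whose conditions are the tests along the path. This is not the paper's SAT encoding; the nested-dictionary tree layout, the key names (`feature`, `if_true`, `if_false`, `label`), and the example tree are assumptions made for illustration.

```python
def tree_to_rules(node, conditions=()):
    """List (conditions, label) pairs, one per root-to-leaf path."""
    if "label" in node:
        # Leaf: the accumulated tests form the rule body.
        return [(list(conditions), node["label"])]
    feature = node["feature"]
    rules = []
    rules += tree_to_rules(node["if_false"], conditions + ((feature, False),))
    rules += tree_to_rules(node["if_true"], conditions + ((feature, True),))
    return rules

# Hypothetical tree over two binary features: predicts 1 only if x1 and x2 hold.
tree = {
    "feature": "x1",
    "if_false": {"label": 0},
    "if_true": {
        "feature": "x2",
        "if_false": {"label": 0},
        "if_true": {"label": 1},
    },
}
rules = tree_to_rules(tree)
```

The tree has three leaves and therefore yields three rules, and no rule is longer than the tree is deep, which is why smaller trees give shorter explanations.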


Author(s):  
Mohannad Elhamod ◽  
Kelly M. Diamond ◽  
A. Murat Maga ◽  
Yasin Bakis ◽  
Henry L. Bart ◽  
...  

Fish species classification is an important task that is the foundation of many industrial, commercial, ecological, and scientific applications involving the study of fish distributions, dynamics, and evolution. While conventional approaches for this task use off-the-shelf machine learning (ML) methods such as existing Convolutional Neural Network (ConvNet) architectures, there is an opportunity to inform the ConvNet architecture using our knowledge of biological hierarchies among taxonomic classes. In this work, we propose infusing phylogenetic information into the model’s training to guide its structure and relationships among the extracted features. In our extensive experimental analyses, the proposed model, named Hierarchy-Guided Neural Network (HGNN), outperforms conventional ConvNet models in terms of classification accuracy under scarce training data conditions. We also observe that HGNN shows better resilience to adversarial occlusions, when some of the most informative patch regions of the image are intentionally blocked and their effect on classification accuracy is studied.


2019 ◽  
Vol 25 (5) ◽  
pp. 18-24
Author(s):  
Predrag Teodorovic ◽  
Rastislav Struharik

This paper presents a hardware accelerator for sparse decision trees intended for FPGA applications. To the best of the authors’ knowledge, this is the first accelerator of this type. Besides the hardware accelerator itself, a novel algorithm for induction of sparse decision trees is also presented. Sparse decision trees can be attractive because they require fewer memory resources and can be processed more efficiently using specialized hardware than traditional oblique decision trees. This can be of significant interest, particularly in edge-based applications, where memory and compute resources as well as power consumption are severely constrained. The performance of the proposed sparse decision tree induction algorithm and of the developed hardware accelerator is studied using standard benchmark datasets obtained from the UCI Machine Learning Repository database. The results of the experimental study indicate that the proposed algorithm and hardware accelerator compare very favourably with some of the existing solutions.


A real-time crash predictor system determines the frequency and severity of crashes. Nowadays, machine learning based methods are used to predict the total number of crashes. In this project, the prediction accuracy of machine learning algorithms such as Decision Tree (DT), K-Nearest Neighbors (KNN), Random Forest (RF), and Logistic Regression (LR) is evaluated. The performance of these classification methods is evaluated in terms of accuracy. The dataset used for this project was obtained from 49 states of the US and 27 states of India and contains 2.25 million US crash records and 1.16 million Indian crash records, respectively. Results show that Random Forest (RF) achieves a classification accuracy of 96%, the highest among the classification methods compared.


2021 ◽  
Vol 111 (03) ◽  
pp. 124-129
Author(s):  
Markus Böhm ◽  
Klaus Erlach ◽  
Thomas Bauernhansl

Forecasts often form the basis for decisions on the shop floor. Today, forecasts in production are mostly derived from personal experience or digital models. With complex systems, this approach reaches the limits of reliability or is associated with a high expenditure of time. Classification methods of machine learning promise solutions for this. Automatically generated decision trees can be one way to generate near-real-time forecasts for key figures in production.


2020 ◽  
Vol 66 (6) ◽  
pp. 2495-2522
Author(s):  
Duncan Simester ◽  
Artem Timoshenko ◽  
Spyros I. Zoumpoulis

We investigate how firms can use the results of field experiments to optimize the targeting of promotions when prospecting for new customers. We evaluate seven widely used machine-learning methods using a series of two large-scale field experiments. The first field experiment generates a common pool of training data for each of the seven methods. We then validate the seven optimized policies provided by each method together with uniform benchmark policies in a second field experiment. The findings not only compare the performance of the targeting methods, but also demonstrate how well the methods address common data challenges. Our results reveal that when the training data are ideal, model-driven methods perform better than distance-driven methods and classification methods. However, the performance advantage vanishes in the presence of challenges that affect the quality of the training data, including the extent to which the training data captures details of the implementation setting. The challenges we study are covariate shift, concept shift, information loss through aggregation, and imbalanced data. Intuitively, the model-driven methods make better use of the information available in the training data, but the performance of these methods is more sensitive to deterioration in the quality of this information. The classification methods we tested performed relatively poorly. We explain the poor performance of the classification methods in our setting and describe how the performance of these methods could be improved. This paper was accepted by Matthew Shum, marketing.

