A branch & bound algorithm to determine optimal bivariate splits for oblique decision tree induction

Applied Intelligence ◽

10.1007/s10489-021-02281-x ◽

2021 ◽

Author(s):

Ferdinand Bollwein ◽

Stephan Westphal

Keyword(s):

Decision Tree ◽

Feature Space ◽

Classification Problems ◽

Decision Tree Induction ◽

Single Attribute ◽

Global Optimal ◽

The Individual ◽

Tree Building ◽

Very High ◽

Multiclass Classification Problems

Univariate decision tree induction methods for multiclass classification problems, such as CART, C4.5 and ID3, continue to be very popular in machine learning because they are easy to interpret. However, as these trees consider only a single attribute per node, they often become quite large, which lowers their explanatory value. Oblique decision tree building algorithms, which divide the feature space by multidimensional hyperplanes, often produce much smaller trees, but the individual splits are hard to interpret. Moreover, the effort of finding optimal oblique splits is very high, so heuristics have to be applied to determine locally optimal solutions. In this work, we introduce an effective branch and bound procedure to determine globally optimal bivariate oblique splits for concave impurity measures. Decision trees based on these bivariate oblique splits remain fairly interpretable due to the restriction to two attributes per split. The resulting trees are significantly smaller and more accurate than their univariate counterparts because they adapt better to the underlying data and capture interactions between attribute pairs. Moreover, our evaluation shows that our algorithm even outperforms algorithms based on heuristically obtained multivariate oblique splits, despite focusing on only two attributes.
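To make the notion of a bivariate oblique split concrete, the following minimal sketch scores one candidate split of the form a*x[i] + b*x[j] <= c with the Gini impurity, a common concave impurity measure. It only evaluates a given split; it is not the branch and bound search described in the abstract. The function names, the toy data and the coefficients are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (a concave impurity measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def bivariate_split_impurity(X, y, i, j, a, b, c):
    """Weighted Gini impurity of the split a*x[i] + b*x[j] <= c.

    Hypothetical helper: it scores a single candidate split and does
    not search for the best one.
    """
    left = [label for row, label in zip(X, y) if a * row[i] + b * row[j] <= c]
    right = [label for row, label in zip(X, y) if a * row[i] + b * row[j] > c]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy data: class "A" lies below the diagonal x0 + x1 = 1, class "B" above it.
X = [(0.0, 0.0), (0.2, 0.6), (0.6, 0.2), (1.0, 1.0), (0.4, 0.9), (0.9, 0.4)]
y = ["A", "A", "A", "B", "B", "B"]

# Oblique split on attributes 0 and 1: x0 + x1 <= 1.
oblique = bivariate_split_impurity(X, y, 0, 1, 1.0, 1.0, 1.0)
# Univariate split on attribute 0 alone: x0 <= 0.5 (b = 0 disables x1).
univariate = bivariate_split_impurity(X, y, 0, 1, 1.0, 0.0, 0.5)
print(oblique, univariate)
```

On this toy data the oblique split separates the two classes, while the single-attribute split leaves both children mixed, which illustrates why oblique trees can be smaller.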


Classification of Imbalanced Data Using Deep Learning with Adding Noise

Journal of Sensors ◽

10.1155/2021/1735386 ◽

2021 ◽

Vol 2021 ◽

pp. 1-18

Author(s):

Wan-Wei Fan ◽

Ching-Hung Lee

Keyword(s):

Design Method ◽

Imbalanced Data ◽

Feature Space ◽

Classification Problems ◽

Data Set ◽

Minority Class ◽

Comparison Results ◽

Testing Data ◽

Multiclass Classification Problems

This paper proposes a method for classifying imbalanced data by adding noise to the feature space of a convolutional neural network (CNN) without changing the data set (that is, the ratio of majority to minority data). In addition, a hybrid loss function combining cross-entropy and KL divergence is proposed. The proposed approach can improve the accuracy on the minority class in the testing data. A simple design method for selecting the structure of the CNN is introduced first; noise is then added in the feature space of the CNN to obtain proper features through training and to improve the classification results. The comparison results show that the proposed method can extract suitable features that improve the accuracy on the minority class. Illustrative examples of multiclass classification problems and a corresponding discussion of the balance ratio are also presented. Our approach performs well with a smaller network structure than other deep models. In addition, the noise-adding approach improves the defective accuracy by over 40%. Finally, the accuracy is higher than 96% even when the imbalance ratio (IR) is one hundred.
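The abstract does not say how the noise is injected or how the two loss terms are combined, so the following is only a rough sketch under stated assumptions: Gaussian noise is added to a feature vector, and the loss is the cross-entropy of the noisy prediction plus a weighted KL divergence between the clean and noisy predictions. The weight `lam`, the noise scale `sigma` and all function names are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of predicted probabilities against a class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def add_noise(features, sigma, rng):
    """Add zero-mean Gaussian noise to a feature vector (assumed noise model)."""
    return [f + rng.gauss(0.0, sigma) for f in features]


def hybrid_loss(logits_clean, logits_noisy, label, lam=0.5):
    """Assumed combination: CE on the noisy branch + lam * KL(clean || noisy)."""
    p_clean = softmax(logits_clean)
    p_noisy = softmax(logits_noisy)
    return cross_entropy(p_noisy, label) + lam * kl_divergence(p_clean, p_noisy)


rng = random.Random(0)
features = [2.0, 0.5, -1.0]  # stand-in for the logits of a 3-class model
noisy = add_noise(features, sigma=0.1, rng=rng)
loss = hybrid_loss(features, noisy, label=0)
print(loss)
```

With identical clean and noisy inputs the KL term vanishes and the loss reduces to plain cross-entropy, which is a quick sanity check for any variant of this combination.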


Automatic Design of Decision-Tree Algorithms with Evolutionary Algorithms

Evolutionary Computation ◽

10.1162/evco_a_00101 ◽

2013 ◽

Vol 21 (4) ◽

pp. 659-684 ◽

Cited By ~ 23

Author(s):

Rodrigo C. Barros ◽

Márcio P. Basgalupp ◽

André C. P. L. F. de Carvalho ◽

Alex A. Freitas

Keyword(s):

Decision Tree ◽

Evolutionary Algorithm ◽

Automatic Design ◽

Classification Problems ◽

Top Down ◽

Microarray Gene Expression ◽

Decision Tree Induction ◽

Accurate Knowledge ◽

Tree Algorithms ◽

The Empirical Analysis

This study reports the empirical analysis of a hyper-heuristic evolutionary algorithm that is capable of automatically designing top-down decision-tree induction algorithms. Top-down decision-tree algorithms are of great importance, considering their ability to provide an intuitive and accurate knowledge representation for classification problems. The automatic design of these algorithms seems timely, given the large literature accumulated over more than 40 years of research in the manual design of decision-tree induction algorithms. The proposed hyper-heuristic evolutionary algorithm, HEAD-DT, is extensively tested using 20 public UCI datasets and 10 microarray gene expression datasets. The algorithms automatically designed by HEAD-DT are compared with traditional decision-tree induction algorithms, such as C4.5 and CART. Experimental results show that HEAD-DT is capable of generating algorithms which are significantly more accurate than C4.5 and CART.


Differentiation of endemic areas by incidence rates of tick-borne infectious diseases as the basis for choosing prevention strategy and tactics

ЗДОРОВЬЕ НАСЕЛЕНИЯ И СРЕДА ОБИТАНИЯ - ЗНиСО / PUBLIC HEALTH AND LIFE ENVIRONMENT ◽

10.35627/2219-5238/2019-321-12-56-61 ◽

2019 ◽

pp. 56-61

Author(s):

N.V. Rudakov ◽

N.A. Penyevskaya ◽

D.A. Saveliev ◽

S.A. Rudakova ◽

C.V. Shtrek ◽

...

Keyword(s):

High Risk ◽

Infectious Diseases ◽

Population Based ◽

Preventive Therapy ◽

Incidence Rates ◽

Prevention Strategy ◽

Individual Risk ◽

The Individual ◽

Integral Assessment ◽

Very High

Research objective. Differentiation of natural focal areas of Western Siberia by integral incidence rates of tick-borne infectious diseases for determination of the strategy and tactics of their comprehensive prevention. Materials and methods. A retrospective analysis of official statistics for the period 2002-2018 for eight sub-federal units in the context of administrative territories was carried out. The criteria of differentiation were determined by means of three evaluation scales, including long-term mean rates of tick-borne encephalitis, tick-borne borreliosis, and Siberian tick-borne typhus. As a scale gradation tool, we used the number of sample elements between the confidence boundaries of the median. The integral assessment was carried out by the sum of points corresponding to the incidence rates for each of the analyzed infections. Results. The areas of low, medium, above average, high and very high risk of tick-borne infectious diseases were determined. Recommendations on the choice of prevention strategy and tactics were given. In areas of very high and high incidence rates, a combination of population-based and individual prevention strategies is preferable while in other areas a combination of high-risk and individual strategies is recommended. Discussion. Epidemiologic zoning should be the basis of a risk-based approach to determining optimal volumes and directions of preventive measures against natural focal infections. It is necessary to improve the means and methods of determining the individual risk of getting infected and developing tick-borne infectious diseases in case of bites, in view of mixed infection of vectors, as well as methods of post-exposure disease prevention (preventive therapy).


Quality-of-Life Measurement in Advanced Cancer: Assessing the Individual

Journal of Clinical Oncology ◽

10.1200/jco.1999.17.11.3603 ◽

1999 ◽

Vol 17 (11) ◽

pp. 3603-3611 ◽

Cited By ~ 130

Author(s):

Dympna Waldron ◽

Ciaran A. O'Boyle ◽

Michael Kearney ◽

Michael Moriarty ◽

Desmond Carney

Keyword(s):

Quality Of Life ◽

Advanced Cancer ◽

Internal Validity ◽

Semistructured Interview ◽

Incurable Cancer ◽

Life Measurement ◽

The Difference ◽

The Individual ◽

Very High

PURPOSE: Despite the increasing importance of assessing quality of life (QoL) in patients with advanced cancer, relatively little is known about individual patients' perceptions of the issues contributing to their QoL. The Schedule for the Evaluation of Individual Quality of Life (SEIQoL) and the shorter SEIQoL–Direct Weighting (SEIQoL-DW) assess individualized QoL using a semistructured interview technique. Here we report findings from the first administration of the SEIQoL and SEIQoL-DW to patients with advanced incurable cancer. PATIENTS AND METHODS: QoL was assessed on a single occasion using the SEIQoL and SEIQoL-DW in 80 patients with advanced incurable cancer. RESULTS: All patients were able to complete the SEIQoL-DW, and 78% completed the SEIQoL. Of a possible score of 100, the median QoL global score was as follows: SEIQoL, 61 (range, 24 to 94); SEIQoL-DW, 60.5 (range, 6 to 95). Psychometric data for SEIQoL indicated very high levels of internal consistency (median r = .90) and internal validity (median R² = 0.88). Patients' judgments of their QoL were unique to the individual. Family concerns were almost universally rated as more important than health, the difference being significant when measured using the SEIQoL-DW (P = .002). CONCLUSION: Patients with advanced incurable cancer were very good judges of their QoL, and many patients rated their QoL as good. Judgments were highly individual, with very high levels of consistency and validity. The primacy given to health in many QoL questionnaires may be questioned in this population. The implications of these findings are discussed with regard to clinical assessment and advance directives.


Unsupervised Offline Changepoint Detection Ensembles

Applied Sciences ◽

10.3390/app11094280 ◽

2021 ◽

Vol 11 (9) ◽

pp. 4280

Author(s):

Iurii Katser ◽

Viacheslav Kozitsin ◽

Victor Lobachev ◽

Ivan Maksimov

Keyword(s):

Numerical Experiment ◽

Failure Time ◽

Cost Functions ◽

Technical Diagnostics ◽

Classification Problems ◽

Tennessee Eastman Process ◽

Changepoint Detection ◽

Aggregation Functions ◽

The Individual ◽

Ensemble Algorithms

Offline changepoint detection (CPD) algorithms are used to segment signals optimally. Generally, these algorithms assume that the statistical properties of the signal that change are known, so that appropriate models (metrics, cost functions) for changepoint detection can be used. Otherwise, selecting a proper model can become laborious and time-consuming, with uncertain results. Although the ensemble approach is well known for increasing the robustness of individual algorithms and dealing with the challenges mentioned, it is weakly formalized and has received much less attention for CPD problems than for outlier detection or classification problems. This paper proposes an unsupervised CPD ensemble (CPDE) procedure, with pseudocode for the proposed ensemble algorithms and a link to their Python implementation. The novelty of the approach lies in aggregating several cost functions before running the changepoint search procedure during offline analysis. The numerical experiment showed that the proposed CPDE outperforms non-ensemble CPD procedures. Additionally, we analyzed common CPD algorithms, scaling functions, and aggregation functions, comparing them in the numerical experiment. The results were obtained on two anomaly benchmarks that contain industrial faults and failures: the Tennessee Eastman Process (TEP) and the Skoltech Anomaly Benchmark (SKAB). One possible application of our research is the estimation of failure time for fault identification and isolation problems in technical diagnostics.
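The idea of aggregating several cost functions before the search can be sketched for the simplest case of a single changepoint in a one-dimensional signal. The two cost functions, the min-max scaling and the sum aggregation below are illustrative choices, not the specific functions evaluated in the paper, and all names are assumptions.

```python
from statistics import mean, median


def cost_l2(segment):
    """Sum of squared deviations from the segment mean."""
    m = mean(segment)
    return sum((x - m) ** 2 for x in segment)


def cost_l1(segment):
    """Sum of absolute deviations from the segment median."""
    m = median(segment)
    return sum(abs(x - m) for x in segment)


def split_costs(signal, cost, min_size=2):
    """Total cost of splitting at each candidate index t (left = signal[:t])."""
    last = len(signal) - min_size
    return {t: cost(signal[:t]) + cost(signal[t:]) for t in range(min_size, last + 1)}


def min_max_scale(costs):
    """Rescale one cost curve to [0, 1] so curves become comparable."""
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0  # avoid division by zero on a flat curve
    return {t: (c - lo) / span for t, c in costs.items()}


def ensemble_changepoint(signal, cost_functions, min_size=2):
    """Scale each cost curve, sum them, then search once on the aggregate."""
    scaled = [min_max_scale(split_costs(signal, c, min_size)) for c in cost_functions]
    aggregated = {t: sum(curve[t] for curve in scaled) for t in scaled[0]}
    return min(aggregated, key=aggregated.get)


signal = [0.1, -0.2, 0.0, 0.2, -0.1, 5.1, 4.9, 5.0, 5.2, 4.8]
change = ensemble_changepoint(signal, [cost_l2, cost_l1])
print(change)
```

The key design point is that the search runs once on the aggregated curve, rather than running one search per cost function and voting on the results afterwards.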


Embedded Feature Construction in Fuzzy Decision Tree Induction for High Energy Physics Classification

2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) ◽

10.1109/smc42975.2020.9283103 ◽

2020 ◽

Author(s):

Noelie Cherrier ◽

Jean-Philippe Poli ◽

Maxime Defurne ◽

Franck Sabatie

Keyword(s):

Decision Tree ◽

High Energy Physics ◽

High Energy ◽

Feature Construction ◽

Fuzzy Decision ◽

Fuzzy Decision Tree ◽

Decision Tree Induction ◽

Energy Physics


A Hybrid Decision Tree-Neural Network (DT-NN) Model for Large-Scale Classification Problems

2020 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata50022.2020.9378061 ◽

2020 ◽

Author(s):

Jarrod Carson ◽

Kane Hollingsworth ◽

Rituparna Datta ◽

George Clark ◽

Aviv Segev

Keyword(s):

Neural Network ◽

Decision Tree ◽

Large Scale ◽

Classification Problems ◽

Scale Classification


A Practical Tutorial for Decision Tree Induction

ACM Computing Surveys ◽

10.1145/3429739 ◽

2021 ◽

Vol 54 (1) ◽

pp. 1-38

Author(s):

Víctor Adrián Sosa Hernández ◽

Raúl Monroy ◽

Miguel Angel Medina-Pérez ◽

Octavio Loyola-González ◽

Francisco Herrera

Keyword(s):

Decision Tree ◽

Decision Trees ◽

Machine Learning Techniques ◽

Evaluation Measures ◽

Decision Tree Induction ◽

Learning Techniques ◽

Tree Models ◽

Evaluation Measure ◽

Main Components ◽

Support Decision Making

Experts from different domains have resorted to machine learning techniques to produce explainable models that support decision-making. Among existing techniques, decision trees have been useful in many application domains for classification. Decision trees can make decisions in a language that is closer to that of the experts. Many researchers have attempted to create better decision tree models by improving the components of the induction algorithm. One of the main components that have been studied and improved is the evaluation measure for candidate splits. In this article, we introduce a tutorial that explains decision tree induction. Then, we present an experimental framework to assess the performance of 21 evaluation measures that produce different C4.5 variants considering 110 databases, two performance measures, and 10 × 10-fold cross-validation. Furthermore, we compare and rank the evaluation measures by using a Bayesian statistical analysis. From our experimental results, we present the first two performance rankings in the literature of C4.5 variants. Moreover, we organize the evaluation measures into two groups according to their performance. Finally, we introduce meta-models that automatically determine the group of evaluation measures to produce a C4.5 variant for a new database and some further opportunities for decision tree models.
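As an illustration of what an evaluation measure for candidate splits computes, the sketch below implements information gain and the gain ratio, the measure used by standard C4.5. It is a minimal example with assumed function names and toy labels; it does not reproduce the 21 measures or the experimental framework of the article.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy reduction achieved by partitioning parent into children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain normalized by the entropy of the partition sizes."""
    n = len(parent)
    split_info = -sum(
        (len(child) / n) * math.log2(len(child) / n)
        for child in children
        if child
    )
    if split_info == 0.0:
        return 0.0  # a split that keeps all samples in one child is useless
    return information_gain(parent, children) / split_info


parent = ["a", "a", "b", "b"]
perfect = [["a", "a"], ["b", "b"]]   # children are pure
useless = [["a", "b"], ["a", "b"]]   # children mirror the parent

print(gain_ratio(parent, perfect), gain_ratio(parent, useless))
```

A tree inducer would compute such a score for every candidate split at a node and keep the best one; swapping in a different measure changes only this scoring function.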


Discriminant functions and decision tree induction techniques for antenatal fetal risk assessment

IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222) ◽

10.1109/ijcnn.2001.938801 ◽

2002 ◽

Author(s):

N. Guler ◽

O.T. Yildiz ◽

F. Gurgen ◽

F. Varol ◽

E. Alpaydin

Keyword(s):

Risk Assessment ◽

Decision Tree ◽

Discriminant Functions ◽

Fetal Risk ◽

Decision Tree Induction


Support vector machine based decision tree for very high resolution multispectral forest mapping

2011 IEEE International Geoscience and Remote Sensing Symposium ◽

10.1109/igarss.2011.6048893 ◽

2011 ◽

Cited By ~ 7

Author(s):

Petra Krahwinkler ◽

Juergen Rossmann ◽

Bjoern Sondermann

Keyword(s):

Support Vector Machine ◽

High Resolution ◽

Decision Tree ◽

Support Vector ◽

Forest Mapping ◽

Very High
