A branch & bound algorithm to determine optimal bivariate splits for oblique decision tree induction

Author(s):  
Ferdinand Bollwein ◽  
Stephan Westphal

Abstract: Univariate decision tree induction methods for multiclass classification problems such as CART, C4.5 and ID3 continue to be very popular in the context of machine learning due to their major benefit of being easy to interpret. However, as these trees only consider a single attribute per node, they often get quite large, which lowers their explanatory value. Oblique decision tree building algorithms, which divide the feature space by multidimensional hyperplanes, often produce much smaller trees, but the individual splits are hard to interpret. Moreover, the effort of finding optimal oblique splits is so high that heuristics have to be applied to determine locally optimal solutions. In this work, we introduce an effective branch and bound procedure to determine globally optimal bivariate oblique splits for concave impurity measures. Decision trees based on these bivariate oblique splits remain fairly interpretable due to the restriction to two attributes per split. The resulting trees are significantly smaller and more accurate than their univariate counterparts due to their ability to adapt better to the underlying data and to capture interactions of attribute pairs. Moreover, our evaluation shows that our algorithm even outperforms algorithms based on heuristically obtained multivariate oblique splits, despite the fact that we focus on two attributes only.
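To make the notion of a bivariate oblique split concrete, the following minimal sketch scores one candidate split of the form a*x1 + b*x2 <= c with the Gini impurity, which is one example of a concave impurity measure. It is not the authors' branch and bound procedure: it only evaluates a single given split, and the function names, the toy data, and the coefficients are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def bivariate_split_impurity(points, labels, a, b, c):
    """Weighted Gini impurity of the split a*x1 + b*x2 <= c (hypothetical helper)."""
    left = [y for (x1, x2), y in zip(points, labels) if a * x1 + b * x2 <= c]
    right = [y for (x1, x2), y in zip(points, labels) if a * x1 + b * x2 > c]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Illustrative toy data: the two classes lie on either side of the line
# x1 + x2 = 1, and neither attribute alone separates them with one threshold.
points = [(0.0, 0.0), (0.8, 0.0), (0.0, 0.8), (1.0, 1.0), (0.3, 0.9), (0.9, 0.3)]
labels = ["a", "a", "a", "b", "b", "b"]

# Oblique split on both attributes: x1 + x2 <= 1.
oblique = bivariate_split_impurity(points, labels, 1.0, 1.0, 1.0)

# Univariate split expressed in the same form (b = 0): x1 <= 0.5.
univariate = bivariate_split_impurity(points, labels, 1.0, 0.0, 0.5)

print(f"oblique split impurity:    {oblique:.3f}")
print(f"univariate split impurity: {univariate:.3f}")
```

On this toy data the oblique split yields two pure child nodes, while the univariate split leaves both children mixed. A univariate tree would need further splits to reach the same partition.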

2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Wan-Wei Fan ◽  
Ching-Hung Lee

This paper proposes a method for classifying imbalanced data by adding noise to the feature space of a convolutional neural network (CNN) without changing the data set (the ratio of majority to minority data). In addition, a hybrid loss function combining cross-entropy and KL divergence is proposed. The proposed approach can improve the accuracy of the minority class in the testing data. A simple design method for selecting the structure of the CNN is introduced first; noise is then added in the feature space of the CNN to obtain proper features through training and to improve the classification results. The comparison results show that the proposed method can extract suitable features to improve the accuracy of the minority class. Illustrated examples of multiclass classification problems and a corresponding discussion of the balance ratio are presented. Our approach performs well with a smaller network structure compared with other deep models. In addition, the noise-adding approach improves defective accuracy by over 40%. Finally, the accuracy is higher than 96% even when the imbalanced ratio (IR) is one hundred.
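The abstract does not state how cross-entropy and KL divergence are combined, so the sketch below only illustrates the two ingredients and the identity that links them, H(p, q) = H(p) + KL(p || q). The weighted sum in `hybrid_loss` and its weight `alpha` are assumptions for illustration, not the loss used in the paper.

```python
import math


def cross_entropy(p, q):
    """Cross-entropy H(p, q) between a target p and a prediction q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid_loss(p, q, alpha=0.5):
    """Hypothetical weighted combination of the two terms (alpha is an assumption)."""
    return alpha * cross_entropy(p, q) + (1.0 - alpha) * kl_divergence(p, q)


# Illustrative distributions over three classes (made-up values).
target = [0.7, 0.2, 0.1]
predicted = [0.5, 0.3, 0.2]

ce = cross_entropy(target, predicted)
kl = kl_divergence(target, predicted)
target_entropy = cross_entropy(target, target)  # H(p, p) equals the entropy H(p)

print(f"cross-entropy: {ce:.4f}")
print(f"KL divergence: {kl:.4f}")
print(f"entropy + KL:  {target_entropy + kl:.4f}")
print(f"hybrid loss:   {hybrid_loss(target, predicted):.4f}")
```

Because the entropy of a one-hot target is zero, the two terms coincide for hard labels. A combination of this kind therefore only differs from plain cross-entropy when at least one term is computed against a soft distribution.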


2013 ◽  
Vol 21 (4) ◽  
pp. 659-684 ◽  
Author(s):  
Rodrigo C. Barros ◽  
Márcio P. Basgalupp ◽  
André C. P. L. F. de Carvalho ◽  
Alex A. Freitas

This study reports the empirical analysis of a hyper-heuristic evolutionary algorithm that is capable of automatically designing top-down decision-tree induction algorithms. Top-down decision-tree algorithms are of great importance, considering their ability to provide an intuitive and accurate knowledge representation for classification problems. The automatic design of these algorithms seems timely, given the large literature accumulated over more than 40 years of research in the manual design of decision-tree induction algorithms. The proposed hyper-heuristic evolutionary algorithm, HEAD-DT, is extensively tested using 20 public UCI datasets and 10 microarray gene expression datasets. The algorithms automatically designed by HEAD-DT are compared with traditional decision-tree induction algorithms, such as C4.5 and CART. Experimental results show that HEAD-DT is capable of generating algorithms which are significantly more accurate than C4.5 and CART.


Author(s):  
N.V. Rudakov ◽  
N.A. Penyevskaya ◽  
D.A. Saveliev ◽  
S.A. Rudakova ◽  
C.V. Shtrek ◽  
...  

Research objective. Differentiation of natural focal areas of Western Siberia by integral incidence rates of tick-borne infectious diseases for determination of the strategy and tactics of their comprehensive prevention. Materials and methods. A retrospective analysis of official statistics for the period 2002-2018 for eight sub-federal units in the context of administrative territories was carried out. The criteria of differentiation were determined by means of three evaluation scales, including long-term mean rates of tick-borne encephalitis, tick-borne borreliosis, and Siberian tick-borne typhus. As a scale gradation tool, we used the number of sample elements between the confidence boundaries of the median. The integral assessment was carried out by the sum of points corresponding to the incidence rates for each of the analyzed infections. Results. The areas of low, medium, above average, high and very high risk of tick-borne infectious diseases were determined. Recommendations on the choice of prevention strategy and tactics were given. In areas of very high and high incidence rates, a combination of population-based and individual prevention strategies is preferable while in other areas a combination of high-risk and individual strategies is recommended. Discussion. Epidemiologic zoning should be the basis of a risk-based approach to determining optimal volumes and directions of preventive measures against natural focal infections. It is necessary to improve the means and methods of determining the individual risk of getting infected and developing tick-borne infectious diseases in case of bites, in view of mixed infection of vectors, as well as methods of post-exposure disease prevention (preventive therapy).


1999 ◽  
Vol 17 (11) ◽  
pp. 3603-3611 ◽  
Author(s):  
Dympna Waldron ◽  
Ciaran A. O'Boyle ◽  
Michael Kearney ◽  
Michael Moriarty ◽  
Desmond Carney

PURPOSE: Despite the increasing importance of assessing quality of life (QoL) in patients with advanced cancer, relatively little is known about individual patients' perceptions of the issues contributing to their QoL. The Schedule for the Evaluation of Individual Quality of Life (SEIQoL) and the shorter SEIQoL–Direct Weighting (SEIQoL-DW) assess individualized QoL using a semistructured interview technique. Here we report findings from the first administration of the SEIQoL and SEIQoL-DW to patients with advanced incurable cancer. PATIENTS AND METHODS: QoL was assessed on a single occasion using the SEIQoL and SEIQoL-DW in 80 patients with advanced incurable cancer. RESULTS: All patients were able to complete the SEIQoL-DW, and 78% completed the SEIQoL. Of a possible score of 100, the median QoL global score was as follows: SEIQoL, 61 (range, 24 to 94); SEIQoL-DW, 60.5 (range, 6 to 95). Psychometric data for SEIQoL indicated very high levels of internal consistency (median r = .90) and internal validity (median R² = 0.88). Patients' judgments of their QoL were unique to the individual. Family concerns were almost universally rated as more important than health, the difference being significant when measured using the SEIQoL-DW (P = .002). CONCLUSION: Patients with advanced incurable cancer were very good judges of their QoL, and many patients rated their QoL as good. Judgments were highly individual, with very high levels of consistency and validity. The primacy given to health in many QoL questionnaires may be questioned in this population. The implications of these findings are discussed with regard to clinical assessment and advance directives.


2021 ◽  
Vol 11 (9) ◽  
pp. 4280
Author(s):  
Iurii Katser ◽  
Viacheslav Kozitsin ◽  
Victor Lobachev ◽  
Ivan Maksimov

Offline changepoint detection (CPD) algorithms are used to segment signals in an optimal way. Generally, these algorithms are based on the assumption that the signal's changed statistical properties are known and that the appropriate models (metrics, cost functions) for changepoint detection are used. Otherwise, the process of proper model selection can become laborious and time-consuming, with uncertain results. Although an ensemble approach is well known for increasing the robustness of the individual algorithms and dealing with the mentioned challenges, it is weakly formalized and much less highlighted for CPD problems than for outlier detection or classification problems. This paper proposes an unsupervised CPD ensemble (CPDE) procedure with the pseudocode of the particular proposed ensemble algorithms and the link to their Python realization. The approach's novelty is in aggregating several cost functions before running the changepoint search procedure during the offline analysis. The numerical experiment showed that the proposed CPDE outperforms non-ensemble CPD procedures. Additionally, we focused on analyzing common CPD algorithms, scaling, and aggregation functions, comparing them during the numerical experiment. The results were obtained on two anomaly benchmarks that contain industrial faults and failures: Tennessee Eastman Process (TEP) and Skoltech Anomaly Benchmark (SKAB). One possible application of our research is the estimation of the failure time for fault identification and isolation problems in technical diagnostics.
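The idea of aggregating several cost functions before the search can be sketched for the simplest case of a single changepoint. This is not the authors' CPDE implementation: the two cost functions, the min-max scaling, the sum aggregation, and all names below are assumptions chosen for illustration.

```python
from statistics import mean, median


def cost_l2(segment):
    """Sum of squared deviations from the segment mean."""
    m = mean(segment)
    return sum((x - m) ** 2 for x in segment)


def cost_l1(segment):
    """Sum of absolute deviations from the segment median."""
    m = median(segment)
    return sum(abs(x - m) for x in segment)


def min_max_scale(values):
    """Scale values to [0, 1] so that different costs become comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ensemble_changepoint(signal, cost_functions, min_size=2):
    """Return the split index that minimizes the aggregated, scaled cost."""
    candidates = list(range(min_size, len(signal) - min_size + 1))
    scaled = []
    for cost in cost_functions:
        # Total cost of splitting the signal at each candidate index.
        totals = [cost(signal[:t]) + cost(signal[t:]) for t in candidates]
        scaled.append(min_max_scale(totals))
    # Aggregate across cost functions first, then search once.
    aggregated = [sum(column) for column in zip(*scaled)]
    best = min(range(len(candidates)), key=aggregated.__getitem__)
    return candidates[best]


# Illustrative signal with a mean shift starting at index 5 (made-up values).
signal = [0.1, -0.2, 0.0, 0.2, -0.1, 5.1, 4.9, 5.0, 5.2, 4.8]
changepoint = ensemble_changepoint(signal, [cost_l2, cost_l1])
print(changepoint)
```

Scaling before aggregation keeps one cost function from dominating the sum merely because its values are larger in magnitude.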


2021 ◽  
Vol 54 (1) ◽  
pp. 1-38
Author(s):  
Víctor Adrián Sosa Hernández ◽  
Raúl Monroy ◽  
Miguel Angel Medina-Pérez ◽  
Octavio Loyola-González ◽  
Francisco Herrera

Experts from different domains have resorted to machine learning techniques to produce explainable models that support decision-making. Among existing techniques, decision trees have been useful in many application domains for classification. Decision trees can make decisions in a language that is closer to that of the experts. Many researchers have attempted to create better decision tree models by improving the components of the induction algorithm. One of the main components that have been studied and improved is the evaluation measure for candidate splits. In this article, we introduce a tutorial that explains decision tree induction. Then, we present an experimental framework to assess the performance of 21 evaluation measures that produce different C4.5 variants considering 110 databases, two performance measures, and 10 × 10-fold cross-validation. Furthermore, we compare and rank the evaluation measures by using a Bayesian statistical analysis. From our experimental results, we present the first two performance rankings in the literature of C4.5 variants. Moreover, we organize the evaluation measures into two groups according to their performance. Finally, we introduce meta-models that automatically determine the group of evaluation measures to produce a C4.5 variant for a new database and some further opportunities for decision tree models.
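As an illustration of what an evaluation measure for candidate splits computes, the sketch below implements information gain and the gain ratio used by standard C4.5. The abstract does not list the 21 measures it compares, so these two are shown only as well-known examples; the function names and toy labels are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Reduction in entropy achieved by partitioning parent into children."""
    n = len(parent)
    remainder = sum(len(ch) / n * entropy(ch) for ch in children if ch)
    return entropy(parent) - remainder


def gain_ratio(parent, children):
    """Information gain normalized by the entropy of the partition sizes."""
    n = len(parent)
    split_info = -sum(
        len(ch) / n * math.log2(len(ch) / n) for ch in children if ch
    )
    if split_info == 0:
        return 0.0
    return information_gain(parent, children) / split_info


# Illustrative labels at a node and two candidate binary splits.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]
mixed_split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]

for name, split in [("pure", pure_split), ("mixed", mixed_split)]:
    print(name, round(information_gain(parent, split), 4),
          round(gain_ratio(parent, split), 4))
```

An induction algorithm evaluates such a measure for every candidate split at a node and keeps the best-scoring one. Swapping the measure is what produces the different C4.5 variants.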

