A combined nonparametric approach to feature selection and binary decision tree design

1980 ◽  
Vol 12 (5) ◽  
pp. 313-317 ◽  
Author(s):  
E.M. Rounds
2021 ◽  
Author(s):  
Nicodemus Nzoka Maingi ◽  
Ismail Ateya Lukandu ◽  
Matilu MWAU

Abstract
Background: The disease outbreak management operations of most countries (notably Kenya) present numerous novel ideas on how best to use notifiable disease data to effect proactive interventions. Notifiable disease data is reported, aggregated and variously consumed. Over the years, there has been a deluge of notifiable disease data, and the challenge for notifiable disease data management entities has been how to aggregate such data objectively and dynamically, in a way that enables efficient consumption and informs relevant mitigation measures. Various models have been explored, tried and tested with varying results; some purely mathematical and statistical, others quasi-mathematical and software model-driven.
Methods: One of the tools that has been explored is Artificial Intelligence (AI). AI is a technique that enables computers to intelligently perform and mimic actions and tasks usually reserved for human experts. AI presents a great opportunity for redefining how the data is more meaningfully processed and packaged. This research explores AI’s Machine Learning (ML) theory as a differentiator in the crunching of notifiable disease data and adding perspective. An algorithm has been designed to test different notifiable disease outbreak data cases, marking a shift toward managing disease outbreaks via the symptoms they generally manifest. Each notifiable disease is broken down into a set of symptoms, dubbed symptom burden variables, and consequently categorized into eight clusters: Bodily, Gastro-Intestinal, Muscular, Nasal, Pain, Respiratory, Skin, and finally, Other Symptom Clusters.
ML’s decision tree theory has been utilized in the determination of the entropies and information gains of each symptom cluster based on select test data sets.
Results: Once the entropies and information gains have been determined, the variables are ranked in descending order of information gain, from the highest to the lowest, thereby giving a clear-cut criterion for how the variables are ordered. The ranked variables are then utilized in the construction of a binary decision tree, which graphically and structurally represents the variables. Should any variables tie in the information gain rankings, they are given equal importance in the construction of the binary decision tree. From the presented data, the computed information gains are ordered as: Gastro-Intestinal, Bodily, Pain, Skin, Respiratory, Others, Muscular, and finally Nasal Symptoms. The corresponding binary decision tree is then constructed.
Conclusions: The algorithm successfully singles out the disease burden variable(s) that are most critical as the point of diagnostic focus, enabling the relevant authorities to take the necessary, informed interventions. This algorithm provides a good basis for a country’s localized diagnostic activities driven by data from the reported notifiable disease cases. The algorithm presents a dynamic mechanism that can be used to analyze and aggregate any notifiable disease data set, meaning that the algorithm is not fixated or locked on any particular data set.
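
The ranking step described in this abstract can be illustrated with a minimal sketch using the standard definitions of Shannon entropy and information gain. This is not the authors' implementation: the function names, the record layout, the `outbreak` target column and the toy records are all hypothetical, chosen only to show how variables would be ordered by gain.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_by_gain(rows, features, target):
    """Return (feature, gain) pairs sorted from highest to lowest gain."""
    gains = {f: information_gain(rows, f, target) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records: 1 means the symptom cluster was reported.
# These values are invented for illustration and are not from the study.
CASES = [
    {"gastro_intestinal": 1, "nasal": 0, "outbreak": 1},
    {"gastro_intestinal": 1, "nasal": 1, "outbreak": 1},
    {"gastro_intestinal": 0, "nasal": 0, "outbreak": 0},
    {"gastro_intestinal": 0, "nasal": 1, "outbreak": 0},
]

ranking = rank_by_gain(CASES, ["nasal", "gastro_intestinal"], "outbreak")
for feature, gain in ranking:
    print(f"{feature}: {gain:.3f}")
```

In this toy data the gastro-intestinal cluster separates the target perfectly and ranks first, while the nasal cluster carries no information and ranks last. The sketch keeps tied variables in their input order; how ties are handled when building the tree is left open here, since the abstract states only that tied variables receive equal importance.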


Entropy ◽  
2019 ◽  
Vol 21 (4) ◽  
pp. 334 ◽  
Author(s):  
Georgios Feretzakis ◽  
Dimitris Kalles ◽  
Vassilios Verykios

The sharing of data among organizations has become an increasingly common procedure in several areas like banking, electronic commerce, advertising, marketing, health, and insurance sectors. However, any organization will most likely try to keep some patterns hidden once it shares its datasets with others. This article focuses on preserving the privacy of sensitive patterns when inducing decision trees. We propose a heuristic approach that can be used to hide a certain rule which can be inferred from the derivation of a binary decision tree. This hiding method is preferred over other heuristic solutions like output perturbation or cryptographic techniques—which limit the usability of the data—since the raw data itself is readily available for public use. This method can be used to hide decision tree rules with a minimum impact on all other rules derived.


2013 ◽  
Vol 99 ◽  
pp. 87-97 ◽  
Author(s):  
Amioy Kumar ◽  
M. Hanmandlu ◽  
H.M. Gupta
