An Optimized Feature Regularization in Boosted Decision Tree

We put forward a tree regularization framework that enables many tree models to perform feature selection efficiently. The key idea of the regularization scheme is to penalize selecting a new feature for a split when its gain is similar to that of the features used in previous splits. This paper used a standard data set as discrete test data and computed the entropy and information gain of each attribute to carry out the classification. Boosted decision trees are among the most prominent learning techniques in use today. This paper also developed an optimized decision-tree structure that improves the efficiency of the algorithm while guaranteeing an error rate on par with other classification algorithms.
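The penalized-gain idea described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the paper's implementation: the `penalty` constant and the all-or-nothing penalty rule are assumptions chosen only to show the mechanism.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one discrete feature."""
    total = entropy(labels)
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(y)
    return total - sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def regularized_gain(gain, feature, used_features, penalty=0.5):
    """Discount the gain of a feature not used in earlier splits, so the
    tree prefers reusing features (the paper's idea; constants illustrative)."""
    return gain if feature in used_features else penalty * gain
```

A pure split on a binary attribute (values perfectly separating two classes) yields an information gain equal to the full entropy of the labels; the regularizer then halves the gain of any feature the tree has not split on before.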


2021
Author(s):  
Nicodemus Nzoka Maingi ◽  
Ismail Ateya Lukandu ◽  
Matilu MWAU

Abstract

Background: The disease outbreak management operations of most countries (notably Kenya) present numerous novel ideas of how best to make use of notifiable disease data to effect proactive interventions. Notifiable disease data is reported, aggregated and variously consumed. Over the years, there has been a deluge of notifiable disease data, and the challenge for the entities managing it has been how to objectively and dynamically aggregate such data so as to enable efficient consumption and inform relevant mitigation measures. Various models have been explored, tried and tested with varying results; some purely mathematical and statistical, others quasi-mathematical and software model-driven.

Methods: One of the tools that has been explored is Artificial Intelligence (AI). AI is a technique that enables computers to intelligently perform and mimic actions and tasks usually reserved for human experts. AI presents a great opportunity for redefining how the data is more meaningfully processed and packaged. This research explores AI's Machine Learning (ML) theory as a differentiator in the crunching of notifiable disease data and the adding of perspective. An algorithm has been designed to test different notifiable disease outbreak data cases: a shift to managing disease outbreaks via the symptoms they generally manifest. Each notifiable disease is broken down into a set of symptoms, dubbed symptom burden variables, and these are categorized into eight clusters: Bodily, Gastro-Intestinal, Muscular, Nasal, Pain, Respiratory, Skin, and finally, Other Symptom Clusters. ML's decision tree theory has been utilized to determine the entropy and information gain of each symptom cluster based on select test data sets.

Results: Once the entropies and information gains have been determined, the information gain variables are ranked in descending order, from the variables with the highest information gains to those with the lowest, giving a clear-cut criterion for how the variables are ordered. The ranked variables are then utilized in the construction of a binary decision tree, which graphically and structurally represents the variables. Should any variables tie in the information gain rankings, they are given equal importance in the construction of the binary decision tree. From the presented data, the computed information gains are ordered as: Gastro-Intestinal, Bodily, Pain, Skin, Respiratory, Others, Muscular, and finally Nasal symptoms. The corresponding binary decision tree is then constructed.

Conclusions: The algorithm successfully singles out the disease burden variable(s) that are most critical as the point of diagnostic focus, enabling the relevant authorities to take the necessary, informed interventions. It provides a good basis for a country's localized diagnostic activities driven by data from reported notifiable disease cases, and presents a dynamic mechanism that can be used to analyze and aggregate any notifiable disease data set; the algorithm is not locked to any particular data set.
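The ranking step described above is simply a descending sort on the computed information gains. The sketch below uses hypothetical gain values chosen only to reproduce the published ordering (Gastro-Intestinal first, Nasal last); the real values come from the entropy calculations on the test data.

```python
# Hypothetical information-gain values per symptom cluster; only the
# relative order (taken from the abstract) is meaningful here.
gains = {
    "Gastro-Intestinal": 0.61, "Bodily": 0.54, "Pain": 0.47, "Skin": 0.39,
    "Respiratory": 0.31, "Others": 0.22, "Muscular": 0.18, "Nasal": 0.09,
}

# Rank clusters in descending order of information gain; Python's sort is
# stable, so tied variables keep their original (equal-importance) order.
ranked = sorted(gains, key=gains.get, reverse=True)
```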


PeerJ
2017
Vol 5
pp. e3092
Author(s):  
Shih-Hsiung Liang ◽  
Bruno Andreas Walther ◽  
Bao-Sen Shieh

Background: Biological invasions have become a major threat to biodiversity, and identifying determinants underlying success at different stages of the invasion process is essential for both prevention management and testing ecological theories. To investigate variables associated with different stages of the invasion process in a local region such as Taiwan, potential problems using traditional parametric analyses include too many variables of different data types (nominal, ordinal, and interval) and a relatively small data set with too many missing values.

Methods: We therefore used five decision tree models instead and compared their performance. Our dataset contains 283 exotic bird species which were transported to Taiwan; of these 283 species, 95 species escaped to the field successfully (introduction success); of these 95 introduced species, 36 species reproduced in the field of Taiwan successfully (establishment success). For each species, we collected 22 variables associated with human selectivity and species traits which may determine success during the introduction stage and establishment stage. For each decision tree model, we performed three variable treatments: (I) including all 22 variables, (II) excluding nominal variables, and (III) excluding nominal variables and replacing ordinal values with binary ones. Five performance measures were used to compare models, namely, area under the receiver operating characteristic curve (AUROC), specificity, precision, recall, and accuracy.

Results: The gradient boosting models performed best overall among the five decision tree models for both introduction and establishment success and across variable treatments. The most important variables for predicting introduction success were the bird family, the number of invaded countries, and variables associated with environmental adaptation, whereas the most important variables for predicting establishment success were the number of invaded countries and variables associated with reproduction.

Discussion: Our final optimal models achieved relatively high performance values, and we discuss differences in performance with regard to sample size and variable treatments. Our results showed that, for both the establishment model and introduction model, the number of invaded countries was the most important or second most important determinant, respectively. Therefore, we suggest that future success for introduction and establishment of exotic birds may be gauged by simply looking at previous success in invading other countries. Finally, we found that species traits related to reproduction were more important in establishment models than in introduction models; importantly, these determinants were not averaged but either minimum or maximum values of species traits. Therefore, we suggest that in addition to averaged values, reproductive potential represented by minimum and maximum values of species traits should be considered in invasion studies.
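The five performance measures used for model comparison can all be computed from predictions and scores without any library. The sketch below is a plain-Python reference for the binary case, not the authors' code; AUROC is computed via the rank (Mann-Whitney) formulation, with ties counted as half.

```python
def confusion_counts(y_true, y_pred):
    """Counts (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def auroc(y_true, scores):
    """Probability that a random positive scores above a random negative
    (equivalent to the area under the ROC curve)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Specificity, precision, recall and accuracy then follow directly from the four counts (tn/(tn+fp), tp/(tp+fp), tp/(tp+fn), (tp+tn)/total).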


Author(s):  
S. Neelakandan ◽  
D. Paulraj

People communicate their views, arguments and emotions about their everyday life on social media (SM) platforms (e.g. Twitter and Facebook). Twitter is an international micro-blogging service that features brief messages called tweets. Freestyle writing, incorrect grammar, typographical errors and abbreviations are some of the noise that occurs in the text. Sentiment analysis (SA) centered on tweets posted by users, together with opinion mining (OM) of customer reviews, is a popular research topic. The texts are gathered from users' tweets by means of OM and automatic SA centered on a ternary classification, namely positive, neutral and negative. It is very challenging for researchers to ascertain sentiment in Twitter data as a result of its limited size, misspellings, unstructured nature, abbreviations and slang. This paper, with the aid of the Gradient Boosted Decision Tree (GBDT) classifier, proposes an efficient SA and Sentiment Classification (SC) of Twitter data. Initially, the Twitter data undergoes pre-processing. Next, the pre-processed data is processed using HDFS MapReduce. The features are then extracted from the processed data, and efficient features are selected using the Improved Elephant Herd Optimization (I-EHO) technique. Score values are calculated for each of the chosen features and given to the classifier. Finally, the GBDT classifier classifies the data as negative, positive or neutral. Experimental results are analyzed and contrasted with other conventional techniques to show the superior performance of the proposed method.
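The pre-processing and ternary-labeling steps can be illustrated in a few lines. This is a minimal sketch, not the paper's pipeline: the cleaning rules (what counts as noise) and the score thresholds are assumptions made for illustration.

```python
import re

def preprocess_tweet(text):
    """Lowercase, drop URLs/@mentions/hash signs, keep letters only."""
    t = text.lower()
    t = re.sub(r"https?://\S+|@\w+|#", "", t)   # strip URLs, mentions, '#'
    t = re.sub(r"[^a-z\s]", " ", t)             # drop digits and punctuation
    return " ".join(t.split())                  # collapse whitespace

def polarity_label(score, neg=-0.1, pos=0.1):
    """Map a classifier's real-valued score to the ternary classes;
    the thresholds here are illustrative, not from the paper."""
    if score <= neg:
        return "negative"
    if score >= pos:
        return "positive"
    return "neutral"
```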


2008
Vol 12 (3)
Author(s):  
Jozef Zurada ◽  
Peng C. Lam

For many years lenders have been using traditional statistical techniques such as logistic regression and discriminant analysis to more precisely distinguish between creditworthy customers who are granted loans and non-creditworthy customers who are denied loans. More recently, new machine learning techniques such as neural networks, decision trees, and support vector machines have been successfully employed to classify loan applicants into those who are likely to pay a loan off or default upon a loan. Accurate classification is beneficial to lenders in terms of increased financial profits or reduced losses, and to loan applicants who can avoid overcommitment. This paper examines a historical data set from consumer loans issued by a German bank to individuals whom the bank considered to be qualified customers. The data set consists of the financial attributes of each customer and includes a mixture of loans that the customers paid off or defaulted upon. The paper examines and compares the classification accuracy rates of three decision tree techniques and analyzes their ability to generate easy-to-understand rules.
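The easy-to-understand rules that decision trees generate are essentially nested threshold tests on applicant attributes. The fragment below shows the shape of such extracted rules; the attribute names and cut-offs are invented for illustration and are not taken from the German credit data set.

```python
def credit_decision(applicant):
    """Hand-written rule set of the kind a learned decision tree yields.
    All attribute names and thresholds here are hypothetical."""
    if applicant["checking_balance"] < 0:
        return "deny"
    if applicant["duration_months"] > 36 and applicant["amount"] > 10_000:
        return "deny"
    return "grant"
```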


Author(s):  
Latifa Nass ◽  
Stephen Swift ◽  
Ammar Al Dallal

Most healthcare organizations and medical research institutions store their patients' data digitally for future reference and for planning future treatments. This heterogeneous medical data is very difficult to analyze due to its complexity and volume, in addition to missing values and noise, which make mining it a tedious task. Efficient classification of medical datasets has long been a major data mining problem. Diagnosis, prediction of diseases and the precision of results can be improved if relationships and patterns are extracted efficiently from these complex medical datasets. This paper analyses some of the major classification algorithms, such as C4.5 (J48), SMO, Naïve Bayes, KNN and Random Forest, and the performance of these algorithms is compared using WEKA. Performance evaluation is based on accuracy, sensitivity, specificity and error rate. The medical data sets used in this study are the Heart-Statlog data set, which holds medical data related to heart disease, and the Pima Diabetes data set, which holds data related to diabetes. This study contributes to finding the most suitable algorithm for classifying medical data and also reveals the importance of preprocessing in improving classification performance. A comparative study of the performance of the various machine learning algorithms is presented through graphical representation of the results.

Keywords: Data Mining, Health Care, Classification Algorithms, Accuracy, Sensitivity, Specificity, Error Rate
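The evaluation measures named above all derive from the four confusion-matrix counts. A minimal reference implementation, as a sketch rather than the study's actual WEKA workflow:

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), specificity and error rate
    computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "error_rate": 1 - accuracy,
    }
```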


In the previous two decades, networks have experienced enormous growth, which has accelerated a shift in computing environments from centralized computer systems to networked information systems. A large volume of valuable information, such as personal profiles and credit card data, is stored on and transmitted through networks; consequently, network security has become more important than ever. However, given open and complex interconnected network systems, it is difficult to establish a secure networking environment. Intruders endanger system security by crashing services, modifying critical data, and stealing valuable information. In information security, intrusion detection is the act of detecting actions that attempt to compromise the confidentiality, integrity or availability of a resource. It plays a significant role in attack detection, security checking and network inspection. This paper presents an updated decision-tree-based method for the classification of intrusion data. The KDD 99 data set is used for experimental work.
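At prediction time, a decision tree learned on KDD 99 connection records reduces to threshold tests on connection features. The toy rules below use two real KDD 99 field names (`serror_rate`, `num_failed_logins`) but entirely illustrative thresholds and labels; they sketch the shape of the method's output, not the paper's learned tree.

```python
def classify_connection(record):
    """Stump-style rules of the kind a tree trained on KDD 99 produces.
    Thresholds and attack labels are illustrative only."""
    if record["serror_rate"] > 0.5:          # many SYN errors: flood-like
        return "dos"
    if record["num_failed_logins"] > 3:      # repeated failed logins
        return "r2l"
    return "normal"
```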

