Utilizing the Genetic Algorithm to Pruning the C4.5 Decision Tree Algorithm

2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Maad M. Mijwil ◽  
Rana A. Abttan

A decision tree (DT) is one of the most popular machine learning algorithms; it divides data repeatedly to form groups or classes. It is a supervised learning algorithm that can be used on discrete or continuous data for classification or regression. The most traditional classifier in this family is the C4.5 decision tree, which is the focus of this research. This classifier has the advantage of handling vast data sets and does not stop growing until it reaches the desired goal. Its drawback is that it produces unnecessary nodes and branches, leading to overfitting, which can negatively affect the classification process. In this context, the authors suggest utilizing a genetic algorithm to prune the tree and reduce the effect of overfitting. The study uses four datasets: IRIS, Car Evaluation, GLASS, and WINE, collected from the UC Irvine (UCI) Machine Learning Repository. The experimental results confirm the effectiveness of the genetic algorithm in pruning away the effect of overfitting on the four datasets and in optimizing the confidence factor (CF) of the C4.5 decision tree. The proposed method reaches about 92% accuracy in this work.
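To make the idea concrete, here is a minimal sketch of genetic-algorithm pruning, assuming scikit-learn's CART tree (DecisionTreeClassifier) as a stand-in for C4.5, which scikit-learn does not implement. The GA evolves the cost-complexity pruning strength ccp_alpha, playing the role C4.5's confidence factor plays in the paper; population size, ranges, and operators are illustrative, not the authors' settings.

```python
# Sketch: evolve a pruning parameter with a simple genetic algorithm.
import random
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def fitness(alpha):
    # Mean cross-validated accuracy of a tree pruned with this alpha.
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    return cross_val_score(tree, X, y, cv=5).mean()

random.seed(0)
population = [random.uniform(0.0, 0.05) for _ in range(20)]
for generation in range(30):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[:10]                    # selection: keep the best half
    children = []
    while len(children) < 10:
        a, b = random.sample(parents, 2)
        child = (a + b) / 2                  # crossover: blend two parents
        child += random.gauss(0, 0.005)      # mutation: small Gaussian step
        children.append(min(max(child, 0.0), 0.05))
    population = parents + children

best = max(population, key=fitness)
print(f"best ccp_alpha={best:.4f}, accuracy={fitness(best):.3f}")
```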

Author(s):  
Argelia B. Urbina Nájera ◽  
Jorge De la Calleja

In this paper, we present a method to improve academic tutoring in higher education. The method automatically identifies the main skills of tutors using decision trees, one of the most widely used algorithms in the machine learning community for solving real-world problems with high accuracy. In our study, the decision tree algorithm was able to identify skills and personal affinities between students and tutors. Experiments were carried out using a data set of 277 students and 19 tutors, selected by simple random sampling and voluntary participation, respectively. Preliminary results show that the most important attributes for tutors are communication, self-direction, and digital skills. We also introduce a tutoring process in which tutor assignment is based on these attributes, on the assumption that it can help strengthen the student skills demanded by today's society. The resulting decision tree can likewise be used to cluster tutors and students based on their personal abilities and affinities using other machine learning algorithms. Applying the suggested tutoring process could set the tone for viewing tutoring individually, without linking it to processes of academic performance or school dropout.
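A hedged sketch of the skills-identification step: fit a decision tree on tutor attributes and read off which skills dominate the splits. The skill names, synthetic Likert-style scores, and the "good match" label below are illustrative assumptions, not the paper's actual survey items.

```python
# Sketch: rank tutor skills by decision-tree feature importance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
skills = ["communication", "self_direction", "digital_skills", "empathy"]
X = rng.uniform(1, 5, size=(19, len(skills)))   # 19 tutors, Likert-style scores
y = (X[:, 0] + X[:, 1] > 6).astype(int)         # hypothetical "good match" label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
for name, imp in sorted(zip(skills, tree.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.2f}")
```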


Mathematics ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. 1115
Author(s):  
Jaehak Yu ◽  
Sejin Park ◽  
Hansung Lee ◽  
Cheol-Sig Pyo ◽  
Yang Sun Lee

Recently, with the rapid shift to an aging society and increased interest in healthcare, disease prediction and management through various healthcare devices and services is attracting much attention. In particular, stroke, the most prominent cerebrovascular disease, is very dangerous: in adults and the elderly it often causes death or severe mental and physical aftereffects. These sequelae are especially serious because they make social and economic activities difficult. In this paper, we propose a new system to predict and analyze in depth the stroke severity of elderly people over 65 years of age, based on the National Institutes of Health Stroke Scale (NIHSS). We use the C4.5 decision tree, a machine learning method for prediction and analysis that additionally yields in-depth rules about the execution mechanism and supports semantic interpretation. We verify that the C4.5 decision tree algorithm can classify and predict stroke severity and also reduce the number of NIHSS features required. During operation of an actual system, the proposed model uses only 13 of the 18 stroke-scale features, including age, so it can provide faster and more accurate service support. Experimental results show that the system reduces the patient NIH stroke scale measurement time and makes operation more efficient, with an overall accuracy of 91.11% using the C4.5 decision tree algorithm.
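A sketch of the feature-reduction idea: rank stroke-scale items by decision-tree importance and keep the top 13 of 18, as the paper reports. The data here is synthetic and the severity label is a placeholder, not the actual NIHSS fields or labels.

```python
# Sketch: keep the 13 most informative of 18 scale items via tree importances.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(500, 18))        # 18 stroke-scale items incl. age bin
y = (X[:, :5].sum(axis=1) > 10).astype(int)   # hypothetical severity label

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
ranked = np.argsort(tree.feature_importances_)[::-1]
keep = sorted(ranked[:13])                    # retain the 13 most informative items
print("features kept:", keep)
```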


2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 126-127
Author(s):  
Lucas S Lopes ◽  
Christine F Baes ◽  
Dan Tulpan ◽  
Luis Artur Loyola Chardulo ◽  
Otavio Machado Neto ◽  
...  

Abstract The aim of this project is to compare some state-of-the-art machine learning algorithms on the classification of steers finished in feedlots, based on performance, carcass, and meat quality traits. Precise classification of animals allows fast, real-time decision making in the animal food industry, such as culling or retention of herd animals. Beef production presents high variability in its numerous carcass and beef quality traits. Machine learning algorithms and software provide an opportunity to evaluate the interactions between traits and so better classify animals. Four treatment levels of wet distiller's grain were applied to 97 Angus-Nellore animals and used as the target classes for the classification problem. The C4.5 decision tree, Naïve Bayes (NB), Random Forest (RF), and Multilayer Perceptron (MLP) artificial neural network algorithms were used to predict and classify the animals from recorded trait measurements, which include initial and final weights, shear force, and meat color. The top-performing classifier was the C4.5 decision tree, with a classification accuracy of 96.90%, while the RF, MLP, and NB classifiers had accuracies of 55.67%, 39.17%, and 29.89%, respectively. We observed that the final decision tree model constructed with C4.5 selected only the dry matter intake (DMI) feature as a differentiator. When DMI was removed, no other feature or combination of features was strong enough to give good prediction accuracy for any of the classifiers. In a follow-up study with a significantly larger sample size, we plan to investigate why DMI is a more relevant parameter than the other measurements.
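Below is a sketch of the four-classifier comparison, assuming scikit-learn stand-ins: CART for C4.5, plus GaussianNB, RandomForestClassifier, and MLPClassifier. The synthetic data merely mimics the shape of the study (97 animals, four treatment classes); the reported accuracies cannot be reproduced from it.

```python
# Sketch: cross-validated comparison of four classifiers on animal traits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(97, 6))         # e.g. weights, shear force, meat color
y = rng.integers(0, 4, size=97)      # 4 distiller's-grain treatment levels

models = {
    "C4.5 (CART stand-in)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.2%}")
```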


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Peter Appiahene ◽  
Yaw Marfo Missah ◽  
Ussiph Najim

The financial crisis that hit Ghana from 2015 to 2018 raised various issues with respect to the efficiency of banks and the safety of depositors' funds in the banking industry. As part of measures to improve the banking sector and restore customers' confidence, efficiency and performance analysis in the banking industry has become a hot issue, because stakeholders need to detect the underlying causes of inefficiencies within it. Nonparametric methods such as Data Envelopment Analysis (DEA) have been suggested in the literature as a good measure of banks' efficiency and performance, and machine learning algorithms are regarded as good tools for estimating various nonparametric and nonlinear problems. This paper combines DEA with three machine learning approaches to evaluate bank efficiency and performance using 444 Ghanaian bank branches as Decision Making Units (DMUs). The results were compared with the corresponding efficiency ratings obtained from the DEA, and the prediction accuracies of the three machine learning models were then compared. The results suggest that the decision tree (DT), with its C5.0 algorithm, provided the best predictive model: it achieved 100% accuracy on the holdout sample of 134 branches (30% of the banks) with a P value of 0.00. The DT was followed closely by the random forest algorithm, with a predictive accuracy of 98.5% and a P value of 0.00, and finally the neural network (86.6% accuracy) with a P value of 0.66. The study concludes that banks in Ghana can use these results to predict their respective efficiencies. All experiments were performed within a simulation environment in RStudio using R code.
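A sketch of the second stage only: given DEA efficiency ratings for branches (random placeholders here), train a decision tree to predict efficient versus inefficient branches on a 30% holdout, mirroring the study's setup. C5.0 is an R package with no Python counterpart, so scikit-learn's CART tree stands in.

```python
# Sketch: predict DEA efficiency class from branch data with a decision tree.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(444, 5))                 # branch inputs/outputs (placeholder)
dea_score = 1 / (1 + np.exp(-X.sum(axis=1)))  # stand-in for real DEA ratings
y = (dea_score > 0.5).astype(int)             # efficient vs. inefficient

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"holdout accuracy: {accuracy_score(y_te, tree.predict(X_te)):.2%}")
```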


2020 ◽  
Vol 6 (4) ◽  
pp. 149 ◽  
Author(s):  
Tao Li ◽  
Lei Ma ◽  
Zheng Liu ◽  
Kaitong Liang

In the context of applying artificial intelligence to an intellectual property trading platform, the number of demanders and suppliers exchanging scarce resources is growing continuously, and improvements in computational power significantly increase matching efficiency. It is necessary to greatly reduce energy consumption in order to run machine learning in the terminals and microprocessors of edge computing (smart phones, wearable devices, automobiles, IoT devices, etc.) and to reduce the resource burden on data centers. Machine learning algorithms produced in an open community lack standardization in practice, and hence require open-innovation participation to reduce computing cost, shorten algorithm running time, and improve human-machine collaborative competitiveness. The purpose of this study was to find an economical range for the evaluation granularity of a decision tree, a popular machine learning algorithm. This work addresses two research questions: what is the economical tree-depth interval, and what is the corresponding time cost as granularity increases for a given number of matches? The study also aimed to balance efficiency and cost via simulation. Results show that the reduction in tree search depth brought by increased evaluation granularity is not linear, which means that, for a given number of candidate matches, the granularity has a definite and relatively economical range. Choosing an evaluation granularity within this range yields a smaller tree depth while avoiding the inefficiency of an excessive increase in time cost. Hence, standardization of an AI algorithm is applicable to edge computing scenarios such as an intellectual property trading platform. The economical granularity interval saves not only computing resource costs but also AI decision-making time, and it avoids wasting human decision-makers' time.
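A minimal sketch of the depth/granularity trade-off the paper simulates: with g distinguishable levels per split, separating n candidate matches needs a tree of depth about log_g(n), while finer granularity costs more per node. The linear per-node cost model below is an assumption, not the authors' cost function.

```python
# Sketch: tree depth falls non-linearly with granularity while cost rises.
import math

def tree_depth(n_matches, granularity):
    return math.ceil(math.log(n_matches, granularity))

def time_cost(n_matches, granularity, unit_cost=1.0):
    # assumed: per-node evaluation cost grows linearly with granularity
    return tree_depth(n_matches, granularity) * granularity * unit_cost

n = 10_000
for g in (2, 4, 8, 16, 32, 64):
    print(f"g={g:3d}  depth={tree_depth(n, g):2d}  cost={time_cost(n, g):6.1f}")
```

Running this shows the depth benefit flattening quickly while the cost eventually climbs, which is the economical middle range the study identifies.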


2020 ◽  
Vol 17 (9) ◽  
pp. 4294-4298
Author(s):  
B. R. Sunil Kumar ◽  
B. S. Siddhartha ◽  
S. N. Shwetha ◽  
K. Arpitha

This paper uses distinct machine learning algorithms and explores their features. The primary advantage of machine learning is that an algorithm can carry out its work automatically by learning what to do with information. This paper presents the concept of machine learning and its algorithms, which can be used in different applications such as healthcare, sentiment analysis, and many more. Programmers are sometimes unsure which algorithm to apply to their application; this paper gives guidance on choosing an algorithm on the basis of how accurately it fits. Based on the collected data, an algorithm can be selected according to its pros and cons. Given the data set, a base model is developed, trained, and tested; the trained model is then ready for prediction and can be deployed depending on feasibility, as in the sketch below.
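A minimal sketch of the develop, train, test loop described above, using scikit-learn and one of its bundled datasets; the specific model choice is illustrative.

```python
# Sketch: develop a base model, train it, test it before deployment.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)  # develop and train
print(f"test accuracy: {model.score(X_te, y_te):.2%}")     # test before deploying
```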


Author(s):  
Himanshu Verma

Many attempts have been made to classify bees as bumble bees or honey bees, and a large amount of research has sought the differences between them on the basis of features such as wing size, bee size, color, life cycle, and many more. All of this analysis, however, was either qualitative or quantitative; to overcome this issue, researchers devised an approach in which both qualitative and quantitative analysis is used for classification, and machine learning algorithms give the process a boost. Classification now takes less time, as these algorithms are fast and accurate, and machine learning makes the work easier. Many photographs had to be collected and stored as the data set, and from these the machine learning algorithms extract information about the bees that researchers can use in further classification. The images had to be manipulated to prepare them so that they could be fed to the algorithms for feature extraction. Because the many photographs in the data set take up a lot of space, and the region containing the bee in each photograph is small, dimensionality reduction was applied; it disregards other objects present in the photographs chosen as the data set, such as trees, leaves, and flowers.
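A hedged sketch of the pipeline the passage outlines: flatten preprocessed bee images, reduce dimensionality, then classify. Random arrays stand in for the photographs, and the bumble-versus-honey labels are synthetic; PCA and SVM are assumed component choices, not necessarily the researchers'.

```python
# Sketch: image flattening + dimension reduction + classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(4)
images = rng.random((200, 64, 64))        # 200 cropped grayscale bee photos
labels = rng.integers(0, 2, size=200)     # 0 = honey bee, 1 = bumble bee

X = images.reshape(len(images), -1)       # flatten pixels into feature vectors
clf = make_pipeline(PCA(n_components=50), SVC())  # reduce dimensions, classify
print(f"cv accuracy: {cross_val_score(clf, X, labels, cv=5).mean():.2%}")
```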


Diabetes is one of the most common diseases in humans today. Predictions for this disease are proposed through machine learning techniques, by which the risk factors of the disease are identified and its progression can be prevented. Early prediction of such a disease allows it to be controlled and can save lives. For early prediction we collected a data set of 200 diabetic patients with 8 attributes. Each patient's blood sugar level is assessed from the glucose content in the body and according to age. The main machine learning algorithms considered are Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree (DT). In the existing approach, Naive Bayes reaches 66% accuracy and the decision tree 70 to 71%, so the patients' accuracy levels are not in a proper range. With XGBoost classifiers, however, accuracy rises to 74% for Naive Bayes and to 89-90% for the decision tree. In the proposed system the accuracy ranges are shown properly, and this is the configuration mostly used. A data set of 729 patients is stored in MongoDB; the reports of 129 patients are used for prediction and the remainder for training.
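A sketch of the comparison reported above: Naive Bayes and a decision tree against a gradient-boosted model on tabular diabetes-style data. The data is synthetic (8 attributes, as in the study), and scikit-learn's GradientBoostingClassifier stands in for XGBoost to stay dependency-free.

```python
# Sketch: baseline classifiers vs. a boosted model on 8-attribute patient data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))                 # 8 attributes per patient
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

for name, model in [("Naive Bayes", GaussianNB()),
                    ("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("Boosting (XGBoost stand-in)",
                     GradientBoostingClassifier(random_state=0))]:
    print(f"{name}: {cross_val_score(model, X, y, cv=5).mean():.2%}")
```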


Author(s):  
E. Seyedkazemi Ardebili ◽  
S. Eken ◽  
K. Küçük

Abstract. After a brief look at the smart home, we conclude that a smart home requires an intelligent management center. In this article, we try to enable the smart home management center to detect an abnormal state in the behavior of someone who lives in the house. In the proposed method, a daily algorithm examines the rate of change in a person's behavior and produces a number, henceforth called the NNC (Number of Normal Changes), based on those behavioral changes. We obtain the NNC using a machine learning algorithm together with a series of simple statistical and mathematical calculations. The NNC reflects abnormal changes in residents' behavior in a smart home: it is small for a regular person with constant routines and large for a person without fixed, regular habits in personal life. To increase the accuracy of the NNC, we reviewed common machine learning algorithms and, after testing, chose the decision tree for its higher accuracy and speed; the NNC is then obtained by combining the decision tree algorithm with statistical and mathematical methods. We feed the proposed algorithm a set of states and sensor readings, along with the activities performed by the occupant over a period of several days, and the method generates the main NNC for those days for each person living in the smart home. To generate this main NNC, we calculate each person's daily NNC (based on his or her behavior that day); the main NNC is the average of these daily NNCs. We chose the ARAS dataset (Human Activity Datasets in Multiple Homes with Multiple Residents) to implement our method. After tests and replications on ARAS, to find anomalies in a person's behavior on a given day, we compare the main (average) NNC with that person's daily NNC for that day. If the NNC changes by more than 30%, there is a possibility of an abnormality; if it changes by more than 60%, we can say that an abnormal state or uncommon event happened that day, and a declaration of an abnormal state is issued to the resident of the house.
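A sketch of the comparison rule described above: compare each day's NNC with the resident's average NNC and flag deviations past the 30% and 60% thresholds. How a daily NNC is computed (the decision tree plus statistics) is abstracted into the input list here; the values are illustrative.

```python
# Sketch: flag abnormal days by comparing daily NNC against the average NNC.
def check_day(daily_nnc, mean_nnc):
    change = abs(daily_nnc - mean_nnc) / mean_nnc
    if change > 0.60:
        return "abnormal state: notify resident"
    if change > 0.30:
        return "possible abnormality"
    return "normal"

daily_nncs = [4.1, 3.8, 4.3, 7.2, 4.0]        # illustrative per-day values
mean_nnc = sum(daily_nncs) / len(daily_nncs)  # the "main" NNC is the average
for day, nnc in enumerate(daily_nncs, 1):
    print(f"day {day}: NNC={nnc:.1f} -> {check_day(nnc, mean_nnc)}")
```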


2021 ◽  
Author(s):  
Marc Raphael ◽  
Michael Robitaille ◽  
Jeff Byers ◽  
Joseph Christodoulides

Abstract Machine learning algorithms hold the promise of greatly improving live cell image analysis by (1) analyzing far more imagery than is feasible with traditional manual approaches and (2) eliminating the subjectivity of researchers and diagnosticians selecting the cells or cell features to be included in the analyzed data set. Currently, however, even the most sophisticated model-based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm's initial training steps and then repeatedly applied to the imagery. To address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly on the live cell imagery data, thus providing objective segmentation and quantification. The approach incorporates an optical flow algorithm component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. Because it is self-trained, the software has no user-adjustable parameters and does not require curated training imagery. The algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities: fluorescence, phase contrast, differential interference contrast (DIC), transmitted light, and interference reflection microscopy (IRM). The approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery's optical modality, magnification, or cell type.
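A conceptual sketch of the self-labeling step: motion between frames (a crude frame difference here, standing in for the paper's full optical-flow estimate) marks likely cell pixels, and those self-labels train a pixel classifier with no hand-curated masks. Synthetic frames replace real microscopy, and the classifier choice is illustrative.

```python
# Sketch: self-label pixels from inter-frame motion, then train a classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(6)
frame1 = rng.random((64, 64))
frame2 = frame1.copy()
frame2[20:40, 20:40] += 0.5              # a "cell" region that changed between frames

motion = np.abs(frame2 - frame1)
cell = motion > 0.25                     # self-labels from motion, no curation needed

# train a per-pixel classifier on intensity features derived from the self-labels
X = frame2.reshape(-1, 1)
y = cell.reshape(-1).astype(int)
clf = GaussianNB().fit(X, y)
mask = clf.predict(X).reshape(64, 64)
print(f"segmented cell pixels: {mask.sum()}")
```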

