A Comparative Study of Decision Tree and Naive Bayes Machine Learning Model for Crime Category Prediction in Chicago

Author(s):  
Bshayer S. Aldossari ◽  
Futun M. Alqahtani ◽  
Noura S. Alshahrani ◽  
Manar M. Alhammam ◽  
Razan M. Alzamanan ◽  
...  
2020 ◽  
Vol 1 (2) ◽  
pp. 61-66
Author(s):  
Febri Astiko ◽  
Achmad Khodar

This study aims to design a machine learning model for sentiment analysis of Indosat Ooredoo service reviews on Twitter, using the Naive Bayes algorithm to classify reviews as positive or negative. The analysis uses machine learning to learn patterns and a model that can be reused to predict new data.


Author(s):  
Dhilsath Fathima.M ◽  
S. Justin Samuel ◽  
R. Hari Haran

Aim: To develop an improved and robust machine learning model for predicting myocardial infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning based computer-aided analysis system for early and accurate prediction of MI, using the Framingham Heart Study dataset for validation and evaluation. The proposed model is intended to help medical professionals predict myocardial infarction proficiently. Methods: The proposed model uses mean imputation to fill in missing values in the dataset, then applies principal component analysis (PCA) to extract the optimal features and enhance classifier performance. After PCA, the reduced features are partitioned into training and testing datasets: 70% of the data is used to train four widely used classifiers (support vector machine, k-nearest neighbor, logistic regression, and decision tree), and the remaining 30% is used to evaluate the models using a confusion matrix, classifier accuracy, precision, sensitivity, F1-score, and the AUC-ROC curve. Results: The classifier outputs were evaluated using these performance measures. Logistic regression provided higher accuracy than the k-NN, SVM, and decision tree classifiers, and PCA performed well as a feature extraction method for enhancing the proposed model. From these analyses, we conclude that logistic regression has a good mean accuracy and standard deviation of accuracy compared with the other three algorithms. The AUC-ROC curves in Figures 4 and 5 show that logistic regression achieves a good AUC-ROC score, around 70%, compared with the k-NN and decision tree algorithms.
Conclusion: From the result analysis, we infer that the proposed machine learning model can act as an optimal decision-making system that predicts acute myocardial infarction at an earlier stage than existing machine learning based prediction models. It can predict the presence of acute myocardial infarction in humans from heart disease risk factors, helping to decide when to start lifestyle modification and medical treatment to prevent heart disease.
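The first and third preprocessing steps described in the Methods (mean imputation and the 70/30 partition) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names, the toy two-column data, and the fixed random seed are assumptions made for illustration, and the PCA and classifier stages are deliberately left out.

```python
import random
from statistics import mean


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def split_train_test(rows, labels, train_fraction=0.7, seed=0):
    """Shuffle indices reproducibly, then split into training and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return (
        [rows[i] for i in train], [labels[i] for i in train],
        [rows[i] for i in test], [labels[i] for i in test],
    )


# Hypothetical toy data: two numeric risk factors per patient, None = missing.
rows = [
    [120.0, None], [130.0, 200.0], [None, 220.0], [110.0, 180.0],
    [140.0, 240.0], [125.0, 210.0], [135.0, None], [115.0, 190.0],
    [145.0, 250.0], [150.0, 230.0],
]
labels = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

imputed = impute_column_means(rows)
X_train, y_train, X_test, y_test = split_train_test(imputed, labels)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

In a fuller pipeline the imputation means would normally be computed on the training portion only and then applied to the test portion, to avoid leaking test information. The sketch imputes before splitting only because that is the order the abstract describes.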


In this era of social media, it is estimated that over 5 billion people use smartphones, and over 1.5 billion of them are active users. We are all part of this group, and before opening a message we are curious about what we have received, always hoping it is a good one. Sentiment analysis of social media data is therefore seen by many as an effective tool for monitoring user preferences and inclinations. We propose a scalable machine learning model that analyzes the polarity of a communicative text using the Bernoulli Naive Bayes classifier. The paper considers only two polarities: whether a sentence is positive or negative. The Bernoulli classifier is used because it is best suited to binary inputs, which raises accuracy to as much as 97%.
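To make the "binary inputs" point concrete, the following is a minimal standard-library sketch of a Bernoulli Naive Bayes classifier, in which each vocabulary word contributes one present/absent feature. The toy sentences, the function names, and the Laplace smoothing constant `alpha` are assumptions for illustration and do not come from the paper.

```python
import math


def train_bernoulli_nb(texts, labels, alpha=1.0):
    """Fit Bernoulli Naive Bayes on whitespace-tokenized texts (presence/absence features)."""
    docs = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))
    model = {"vocab": vocab, "log_prior": {}, "p_word": {}}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        model["log_prior"][c] = math.log(n_c / len(docs))
        # Smoothed probability that word w appears in a document of class c.
        model["p_word"][c] = {
            w: (sum(w in d for d in class_docs) + alpha) / (n_c + 2 * alpha)
            for w in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest log-posterior for the given text."""
    present = set(text.lower().split())
    scores = {}
    for c, log_prior in model["log_prior"].items():
        score = log_prior
        for w in model["vocab"]:
            p = model["p_word"][c][w]
            # Absent words also contribute, via log(1 - p).
            score += math.log(p if w in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical training sentences.
texts = [
    "good great service",
    "great fast network",
    "bad slow network",
    "bad terrible service",
]
labels = ["positive", "positive", "negative", "negative"]

model = train_bernoulli_nb(texts, labels)
print(predict(model, "great service"))
print(predict(model, "bad slow"))
```

Unlike the multinomial variant, this formulation ignores how often a word occurs and penalizes the absence of words that are typical for a class, which is why it fits short texts encoded as binary indicators.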


2021 ◽  
Author(s):  
Son Hoang ◽  
Tung Tran ◽  
Tan Nguyen ◽  
Tu Truong ◽  
Duy Pham ◽  
...  

Abstract This paper reports a successful case study of applying machine learning to improve the history matching process, making it easier, less time-consuming, and more accurate, by determining whether Local Grid Refinement (LGR) with transmissibility multiplier is needed to history match gas-condensate wells producing from geologically complex reservoirs as well as determining the required LGR setup to history match those gas-condensate producers. History matching Hai Thach gas-condensate production wells is extremely challenging due to the combined effect of condensate banking, sub-seismic fault network, complex reservoir distribution and connectivity, uncertain HIIP, and lack of PVT data for most reservoirs. In fact, for some wells, many trial simulation runs were conducted before it became clear that LGR with transmissibility multiplier was required to obtain good history matching. In order to minimize this time-consuming trial-and-error process, machine learning was applied in this study to analyze production data using synthetic samples generated by a very large number of compositional sector models so that the need for LGR could be identified before the history matching process begins. Furthermore, machine learning application could also determine the required LGR setup. The method helped provide better models in a much shorter time, and greatly improved the efficiency and reliability of the dynamic modeling process. More than 500 synthetic samples were generated using compositional sector models and divided into separate training and test sets. Multiple classification algorithms such as logistic regression, Gaussian Naive Bayes, Bernoulli Naive Bayes, multinomial Naive Bayes, linear discriminant analysis, support vector machine, K-nearest neighbors, and Decision Tree as well as artificial neural networks were applied to predict whether LGR was used in the sector models. 
The best algorithm was found to be the Decision Tree classifier, with 100% accuracy on the training set and 99% accuracy on the test set. The LGR setup (size of LGR area and range of transmissibility multiplier) was also predicted best by the Decision Tree classifier with 91% accuracy on the training set and 88% accuracy on the test set. The machine learning model was validated using actual production data and the dynamic models of history-matched wells. Finally, using the machine learning prediction on wells with poor history matching results, their dynamic models were updated and significantly improved.


Diabetes is one of the most common diseases today. This work proposes predicting the disease with machine learning techniques, so that its risk factors can be identified and kept from worsening. Early prediction can help control the disease and save lives. For early prediction, we collected a dataset with 8 attributes from 200 diabetic patients. A patient's sugar level is assessed from features such as glucose content and age. The main machine learning algorithms are support vector machine (SVM), Naive Bayes (NB), k-nearest neighbor (KNN), and decision tree (DT). In the existing system, Naive Bayes reaches 66% accuracy and the decision tree 70 to 71%, which is not an adequate range. In the proposed system, which uses XGBoost classifiers, the reported accuracy is 74% for Naive Bayes and 89 to 90% for the decision tree. The proposed system gives accuracy in an acceptable range and is therefore the preferred approach. A dataset of 729 patients is stored in MongoDB; 129 patient reports are held out for prediction and the remainder are used for training. The model trained on the training data is then used for prediction.


Author(s):  
P. Chandra Sandeep

CharityML is a fictional non-profit organization created solely for this project. Many non-profit organizations depend on the donations they receive and need to be selective about whom they approach for donations. In this project, we used several supervised learning algorithms to accurately model individuals' income using data collected from the 1994 U.S. Census. We then select the best algorithm from the preliminary results and optimize it for better prediction. The goal of this implementation is to construct a model that accurately predicts whether a person makes more than 50,000 dollars. This type of task is useful in a non-profit setting, where organizations survive on donations. Understanding an individual's income can help a non-profit better understand how large a donation to request, or whether to reach out at all. While it can be hard to determine an individual's general income bracket from known sources, we can infer this value from other publicly available features. The dataset for this project originates from the UCI Machine Learning Repository. It was donated by Ron Kohavi and Barry Becker after being published in the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". The data we examine here includes a few modifications to the raw dataset, such as removing the 'hgtre' attribute and records with missing or ill-formatted fields.


2022 ◽  
Vol 2161 (1) ◽  
pp. 012015
Author(s):  
V Sai Krishna Reddy ◽  
P Meghana ◽  
N V Subba Reddy ◽  
B Ashwath Rao

Abstract Machine Learning is an application of Artificial Intelligence in which the method begins with observations on data. In the medical field, it is very important to make a correct decision quickly while treating a patient. Here ML techniques play a major role in predicting disease by drawing on the vast amount of data produced by the healthcare field. In India, heart disease is the major cause of death. According to the WHO, stroke can be predicted and prevented by timely action. This study aims to predict cardiovascular disease with better accuracy by applying ML techniques such as Decision Tree and Naïve Bayes, together with risk factors. The dataset considered is the Heart Failure Dataset, which consists of 13 attributes. To analyze the performance of the techniques, the collected data must first be pre-processed. This is followed by feature selection and reduction.


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Fatmah Abdulrahman Baothman

A humanoid robot’s development requires an incredible combination of interdisciplinary work from engineering to mathematics, software, and machine learning. NAO is a humanoid bipedal robot designed to participate in football competitions against humans by 2050, and speed is crucial in football. Therefore, the focus of the paper is on improving NAO’s speed. This paper tests the hypothesis that the humanoid NAO’s walking speed can be improved without changing its physical configuration. The applied research method compares three classification techniques, artificial neural network (ANN), Naïve Bayes, and decision tree, to measure and predict NAO’s best walking speed, then selects the best method and enhances it to find the optimal average velocity. According to Aldebaran documentation, the real NAO robot’s default walking speed is 9.52 cm/s. The proposed work was initiated by studying the NAO hardware platform’s limitations and selecting 12 of NAO’s gait parameters to measure the accuracy metrics implemented in the design of the three classification models. Five experiments were designed to model and trace the changes in the 12 parameters. Preliminary open-source NAO walking datasets available on GitHub, along with the NAL and RoboCup datasheets, were used. All generated gait parameters for both legs and feet in the experiments were recorded using the Choregraphe software. This dataset was divided into 30% for training and 70% for testing each model. The recorded gait parameters were then fed to the three classification models to measure and predict NAO’s best walking speed. After 500 training cycles for Naïve Bayes, the decision tree, and the ANN, RapidMiner reported walking speed metric rates of 48.20%, 49.87%, and 55.12%, respectively. Next, the emphasis was on enhancing the ANN model to reach the optimal average walking velocity for the real NAO. 
With 12 attributes, the maximum accuracy metric rate of 65.31% was reached with only four hidden layers in 500 training cycles with a 0.5 learning rate for the best walking learning process, and the ANN model predicted the optimal average velocity speed of 51.08% without stiffness: V1 = 22.62 cm/s, V2 = 40 cm/s, and V = 30 cm/s. Thus, the tested hypothesis holds with the ANN model scoring the highest accuracy rate for predicting NAO’s robot walking state speed by taking both legs to gauge joint 12 parameter values.


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Yi Yin ◽  
Mingyue Xue ◽  
Lingen Shi ◽  
Tao Qiu ◽  
Derun Xia ◽  
...  

Objective. To establish a machine learning model for identifying patients coinfected with hepatitis B virus (HBV) and human immunodeficiency virus (HIV) through two sexual transmission routes in Jiangsu, China. Methods. A total of 14197 HIV cases transmitted by homosexual and heterosexual routes were recruited. After data processing, 12469 cases (HIV and HBV, 1033; HIV, 11436) were left for further analysis, including 7849 cases with homosexual transmission and 4620 cases with heterosexual transmission. Univariate logistic regression was used to select variables with significant P value and odds ratio for multivariable analysis. In homosexual transmission and heterosexual transmission groups, 10 and 6 variables were selected, respectively. For identifying HIV individuals coinfected with HBV, a machine learning model was constructed with four algorithms, including Decision Tree, Random Forest, AdaBoost with decision tree (AdaBoost), and extreme gradient boosting decision tree (XGBoost). The detective value of each variable was calculated using the optimal machine learning algorithm. Results. AdaBoost algorithm showed the highest efficiency in both transmission groups (homosexual transmission group: accuracy = 0.928, precision = 0.915, recall = 0.944, F1 = 0.930, and AUC = 0.96; heterosexual transmission group: accuracy = 0.892, precision = 0.881, recall = 0.905, F1 = 0.893, and AUC = 0.98). Calculated by AdaBoost algorithm, the detective value of PLA was the highest in homosexual transmission group, followed by CR, AST, HB, ALT, TBIL, leucocyte, age, marital status, and treatment condition; in the heterosexual transmission group, the detective value of PLA was the highest (consistent with the condition in the homosexual group), followed by ALT, AST, TBIL, leucocyte, and symptom severity. Conclusions. 
The univariate logistic regression combined with the AdaBoost algorithm could accurately screen the risk factors of HBV in HIV coinfection without invasive testing. Further studies are needed to evaluate the utility and feasibility of this model in various settings.
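The F1 values reported in the Results are the harmonic mean of precision and recall, so they can be cross-checked directly from the other two figures. The short sketch below does this for both transmission groups. The helper name `f1_from_precision_recall` and the dictionary layout are illustrative assumptions, while the numbers are those quoted in the abstract.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as quoted in the abstract for the AdaBoost model.
reported = {
    "homosexual": {"precision": 0.915, "recall": 0.944, "f1": 0.930},
    "heterosexual": {"precision": 0.881, "recall": 0.905, "f1": 0.893},
}

for group, m in reported.items():
    computed = f1_from_precision_recall(m["precision"], m["recall"])
    print(f"{group}: reported F1 = {m['f1']:.3f}, recomputed F1 = {computed:.3f}")
```

The recomputed values agree with the reported ones to within about 0.001, which is the size of discrepancy expected when precision and recall have themselves been rounded to three decimals.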

