Performance Evaluation of Naive Bayes Classifier with and without Filter Based Feature Selection

Customer Relationship Management (CRM) analyzes datasets to find insights that help frame business strategy for the improvement of enterprises. Analyzing CRM data requires highly intensive models, and Machine Learning (ML) algorithms help in analyzing such high-dimensional datasets. In most real-time datasets, the strong independence assumption that Naive Bayes (NB) makes between attributes is violated, and other drawbacks such as irrelevant, partially irrelevant, and redundant data lead to poor prediction performance. Feature selection is a preprocessing method applied to enhance the prediction of the NB model. Empirical experiments are conducted on NB with feature selection and NB without feature selection. This paper presents an empirical study of attribute selection using five different filter-based feature selection methods: Relief-F, Pearson correlation (PCC), Symmetrical Uncertainty (SU), Gain Ratio (GR), and Information Gain (IG).
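Three of the filter scores named above (IG, GR, and SU) are entropy-based and can be illustrated with a minimal sketch. It assumes discrete feature values and uses the standard textbook definitions; the function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature)."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    """GR = IG / H(feature); defined as 0 when the feature is constant."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

def symmetrical_uncertainty(feature, labels):
    """SU = 2 * IG / (H(feature) + H(labels)), which lies in [0, 1]."""
    denom = entropy(feature) + entropy(labels)
    return 2 * information_gain(feature, labels) / denom if denom > 0 else 0.0
```

A filter method would compute one such score per attribute, rank the attributes, and pass only the highest-ranked ones to the NB classifier.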

2019 ◽  
Vol 17 (1) ◽  
pp. 1
Author(s):  
Muqorobin Muqorobin ◽  
Kusrini Kusrini ◽  
Emha Taufiq Luthfi

The cost of education is a very important input component in implementing education, because funding is the main requirement for achieving educational goals. SMK Al-Islam Surakarta is a private educational institution that requires students to pay school fees in the form of Education Development Donations, a routine fee paid every month. According to last year's TU report, around 60% of students were late in paying the Education Development Donation, which is a major problem. The purpose of this study is to build a predictive system using the Naïve Bayes method, because the method can classify school fee payments as on time or late. The data were taken from the school's dapodik data for 2017/2018, with a test dataset of 30 records. To determine the level of accuracy, the study used the Naive Bayes method together with the Information Gain method for feature selection. Accuracy was tested with the Confusion Matrix method. The results showed that the highest accuracy, 90%, was obtained by combining the Naive Bayes method with the Information Gain method.
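As an illustration of confusion-matrix accuracy testing for a two-class problem like this one, here is a minimal sketch. The labels `"late"` and `"on_time"` and the sample data are hypothetical and are not the study's records.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for a binary problem, treating `positive` as the target class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all records that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical test records: one late payer is misclassified as on time.
actual = ["late", "late", "on_time", "on_time", "late"]
predicted = ["late", "on_time", "on_time", "on_time", "late"]
counts = confusion_counts(actual, predicted, positive="late")
print(counts, accuracy(*counts))
```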


Author(s):  
Oman Somantri ◽  
Dyah Apriliani

Assessing consumer sentiment about culinary food from social media provides useful information for anyone who wants it, especially migrants and tourists; on the other hand, the information is also valuable to food stall and restaurant owners for improving food quality. To obtain this information, a sentiment analysis classification model using the naïve Bayes (NB) algorithm was applied. The problem is that the accuracy of classifying consumer ratings of culinary food is still not optimal, because the weight values in the data preprocessing stage are not optimal. This paper proposes a hybrid feature selection model, combining information gain (IG) and a genetic algorithm (GA), to address the suboptimal selection of feature attributes. The results showed that, in the experiments and in comparison with other algorithms, the proposed model produced the best accuracy, 93%.
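For readers unfamiliar with how naïve Bayes classifies text, the following is a minimal sketch of a multinomial naïve Bayes sentiment classifier with add-one (Laplace) smoothing. It assumes tokenized reviews, it does not include the IG and GA feature selection steps, and the function names and toy reviews are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs, labels):
    """docs: list of token lists. Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior under the independence assumption."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids zero probabilities
                score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy reviews, for illustration only.
docs = [["tasty", "fresh"], ["tasty", "cheap"], ["bland", "cold"], ["cold", "slow"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(model, ["tasty"]))
```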


2020 ◽  
Vol 4 (3) ◽  
pp. 486
Author(s):  
Bintang Peryoga ◽  
Adiwijaya Adiwijaya ◽  
Widi Astuti

Cancer is a deadly disease that was responsible for 9.6 million deaths in 2018 according to WHO data, so early detection is needed so that the disease can be treated immediately and cancer deaths can be reduced. Microarray technology can monitor and analyze the expression of cancer genes, but microarray data have high dimensionality and small sample sizes, so dimension reduction is needed for an optimal classification process. Dimension reduction can reduce the number of features used for classification by selecting a few influential features. The hybrid method is a dimension reduction approach that combines a Filter method with a Wrapper method to gain the advantages of both. In this case, the researchers combined Naïve Bayes with Hybrid Feature Selection (Information Gain - Genetic Algorithm) on microarray cancer data for Lung Cancer, Ovarian Cancer, Breast Cancer, Colon Tumors, and Prostate Tumors. These data were obtained from the Kent-Ridge Biomedical Dataset. The results showed that, of the 5 datasets used, 4 obtained an accuracy between 87% and 100%, while the prostate tumor data obtained the lowest accuracy, 61.14%. The feature selection and classification of the 5 cancer datasets used fewer than 63 features to obtain this accuracy.


2019 ◽  
Vol 1196 ◽  
pp. 012021
Author(s):  
Ahmad Fali Oklilas ◽  
Tasmi ◽  
Sri Desy Siswanti ◽  
Mira Afrina ◽  
Herri Setiawan

2020 ◽  
Vol 11 (1) ◽  
pp. 1
Author(s):  
Riska Wibowo ◽  
Henny Indriyawati

Abstract. Hepatitis, one of the world's public health problems, is an inflammatory liver disease caused by viruses, bacterial infections, and chemical substances including drugs and alcohol. In this research, a weight was calculated for each attribute of the high-dimensional hepatitis dataset using the weight information gain method. The attributes were then selected using the top-k method and classified using the Naïve Bayes algorithm. The research showed that, out of 20 attributes, the 9 highest-weighted (top-9) were selected, with an accuracy rate of 85.57%. This research can serve as a consideration for decision making in various fields related to feature selection and the Naïve Bayes algorithm, and for predicting hepatitis.
Keywords: data mining, weight information gain, Naïve Bayes algorithm
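The top-k step described in this abstract amounts to ranking attributes by weight and keeping the k highest. A minimal sketch follows; the attribute names and weights are hypothetical placeholders, not values from the study.

```python
def select_top_k(weights, k):
    """Return the names of the k attributes with the largest weights, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical information gain weights, for illustration only.
weights = {"attr_a": 0.12, "attr_b": 0.40, "attr_c": 0.03, "attr_d": 0.25}
print(select_top_k(weights, 2))
```

Only the selected attributes are then passed to the classifier; the value of k (9 in the study) is a tuning choice.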


Many fraudulent transactions exist in the online world and affect various financial institutions, but credit card fraud is the most frequently occurring problem. Credit card fraud is the situation in which fraudsters misuse credit cards for illegal purposes; hence, detection of fraudulent transactions is essential. Several researchers have worked on detecting fraudulent transactions and have proposed solutions, which are surveyed in this paper. This study contributes to research on detecting credit card fraud through Machine Learning algorithms such as Decision Tree and Naive Bayes. The data were taken from Kaggle and split into training (80%) and testing (20%) sets. The whole experiment was performed in Jupyter Notebook, for which Anaconda Navigator was installed. A heatmap is used to visualize the data in color. The main aim of this work is to balance the dataset with the Near-Miss under-sampling method. The information gain method is applied for feature selection. The best algorithm found in this paper is Decision Tree, with 97% accuracy, compared to Naïve Bayes with 90%. The results are evaluated using Accuracy, Recall, Precision, and F1-score. The ROC curve and Precision-Recall curve of the algorithm are also shown.
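The abstract does not say which Near-Miss variant was used. As an assumption, the sketch below shows the idea behind the variant commonly called NearMiss-1: keep the majority-class points whose mean distance to their k nearest minority-class points is smallest, until both classes have the same size. The function name and sample points are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def near_miss_1(majority, minority, k=3):
    """Under-sample `majority` to len(minority) points (NearMiss-1 style)."""
    def mean_distance_to_nearest_minority(point):
        nearest = sorted(dist(point, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_distance_to_nearest_minority)
    return ranked[:len(minority)]

# Hypothetical 2-D points: two legitimate transactions lie close to the fraud cases.
legit = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
fraud = [(0.0, 1.0), (1.0, 1.0)]
print(near_miss_1(legit, fraud, k=2))
```

In practice a maintained implementation such as the `NearMiss` sampler in the imbalanced-learn package would normally be used instead of hand-written code.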


2021 ◽  
Vol 5 (1) ◽  
pp. 332
Author(s):  
Kurniabudi Kurniabudi ◽  
Abdul Harris ◽  
Albertus Edward Mintaria

High data dimensionality is one of the issues in anomaly detection. One approach to overcoming it is feature selection. An effective feature selection technique produces the most relevant features and can improve a classification algorithm's ability to detect attacks. There have been many studies on feature selection techniques, each using different methods and strategies to find the best and most relevant features. This study compares the Information Gain, Gain Ratio, CFs-BestFirst, and CFs-PSO Search techniques. The features selected by the four techniques were then validated with the Naive Bayes, k-NN, and J48 classification algorithms. The study uses the ISCX CICIDS-2017 dataset. The test results show that the feature selection technique affects the performance of Naive Bayes, k-NN, and J48, and that more relevant and important features can improve detection performance. The test results also show that the number of features influences processing time. CFs-BestFirst produces fewer features than CFs-PSO Search, Information Gain, and Gain Ratio, so it requires less processing time. In addition, k-NN requires more processing time than Naive Bayes and J48.


2019 ◽  
Vol 2 (4) ◽  
pp. 135
Author(s):  
Saipul Anwar ◽  
Fajar Septian ◽  
Ristasari Dwi Septiana

An Intrusion Detection System (IDS) is useful for detecting attacks or disturbances on a network or information system. Anomaly detection is a type of IDS that can detect attacks that deviate from normal network behavior based on statistical probability. The increasing use of the internet also increases interference and attacks from intruders or crackers who exploit weak internet protocols and application software. When many data packets arrive, they need to be analyzed, and data mining is a suitable technique for analyzing them. This study aims to classify IDS anomalies using the Naïve Bayes classification algorithm on attributes selected with correlation-based feature selection. The study uses the UNSW-NB15 intrusion detection dataset, consisting of 49 attributes and 321,283 records. Performance is measured by accuracy, precision, F-Measure, and ROC Area. Attribute selection with correlation-based feature selection leaves 4 attributes. Classifying IDS anomalies with the naïve Bayes algorithm without prior attribute selection gave an accuracy of 71.2%, while classification preceded by correlation-based attribute selection gave an accuracy of 74.8%. The accuracy of naïve Bayes classification can therefore be improved by first selecting attributes with correlation techniques.
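Correlation-based feature selection is usually described as scoring a candidate subset with a merit that rewards correlation with the class and penalizes correlation among the features. The sketch below uses the commonly cited merit formula, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff); it is assumed here for illustration, since the abstract does not give the exact formulation or search strategy used in the study, and the correlation values are hypothetical.

```python
from math import sqrt

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset.

    feature_class_corr: one correlation with the class per feature in the subset.
    feature_feature_corr: correlations for every pair of features in the subset
                          (an empty list for a single-feature subset).
    """
    k = len(feature_class_corr)
    r_cf = sum(feature_class_corr) / k  # mean feature-class correlation
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)  # mean feature-feature correlation
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# Two equally predictive features: redundancy between them lowers the merit.
print(cfs_merit([0.5, 0.5], [0.0]))  # uncorrelated features
print(cfs_merit([0.5, 0.5], [1.0]))  # fully redundant features
```

A search procedure then looks for the subset with the highest merit, which is how a large attribute set can shrink to a handful of attributes.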

