Applying Naive Bayes Classifier to Document Clustering

Author(s):  
Jie Ji ◽  
Qiangfu Zhao

Document clustering partitions a set of unlabeled documents so that documents in the same cluster share common concepts. A Naive Bayes Classifier (BC) is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. BC requires only a small amount of training data to estimate the parameters required for classification. Since BC training data must be labeled, whereas the documents to be clustered are not, we propose an Iterative Bayes Clustering (IBC) algorithm. To improve IBC performance, we propose combining IBC with a Comparative Advantage-based (CA) initialization method. Experimental results show that our proposal improves performance significantly over classical clustering methods.
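The abstract does not spell out the IBC procedure, so the following is only a minimal sketch of one plausible reading: train a multinomial naive Bayes model on the current cluster assignments, reassign every document to its most probable cluster, and repeat until the assignments stop changing. The function names, the add-one smoothing, and the caller-supplied `init_labels` (standing in for the CA initialization, which is not described here) are all assumptions.

```python
import math
from collections import Counter


def train_nb(docs, labels, k, vocab):
    """Estimate log priors and log word likelihoods with add-one smoothing."""
    n = len(docs)
    log_prior, log_like = [], []
    for c in range(k):
        members = [d for d, l in zip(docs, labels) if l == c]
        # Smoothed prior, so an empty cluster never yields log(0).
        log_prior.append(math.log((len(members) + 1) / (n + k)))
        counts = Counter(w for d in members for w in d)
        total = sum(counts.values()) + len(vocab)
        log_like.append({w: math.log((counts[w] + 1) / total) for w in vocab})
    return log_prior, log_like


def predict(doc, log_prior, log_like):
    """Return the index of the cluster with the highest posterior score."""
    scores = [lp + sum(ll[w] for w in doc) for lp, ll in zip(log_prior, log_like)]
    return max(range(len(scores)), key=scores.__getitem__)


def iterative_bayes_clustering(docs, init_labels, k, max_iter=20):
    """Alternate between training on the current labels and relabeling.

    docs is a list of token lists; init_labels is an assumed starting
    assignment (the paper uses a CA-based initialization instead).
    """
    vocab = sorted({w for d in docs for w in d})
    labels = list(init_labels)
    for _ in range(max_iter):
        model = train_nb(docs, labels, k, vocab)
        new_labels = [predict(d, *model) for d in docs]
        if new_labels == labels:  # assignments are stable: stop
            break
        labels = new_labels
    return labels
```

Calling `iterative_bayes_clustering(docs, init_labels, k=2)` on tokenized documents returns one cluster index per document. Like any hard-assignment scheme of this kind, the outcome depends on the starting labels, which is consistent with the abstract's emphasis on initialization.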

2020 ◽  
Vol 17 (1) ◽  
pp. 37-42
Author(s):  
Yuris Alkhalifi ◽  
Ainun Zumarniansyah ◽  
Rian Ardianto ◽  
Nila Hardi ◽  
Annisa Elfina Augustia

Non-Cash Food Assistance, or Bantuan Pangan Non-Tunai (BPNT), is government food assistance given to Beneficiary Families (KPM) every month through an electronic account mechanism that can be used only to buy food at the Electronic Shop Mutual Assistance Joint Business Group Hope Family Program (e-Warong KUBE PKH) or from food traders working with Bank Himbara. In distributing BPNT, village officials, especially those of Desa Wanasari, still have difficulty deciding which households are eligible to receive it (poor) and which are not (not poor). Data mining can support this decision. This study compares two algorithms, the Naive Bayes Classifier and the C4.5 Decision Tree. The sample consists of 200 head-of-household records, split into 90% training data and 10% test data for validation. The proposed model is built in the RapidMiner application and evaluated with a confusion matrix to determine which of the two methods is more accurate. The results show an accuracy of 98.89% for the Naive Bayes Classifier and 95.00% for the C4.5 Decision Tree. The study concludes that the Naive Bayes Classifier is the more accurate algorithm, by a margin of 3.89 percentage points.
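As a brief illustration of how accuracy is read off a confusion matrix, the sketch below sums the diagonal (correct predictions) and divides by the total number of cases. The counts in `example` are hypothetical and are not taken from the study; the last line only rechecks the reported 3.89-point difference.

```python
def accuracy(confusion):
    """Accuracy = correctly classified cases (the diagonal) / all cases."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts, NOT taken from the study:
# rows = actual (poor, not poor), columns = predicted (poor, not poor).
example = [[50, 2],
           [3, 45]]

# Difference between the two reported accuracies, in percentage points.
reported_gap = round(98.89 - 95.00, 2)
```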


SinkrOn ◽  
2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Miftahul Kahfi Al Fath ◽  
Arini Arini ◽  
Nasrul Hakiem

Sentiment analysis is an important and emerging research topic. It is used to determine whether a person's opinion about an issue or object tends to be negative or positive. The main purpose of this study is to identify public sentiment toward the Full Day School policy in comments on the Facebook page of Kemendikbud RI and to measure the performance of the Naïve Bayes Classifier algorithm. The authors used the Naïve Bayes Classifier with character trigram and quadgram feature selection, two different training data models, and Lexicon Based labeling of the training data to classify public sentiment toward the Full Day School policy. The results show that negative public sentiment toward the Full Day School policy outweighs positive or neutral sentiment. The highest accuracy, 80%, was obtained by the Naïve Bayes Classifier with trigram feature selection on the 300-item training data model. The amount of training data and the feature selection used with the Naïve Bayes Classifier affected the accuracy.
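Character trigram and quadgram features can be illustrated with a short sketch that slides a window of n characters over the text. The helper name `char_ngrams` and the lowercasing step are assumptions; the study's actual preprocessing is not described in the abstract.

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text (n=3: trigrams, n=4: quadgrams)."""
    text = text.lower()  # assumed normalization step
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For instance, `char_ngrams("Full day", 3)` yields the six trigrams `ful`, `ull`, `ll `, `l d`, ` da`, and `day`; spaces are kept, so the features capture word boundaries as well.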


Repositor ◽  
2019 ◽  
Vol 1 (2) ◽  
pp. 125
Author(s):  
Vinna Rahmayanti ◽  
Setio Basuki ◽  
Hilman Hilman

Technological progress in computing is very rapid, and computers can now take over work originally done by humans. The case study of this research is a system that classifies text, such as a novel synopsis, into genre groups. Genre is the style of story in a novel; common genres include romance, comedy, mystery, horror, and others, and knowing a novel's genre tells the reader its story style. The methods used in this research are TF-IDF (Term Frequency-Inverse Document Frequency) and the Naïve Bayes Classifier. TF-IDF is used to compute the weight of each word in a document, and the resulting weights are used by the Naïve Bayes Classifier to classify each synopsis into a genre. Evaluation with a confusion matrix, using 600 training data and 200 test data, gave an accuracy of 80.5%.
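The TF-IDF weighting step can be sketched as follows. The abstract does not state which TF-IDF variant was used, so this example assumes the common form in which the term frequency is the relative frequency of a word in a document and the inverse document frequency is log(N / df); the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = (count / doc length) * log(N / df),
    an assumed variant; other smoothing schemes are also in common use.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

Under this variant a word that appears in every synopsis gets weight zero, so only words that help separate synopses carry weight into the classifier.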


Author(s):  
Mohammad Zoqi Sarwani ◽  
Muhammad Shubkhan Salafudin ◽  
Dian Ahkam Sani

With the growing use of Facebook among students, students can communicate and express what they feel in the form of status updates. Personality is a person's character, which shapes how that person adjusts to the surrounding environment in order to communicate smoothly. Psychological theory offers many ways to categorize personality. This study uses the Big Five theory, which describes personality in five traits, namely Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The Naive Bayes Classifier is used to determine the trait with the highest probability value. Two data sets are used, training data and testing data, both obtained from students' Facebook statuses. Testing the system on these data gave an accuracy of 88%.


2012 ◽  
Vol 5s1 ◽  
pp. BII.S8945 ◽  
Author(s):  
Irena Spasić ◽  
Pete Burnap ◽  
Mark Greenwood ◽  
Michael Arribas-Ayllon

The authors present a system developed for the 2011 i2b2 Challenge on Sentiment Classification, whose aim was to automatically classify sentences in suicide notes using a scheme of 15 topics, mostly emotions. The system combines machine learning with a rule-based methodology. The features used to represent a problem were based on lexico–semantic properties of individual words in addition to regular expressions used to represent patterns of word usage across different topics. A naïve Bayes classifier was trained using the features extracted from the training data consisting of 600 manually annotated suicide notes. Classification was then performed using the naïve Bayes classifier as well as a set of pattern–matching rules. The classification performance was evaluated against a manually prepared gold standard consisting of 300 suicide notes, in which 1,091 out of a total of 2,037 sentences were associated with a total of 1,272 annotations. The competing systems were ranked using the micro-averaged F-measure as the primary evaluation metric. Our system achieved the F-measure of 53% (with 55% precision and 52% recall), which was significantly better than the average performance of 48.75% achieved by the 26 participating teams.
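The micro-averaged F-measure used for ranking can be sketched as below: true positives, false positives, and false negatives are pooled across all topics before precision and recall are computed. The function names and the `(tp, fp, fn)` tuple layout are assumptions made for illustration; no per-topic counts from the challenge are reproduced here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts):
    """counts maps topic -> (tp, fp, fn); pool the counts, then compute P, R, F."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)
```

As a consistency check on the reported figures, `f_measure(0.55, 0.52)` is approximately 0.535, which agrees with the stated F-measure of 53%.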


2021 ◽  
Vol 5 (4) ◽  
pp. 389
Author(s):  
Muhammad Ikbal ◽  
Septi Andryana ◽  
Ratih Titi Komala Sari

The Covid-19 virus became a pandemic in 2020. Covid cases have spread across the whole world, reaching 63 million cases in 190 countries as of November 2020. Information on the spread of Covid is necessary for the general public. This research produces a system that provides information on the geographic distribution of Covid cases. The case distribution data were also used for classification with the Naive Bayes Classifier method, which uses probability calculations to classify the Covid status of an area. The study succeeded in providing information on the status of the Covid pandemic based on data on Covid cases from around the world. The case data serve as training data for the Naive Bayes classifier, which then determines the status of the Covid pandemic from test data provided by system users. This research has helped users learn the status of the Covid pandemic in an area because it has reliable training data. Keywords: System, Covid, Naïve Bayes Classifier.


2020 ◽  
Vol 1 (3) ◽  
pp. 185-199
Author(s):  
Khoirul Zuhri ◽  
Nurul Adha Oktarini Saputri

Twitter is a currently popular social media platform where the public is free to comment and write anything. It is not uncommon for the public to comment with harsh words and even hate speech. The 2019 presidential election drew many comments, some praising, some criticizing, and some insulting. Sentiment analysis is needed to extract information from such text and classify it. In this study, sentiment analysis is the process of classifying textual documents into two classes, namely negative and positive sentiment. Opinion data were obtained from Twitter in the form of tweets. The data comprised 3337 tweets, split into 80% training data and 20% test data. Training data is data with known sentiment. This study aims to determine whether a tweet written in Indonesian on Twitter is positive or negative. The tweets are classified with the naïve Bayes classifier algorithm. The classification results on the test data show that the Naïve Bayes Classifier achieves an accuracy of 71%. The accuracy for each sentiment is 71% for positive sentiment and 70% for negative sentiment.


Kilat ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 103-114
Author(s):  
Arini - Arini ◽  
Luh Kesuma Wardhani ◽  
Dimas - Octaviano

Ahead of the 2019 election year, many mass campaigns were conducted through social media networks, including Twitter. One very popular online campaign used the hashtag #2019GantiPresiden. Sentiment analysis of the #2019GantiPresiden hashtag requires a classifier and a robust feature selection that obtain high accuracy. One such combination is the Naive Bayes classifier (NBC) with Character Tri-Gram and Term-Frequency feature selection, which achieved fairly high accuracy in previous research. The purpose of this study was to implement the Naive Bayes classifier (NBC) with each feature selection and compare the accuracy obtained with the two. The authors used observation to collect data and ran simulations. Using 1,000 tweets from the hashtag #2019GantiPresiden taken on 15 September 2018, the authors divided the data into 950 tweets as training data and 50 tweets as test data, labeled using the Lexicon Based sentiment method. The Naïve Bayes classifier (NBC) reached an accuracy of 76% with Character Tri-Gram feature selection and 74% with Term-Frequency; the results show that Character Tri-Gram feature selection performs better than Term-Frequency.


Proceedings ◽  
2018 ◽  
Vol 2 (19) ◽  
pp. 1264
Author(s):  
Antonio Jiménez ◽  
Fernando Seco

This short paper presents the activity recognition results obtained by the CAR-CSIC team for the UCAmI’18 Cup. We propose a multi-event naive Bayes classifier for estimating 24 different activities in real time. We use all the sensor information provided for the competition, i.e., binary sensors fixed to everyday objects, proximity BLE-based tags, location-aware smart floor sensing, and the wrist’s acceleration. The results using training data sets of 7 days show accuracies (true positives) of about 68%; however, for the three extra data sets of the competition we were able to reach an accuracy of 60.5%.


2013 ◽  
Vol 23 (4) ◽  
pp. 787-795 ◽  
Author(s):  
Sona Taheri ◽  
Musa Mammadov

Naive Bayes is among the simplest probabilistic classifiers. It often performs surprisingly well in many real world applications, despite the strong assumption that all features are conditionally independent given the class. In the learning process of this classifier with the known structure, class probabilities and conditional probabilities are calculated using training data, and then values of these probabilities are used to classify new observations. In this paper, we introduce three novel optimization models for the naive Bayes classifier where both class probabilities and conditional probabilities are considered as variables. The values of these variables are found by solving the corresponding optimization problems. Numerical experiments are conducted on several real world binary classification data sets, where continuous features are discretized by applying three different methods. The performances of these models are compared with the naive Bayes classifier, tree augmented naive Bayes, the SVM, C4.5 and the nearest neighbor classifier. The obtained results demonstrate that the proposed models can significantly improve the performance of the naive Bayes classifier, yet at the same time maintain its simple structure.
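The classical learning process described above, in which class probabilities and conditional probabilities are estimated from training data and then used to classify new observations, can be sketched for discrete features as follows. This is the standard frequency-based baseline, not one of the paper's three optimization models; the function names and the add-one smoothing are assumptions.

```python
from collections import Counter, defaultdict


def fit(X, y):
    """Count class frequencies and per-class feature-value frequencies."""
    classes = Counter(y)
    n_features = len(X[0])
    # Distinct values seen for each feature, needed for add-one smoothing.
    values = [set(row[j] for row in X) for j in range(n_features)]
    cond = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            cond[(label, j)][v] += 1
    return classes, values, cond


def predict(model, row):
    """Return the class maximizing P(class) * product of P(value | class)."""
    classes, values, cond = model
    n = sum(classes.values())
    best, best_score = None, -1.0
    for c, n_c in classes.items():
        score = n_c / n  # class probability
        for j, v in enumerate(row):
            # Smoothed conditional probability of value v for feature j.
            score *= (cond[(c, j)][v] + 1) / (n_c + len(values[j]))
        if score > best_score:
            best, best_score = c, score
    return best
```

The product over features is exactly where the conditional independence assumption enters: each feature contributes its own factor given the class, with no interaction terms.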

