Detection Of Spam Comments On Instagram Using Complementary Naïve Bayes

Author(s):  
Nur Azizul Haqimi ◽  
Nur Rokhman ◽  
Sigit Priyanta

Instagram (IG) is a web-based and mobile social media application where users share photos or videos. Uploaded photos and videos carry captions that explain the content, and these posts can attract spam comments. Spam comments are comments that are not relevant to the caption or the photo. A problem that arises when identifying spam is that non-spam comments far outnumber spam comments, which produces an imbalanced dataset. An imbalanced dataset can affect the performance of a classification method. This research therefore focuses on applying the Complementary Naïve Bayes (CNB) method to imbalanced datasets for detecting Instagram spam comments. The study used TF-IDF weighting, with Support Vector Machine (SVM) as the comparison classifier. In tests with 2500 training data and 100 test data on the imbalanced dataset (25% spam and 75% non-spam), CNB achieved 92% accuracy, 86% precision, and 93% f-measure, whereas SVM achieved 87% accuracy, 79% precision, and 88% f-measure. In conclusion, the CNB method is more suitable for detecting spam comments when the dataset is imbalanced.
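The complement idea behind CNB can be illustrated with a minimal sketch. It assumes raw term counts rather than the TF-IDF weights used in the study, Laplace smoothing with a hypothetical `alpha`, and a tiny invented set of comments; the function names `train_cnb` and `predict_cnb` are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def train_cnb(docs, labels, alpha=1.0):
    """Return, for each class, log-weights estimated from all OTHER classes."""
    vocab = sorted({tok for doc in docs for tok in doc})
    weights = {}
    for c in sorted(set(labels)):
        # Count terms in every document that does NOT belong to class c.
        complement = Counter()
        for doc, y in zip(docs, labels):
            if y != c:
                complement.update(doc)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((complement[t] + alpha) / total) for t in vocab}
    return weights

def predict_cnb(weights, doc):
    # Pick the class whose complement explains the document worst.
    scores = {c: sum(w[t] for t in doc if t in w) for c, w in weights.items()}
    return min(scores, key=scores.get)

# Hypothetical, imbalanced toy data: three non-spam comments, one spam comment.
docs = [["nice", "photo"], ["love", "this", "photo"],
        ["great", "caption"], ["follow", "me", "free", "followers"]]
labels = ["ham", "ham", "ham", "spam"]

model = train_cnb(docs, labels)
spam_guess = predict_cnb(model, ["free", "followers"])
ham_guess = predict_cnb(model, ["nice", "photo"])
```

Because each class is scored from the statistics of all the other classes, the minority class is estimated from the larger share of the data, which is the usual argument for using CNB on imbalanced datasets.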

Repositor ◽  
2019 ◽  
Vol 1 (1) ◽  
pp. 39
Author(s):  
Alimuddin Hasan Al Kabir ◽  
Setio Basuki ◽  
Galih Wasis Wicaksono

Public opinion is one of many instruments that can be used to evaluate an event. This research is motivated by two problems: (1) the Pelatihan Aplikasi Teknologi Informasi (Information Technology Application Training) program needs quality improvement, and (2) the volume of data means that the participants' collected opinions have not been fully utilized. The data consist of 1050 criticisms and suggestions taken from the 2016/2017 academic year. Support Vector Machine is used as the sentiment analysis method. The training process produces the best hyperplane, which serves as the reference for deciding which sentiment class best fits a sentence. Testing divides the dataset into 20% test data and 80% training data, so the analysis can be repeated for up to 5 iterations with different data arrangements. The test results show that the Accuracy, Precision, Recall, and F-Measure produced by the system are 82.08%, 83.42%, 81.16%, and 81.82%, respectively.
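One plausible reading of the 80%/20% split repeated 5 times is 5-fold cross-validation. The sketch below rests on that assumption, which the abstract does not confirm; `k_fold_indices` is a hypothetical helper, and shuffling is omitted for brevity.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 1050 opinions, as in the abstract: each fold holds out 20% for testing.
folds = list(k_fold_indices(1050, k=5))
sizes = [(len(train), len(test)) for train, test in folds]
```

Each of the 5 iterations trains on 840 items and tests on the remaining 210, and every item is used for testing exactly once.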


2019 ◽  
Vol 11 (2) ◽  
pp. 144
Author(s):  
Danar Wido Seno ◽  
Arief Wibowo

The growth of written content on social media produces many new words and abbreviations on Twitter, which makes it increasingly difficult for sentiment analysis to reach high accuracy on Twitter text. This study analyzes sentiment toward the pairs of candidates for President and Vice President of Indonesia in the 2019 Elections. To obtain higher accuracy and to accommodate the evolving vocabulary on Twitter, the authors combine an unsupervised method, namely Lexicon Based, with a supervised method, Support Vector Machine. The study used 800 Twitter data items from October 2018, collected with search keywords containing the names of each pair of candidates. With 80% training data and 20% testing data, the best accuracy obtained was 92.5%, with per-class Precision between 85.7% and 97.2% and per-class Recall between 78.2% and 93.5%. Because the Lexicon Based method labels the dataset, labeling the data for the Support Vector Machine is no longer done manually, and the lexicon dictionary can be extended as the content on Twitter develops.
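Lexicon-based labeling of the kind described above can be sketched as follows. The word lists are hypothetical English placeholders (the study would need an Indonesian lexicon), and the neutral class is an assumption, since the abstract does not list its classes.

```python
# Hypothetical sentiment lexicon; a real one would be far larger and in Indonesian.
POSITIVE = {"good", "great", "support"}
NEGATIVE = {"bad", "corrupt", "fail"}

def lexicon_label(text):
    """Label a text by counting positive and negative lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumed fallback when hits cancel out or none are found

labels = [lexicon_label(t) for t in
          ["Great program and good support", "bad and corrupt", "election day"]]
```

Labels produced this way can then serve as training targets for a supervised classifier, and adding new words to the two sets extends coverage as Twitter vocabulary changes.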


SINERGI ◽  
2020 ◽  
Vol 24 (2) ◽  
pp. 87
Author(s):  
Mona Cindo ◽  
Dian Palupi Rini ◽  
Ermatita Ermatita

As social media advances and grows, a large amount of data becomes available for research in social mining. Twitter is one microblogging service that can be used for this purpose. Many companies use Twitter data to analyze their customers' satisfaction with product quality. At the same time, many users use social media to express their daily emotions. These cases can be developed into research that serves both to improve product quality and to analyze opinion on certain events. Such research is often called sentiment analysis or opinion mining. Previous research has identified features that are particularly useful for sentiment analysis, but performance is still lacking. That work used Support Vector Machine as the classification method, while other researchers have found classification methods considered more efficient, such as Maximum Entropy. This research uses two datasets: general opinion data and airline opinion data. For feature extraction, we employ four feature types: pragmatic, lexical-grams, pos-grams, and sentiment lexical. For classification, we use both Support Vector Machine and Maximum Entropy to find the best result. The best result is achieved by Maximum Entropy, with 85.8% accuracy on general opinion data and 92.6% accuracy on airline opinion data.


2019 ◽  
Vol 6 (5) ◽  
pp. 543 ◽  
Author(s):  
Fitra A. Bachtiar ◽  
Indra K. Syahputra ◽  
Satrio A. Wicaksono

At the beginning of each semester, the academic section schedules and determines the courses to be offered for the next semester. However, this process has problems, such as too many classes being opened relative to the number of interested students, or vice versa. In addition, the data collected for the prediction problem tend to be imbalanced across classes (imbalance class), which leads to ineffective scheduling. A system that can predict which students will take a course is therefore needed. Many algorithms can be used for such prediction, so this study compares the performance of algorithms for classifying students who take courses. Predictions are modeled on attributes of student data, namely Grades, GPA, Cumulative GPA, Semester Credits, Cumulative Semester Credits, and Semester. For each observation, the classifier predicts whether the student takes a particular course. Results fall into 2 classes: 'Yes' for students predicted to take the course and 'No' for students predicted not to take it. The Synthetic Minority Oversampling Technique (SMOTE) is used to handle the imbalanced data. The classification methods compared are the k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) algorithms. Testing used 3 courses as samples. On average, k-NN predictions perform better than SVM. In addition, the use of SMOTE can influence the classification results in the form of increased AUC, CA, F1, precision, and recall values.
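The core step of SMOTE is interpolation between a minority sample and one of its nearest minority neighbors. The following is a minimal sketch of that step only, assuming Euclidean distance, a hypothetical `k`, and invented two-feature points; it is not the configuration used in the study.

```python
import math
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the chosen sample (excluding itself).
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points, e.g. (GPA, scaled credits).
minority = [(3.1, 2.0), (3.4, 2.2), (2.9, 1.8), (3.6, 2.5)]
new_points = smote(minority, n_synthetic=6)
```

Every synthetic point lies on a line segment between two existing minority points, so the minority class grows without simply duplicating samples.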


Hoax news on social media has had a dramatic effect on our society in recent years. Its impact is felt by many people as anxiety, financial loss, and loss of reputation. We therefore need a detection system that can help reduce hoax news on social media. Hoax news classification is one of the stages in building a hoax news detection system: an unsupervised learning algorithm is used to create the hoax news dataset, machine learning tools are used for data processing, and text processing is used for detection. The system then classifies the input text as a hoax or not a hoax. Hoax news classification in this study uses Support Vector Machine, Naïve Bayes, Decision Tree, Logistic Regression, Stochastic Gradient Descent, and Neural Network (MLP). These algorithms are compared to find the best one for detecting hoax news, judged by the highest accuracy, F-measure, precision, and recall. The test results show that the NN-MLP algorithm averages 93% for accuracy, F-measure, and precision, the highest compared with the five other algorithms, while the highest recall, 94%, comes from the SVM algorithm. The results of this experiment show that different classifiers behave differently, and suggest that the more hoax data is used as training data, the more accurate the system becomes.
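The four measures used to compare the classifiers can be computed from a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical counts for 100 test items (hoax = positive class).
metrics = classification_metrics(tp=45, fp=5, fn=3, tn=47)
```

A classifier can lead on precision while another leads on recall, which is why the abstract reports the best recall separately from the other three measures.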


CCIT Journal ◽  
2017 ◽  
Vol 10 (2) ◽  
pp. 197-206
Author(s):  
Atika Rahmawati ◽  
Aris Marjuni ◽  
Junta Zeniarja

Pilkada Serentak (simultaneous regional elections) is a very important event for the future of the regions and the country. Through this election, people can cast their votes and elect representatives of their choice. The public response can be expressed through the social media platform Twitter, and sentiment analysis of Twitter data can then reveal how the public responds to the simultaneous elections. Responses are classified from the text tweeted by Twitter users, because text is easy to obtain and process. This study classifies responses in Indonesian-language text and increases accuracy by using SVM. The classification follows a categorical approach that divides tweets into two basic classes: positive and negative. The data consist of 3000 Indonesian tweets. Labeling is not done manually but with a clustering method that divides the 3000 data into two groups: Cluster 1 as the group of positive tweets and Cluster 2 as the group of negative tweets. Of these, 2700 are used as training data and 300 as test data. The pre-processing stage includes tokenization, case normalization, stop word detection, and stemming. Classification is performed with a Support Vector Machine (SVM). SVM showed the highest accuracy, 91%, compared with 82% for k-means clustering.
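The pre-processing stage listed above can be sketched for three of its four steps. The stop-word list is a tiny hypothetical sample, lowercasing is applied before tokenizing for simplicity, and stemming is left out because it would need an Indonesian stemmer that the abstract does not name.

```python
import re

# Hypothetical, very short Indonesian stop-word list for illustration only.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari"}

def preprocess(tweet):
    """Lowercase a tweet, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Pilkada yang Damai dan Jujur!")
```

The resulting token lists are what a clustering or classification step would take as input after being converted to numeric features.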


2021 ◽  
Vol 2106 (1) ◽  
pp. 012009
Author(s):  
N Hayah ◽  
O Soesanto ◽  
M A Rahman

The Support Vector Machine (SVM) classification method can be applied in various fields, including meteorology and climatology for rainfall forecasting. This study classifies rainfall to identify the relationship between global phenomena and rainfall, and to evaluate the SVM method applied to rainfall in the Tanah Laut Regency. The analysis uses the multiclass SVM concept with 4 rainfall categories: low, medium, high, and extreme. The kernel used is the RBF kernel, with the parameters Cost (C) and Gamma (γ) each tuned over the values 1, 5, 10, and 15. The datasets are formed based on the annual period, climatic conditions, and seasonality. The Spearman rank correlation test describes the relationship between global phenomena and rainfall, with correlations ranging from −0.1456 to 0.43144 across the entire dataset. The SVM classification shows that Cost (C) = 10 and Gamma (γ) ≥ 5 obtained the highest accuracy of 100% on the training data. On the test data, the accuracy was good, namely 78.00% in La Nina conditions and 81.38% in seasonal periods.
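The parameter search described above can be pictured as a grid over C and γ, with γ controlling how quickly the RBF kernel decays with distance. This sketch uses the standard RBF definition and the grid values from the abstract; the two feature vectors are hypothetical, and no SVM is actually trained here.

```python
import math
from itertools import product

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

C_VALUES = [1, 5, 10, 15]
GAMMA_VALUES = [1, 5, 10, 15]
grid = list(product(C_VALUES, GAMMA_VALUES))  # every (C, gamma) candidate

# Hypothetical scaled feature vectors for two observations.
a, b = (0.2, 0.4), (0.5, 0.8)
similarities = {g: rbf_kernel(a, b, g) for g in GAMMA_VALUES}
```

As γ grows, the similarity between distinct points shrinks toward zero, so each training point influences only its immediate neighborhood. That behavior would be consistent with perfect training accuracy alongside lower test accuracy, although the abstract itself does not offer this explanation.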


2020 ◽  
Vol 7 (2) ◽  
pp. 379
Author(s):  
Agung Wahyu Setiawan ◽  
Alfie R. Ananda

One of the main problems in the palm oil industry is the grading of Fresh Fruit Bunches (FFB) at palm oil mills. The parameter used for grading is the number of fruitlets detached from the bunch. At present, FFB grading is conducted by human graders, which is subjective and often inconsistent because of the limits of human vision and of the ability to process the number of detached fruitlets per FFB in a very limited time. This study therefore developed a grading system that assesses FFB maturity based on spectroscopy and image contrast value. According to the literature review, the visible light and NIR spectrum at 680 and 780 nm can be used as light sources to detect the maturity level of FFB. The light sources used in this study are Light-emitting Diode (LED) lamps with wavelengths of 680 and 750 nm. A modified DSLR camera acquires the FFB images, so that two FFB images at 680 and 750 nm are obtained, and the contrast value of the two images is then calculated. The training data consist of 24 FFB, of which 10 are ripe and 14 are unripe. The test data consist of 77 FFB, of which 38 are ripe and 39 are unripe. Support Vector Machine (SVM) is used to classify the maturity level of FFB. The accuracy on the training data is 66.67%, while the accuracy on the test data is 57.14%. These results still need improvement; future work will focus on raising the accuracy of the system by increasing the amount of training and test data and by using machine learning.


2020 ◽  
Vol 4 (3) ◽  
pp. 504-512
Author(s):  
Faried Zamachsari ◽  
Gabriel Vangeran Saragih ◽  
Susafa'ati ◽  
Windu Gata

The decision to move Indonesia's capital city to East Kalimantan received mixed responses on social media. The still-high poverty rate and the country's difficult finances are factors in the disapproval of relocating the national capital. Twitter, as one of the popular social media platforms, is used by the public to express these opinions. The goals of this study are to determine the tendency of community responses to the relocation of the National Capital, and to perform sentiment analysis of public opinion on the relocation using the Naive Bayes algorithm and Support Vector Machine with Feature Selection to obtain the highest accuracy. The sentiment analysis data are Indonesian-language public opinions crawled from tweets on Twitter. The search terms used are #IbuKotaBaru and #PindahIbuKota. The research stages consist of collecting data from Twitter, polarity labeling, and preprocessing, which comprises transform case, cleansing, tokenizing, filtering, and stemming. Feature selection is applied to increase accuracy, and the data are then divided into testing and training sets according to predetermined ratios. The next step compares the Support Vector Machine and Naive Bayes methods to determine which is more accurate. In the data period studied, 24.26% of the sentiment toward the move to a new capital city was positive and 75.74% was negative. Using RapidMiner software, the best accuracy of Naive Bayes with Feature Selection is 88.24% at a ratio of 9:1, while the best accuracy of Support Vector Machine with Feature Selection is 78.77% at a ratio of 5:5.


2019 ◽  
Vol 6 (5) ◽  
pp. 190001 ◽  
Author(s):  
Katherine E. Klug ◽  
Christian M. Jennings ◽  
Nicholas Lytal ◽  
Lingling An ◽  
Jeong-Yeol Yoon

A straightforward method for classifying heavy metal ions in water is proposed using statistical classification and clustering techniques from non-specific microparticle scattering data. A set of carboxylated polystyrene microparticles of sizes 0.91, 0.75 and 0.40 µm was mixed with the solutions of nine heavy metal ions and two control cations, and scattering measurements were collected at two angles optimized for scattering from non-aggregated and aggregated particles. Classification of these observations was conducted and compared among several machine learning techniques, including linear discriminant analysis, support vector machine analysis, K-means clustering and K-medians clustering. This study found the highest classification accuracy using the linear discriminant and support vector machine analysis, each reporting high classification rates for heavy metal ions with respect to the model. This may be attributed to moderate correlation between detection angle and particle size. These classification models provide reasonable discrimination between most ion species, with the highest distinction seen for Pb(II), Cd(II), Ni(II) and Co(II), followed by Fe(II) and Fe(III), potentially due to its known sorption with carboxyl groups. The support vector machine analysis was also applied to three different mixture solutions representing leaching from pipes and mine tailings, and showed good correlation with single-species data, specifically with Pb(II) and Ni(II). With more expansive training data and further processing, this method shows promise for low-cost and portable heavy metal identification and sensing.

