Improving the Accuracy of Text Classification using Stemming Method, A Case of Non-formal Indonesian Conversation

2020 ◽  
Author(s):  
Rianto Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Stemming has long been used in data pre-processing for information retrieval to reduce affixed words to their root words. However, few stemming methods exist for non-formal Indonesian text. Existing stemming methods achieve high accuracy on formal Indonesian but low accuracy on non-formal Indonesian, so a stemming method that yields an accurate classifier model for non-formal Indonesian remains an open challenge. This study introduces a new stemming method to address problems in pre-processing non-formal Indonesian text, and it aims to improve the accuracy of text classifier models by strengthening the stemming step. A text classifier model is built with the Support Vector Machine algorithm, and its accuracy is measured. The experimental evaluation tested 550 datasets in Indonesian using two different stemming methods. The results show that the text classifier model achieves higher accuracy with the proposed stemming method than with the existing method, scoring 0.85 and 0.73, respectively. In the future, the proposed stemming method can be used to develop Indonesian text classifier models for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications.

2021 ◽  
Author(s):  
Rianto Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Background: Stemming has long been used in data pre-processing for information retrieval to trace affixed words back to their roots. Existing stemming methods for Indonesian have been studied and shown to achieve a high level of accuracy. However, few stemming methods exist for non-formal Indonesian text. This study introduces a new stemming method to address problems in pre-processing non-formal Indonesian text, and it aims to improve the accuracy of text classifier models by strengthening the stemming step. A text classifier model is built with the Support Vector Machine algorithm, and its accuracy is measured. The experimental evaluation tested 550 datasets in Indonesian using two different stemming methods. Findings: The results show that the text classifier model achieves higher accuracy with the proposed stemming method than with the existing method, scoring 0.85 and 0.73, respectively. These results indicate that the proposed stemming method produces a classifier model with a small error rate, so it predicts the class of an object more accurately. Conclusion: Existing Indonesian stemming methods are still oriented towards formal Indonesian sentences, so they are of limited use for non-formal Indonesian sentences. This limitation motivates building a corpus that normalizes non-formal Indonesian into formal Indonesian and using it as the basis of a better stemming method. Using the corpus in stemming improves the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications in Indonesian.
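The normalize-then-stem idea described in the conclusion can be illustrated with a minimal sketch. It is not the authors' method: the lookup table, root-word list, and affix lists below are small hypothetical stand-ins for a real normalization corpus and stemming dictionary, and the affix stripping is deliberately naive.

```python
# Hypothetical lookup table standing in for a non-formal -> formal corpus.
INFORMAL_TO_FORMAL = {"gak": "tidak", "udah": "sudah", "aja": "saja", "yg": "yang"}

# Hypothetical root-word list; a real stemmer would use a full dictionary.
ROOT_WORDS = {"makan", "baca", "buku", "tidak", "sudah", "saja", "yang"}

# Tiny, incomplete affix lists chosen only for illustration.
PREFIXES = ("mem", "di", "ber", "ter")
SUFFIXES = ("kan", "nya", "an", "i")


def normalize(token):
    """Map a non-formal token to its formal form when the table knows it."""
    return INFORMAL_TO_FORMAL.get(token, token)


def stem(token):
    """Strip at most one prefix and one suffix if that yields a known root."""
    if token in ROOT_WORDS:
        return token
    candidates = [token]
    for prefix in PREFIXES:
        if token.startswith(prefix):
            candidates.append(token[len(prefix):])
    for candidate in list(candidates):
        for suffix in SUFFIXES:
            if candidate.endswith(suffix):
                candidates.append(candidate[: -len(suffix)])
    for candidate in candidates:
        if candidate in ROOT_WORDS:
            return candidate
    return token  # leave unknown words untouched


def preprocess(text):
    """Lowercase, split on whitespace, normalize, then stem each token."""
    return [stem(normalize(tok)) for tok in text.lower().split()]


print(preprocess("udah membaca bukunya"))
```

Normalizing first matters because a dictionary-backed stemmer can only confirm roots it knows; a token such as "gak" would otherwise pass through unchanged and reach the classifier as a separate feature from "tidak".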


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Background: Stemming has long been used in data pre-processing for information retrieval to trace affixed words back to their roots. Existing stemming methods for Indonesian have been studied and shown to achieve a high level of accuracy. However, few stemming methods exist for non-formal Indonesian text. This study introduces a new stemming method to address problems in pre-processing non-formal Indonesian text, and it aims to improve the accuracy of text classifier models by strengthening the stemming step. A text classifier model is built with the Support Vector Machine algorithm, and its accuracy is measured. The experimental evaluation tested 550 datasets in Indonesian using two different stemming methods. Findings: The results show that the text classifier model achieves higher accuracy with the proposed stemming method than with the existing method, scoring 0.85 and 0.73, respectively. These results indicate that the proposed stemming method produces a classifier model with a small error rate, so it predicts the class of an object more accurately. Conclusion: Existing Indonesian stemming methods are still oriented towards formal Indonesian sentences, so they are of limited use for non-formal Indonesian sentences. This limitation motivates building a corpus that normalizes non-formal Indonesian into formal Indonesian and using it as the basis of a better stemming method. Using the corpus in stemming improves the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications in Indonesian.


Author(s):  
Muhammad Zulqarnain ◽  
Rozaida Ghazali ◽  
Yana Mazwin Mohmad Hassim ◽  
Muhammad Rehan

As the amount of unstructured text data produced by humanity grows and ever more text appears on the Internet, intelligent techniques are required to process it and extract different types of knowledge from it. Gated recurrent units (GRU) and support vector machines (SVM) have been applied successfully to Natural Language Processing (NLP) systems with comparable, remarkable results. GRU networks perform well in sequential learning tasks and overcome the vanishing and exploding gradient problems of standard recurrent neural networks (RNNs) when capturing long-term dependencies. In this paper, we propose a text classification model that improves on this standard approach by using a linear support vector machine (SVM) in place of Softmax in the final output layer of a GRU model. Furthermore, the cross-entropy function is replaced with a margin-based function. Empirical results show that the proposed GRU-SVM model achieves comparatively better results than the baseline approaches BLSTM-C and DABN.
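To make the swap from cross-entropy to a margin-based function concrete, here is a small sketch that compares the two losses on the same output scores. The abstract does not say which margin-based function the paper uses; the Crammer-Singer style multiclass hinge below is an assumption chosen for illustration, and the score vectors are made up.

```python
import math


def softmax_cross_entropy(scores, target):
    """Cross-entropy of softmax(scores) against the index of the true class."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def multiclass_hinge(scores, target, margin=1.0):
    """Assumed margin-based loss: penalize only the strongest wrong class."""
    worst_wrong = max(s for i, s in enumerate(scores) if i != target)
    return max(0.0, margin + worst_wrong - scores[target])


# Hypothetical final-layer scores for a three-class problem, true class 0.
confident = [2.0, 0.5, -1.0]
uncertain = [0.2, 0.5, -1.0]

for name, scores in (("confident", confident), ("uncertain", uncertain)):
    ce = softmax_cross_entropy(scores, 0)
    hinge = multiclass_hinge(scores, 0)
    print(f"{name}: cross-entropy={ce:.4f} hinge={hinge:.4f}")
```

The design difference shows in the confident case: the hinge loss drops to exactly zero once the true class beats every other class by the margin, whereas cross-entropy stays positive and keeps pushing the scores apart.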


Repositor ◽  
2020 ◽  
Vol 2 (12) ◽  
pp. 1623
Author(s):  
Muhammad Fadliansyah ◽  
Setio Basuki ◽  
Yufis Azhar

Abstract Twitter is one of the most widely used social media platforms in Indonesia, serving not only as a means of sharing information about personal matters but also as a place to express opinions on a topic. Besides being a source of information, Twitter can also serve as a source of text data. The 2017 DKI Jakarta regional election (Pilkada DKI Jakarta 2017) is an interesting topic to examine, not only because it determined Jakarta's leadership for the next five years but also because of its influence on several sectors in Indonesia. Tweets discussing the election can be processed to obtain useful information, such as the sentiment expressed during this political event. That sentiment can then be used to predict stock prices during the election period. To obtain sentiment from Twitter text data, sentiment analysis is used to extract information from the collected tweets. For the sentiment analysis, the support vector machine algorithm is used to classify tweets into target classes. The sentiment classification results are used as one of the weights in a linear regression that predicts stock prices. The test results show that using sentiment about the 2017 DKI Jakarta election to predict stock prices works fairly well. The RMSE values obtained vary from stock to stock because the selected stocks come from different sectors: BBRI 58.974, SRTG 101.188, WIKA 52.042, ADHI 93.420, and APLN 17.342.
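The evaluation step can be sketched as follows: fit a linear regression and score it with RMSE. This is a simplified illustration rather than the paper's model. It assumes a single sentiment feature per day, and every number in it is hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical daily sentiment scores (share of positive tweets) and prices.
sentiment = [0.40, 0.55, 0.35, 0.60, 0.50]
price = [3010.0, 3080.0, 2990.0, 3120.0, 3050.0]

intercept, slope = fit_line(sentiment, price)
predicted = [intercept + slope * s for s in sentiment]
print(f"RMSE on the toy data: {rmse(price, predicted):.3f}")
```

RMSE is expressed in the same unit as the target, so values for stocks that trade at different price levels are not directly comparable, which is consistent with the abstract's remark that RMSE varies across sectors.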


2019 ◽  
Vol 15 (2) ◽  
pp. 275-280
Author(s):  
Agus Setiyono ◽  
Hilman F Pardede

It is now common for cellphones to receive spam messages. The large number of received messages makes it difficult for humans to classify them as spam or not spam. One way to overcome this problem is to use data mining for automatic classification. In this paper, we investigate several data mining techniques, namely Support Vector Machine, Multinomial Naïve Bayes, and Decision Tree, for automatic spam detection. Our experimental results show that Support Vector Machine is the best of the three evaluated algorithms: it achieves 98.33% accuracy, while Multinomial Naïve Bayes achieves 98.13% and Decision Tree 97.10%.
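As an illustration of one of the three techniques, the sketch below implements a small Multinomial Naïve Bayes classifier with Laplace smoothing using only the standard library. It is not the authors' code: the class name `TinyMultinomialNB` and the four training messages are hypothetical, and Naïve Bayes is shown only because it fits in a few lines, not because it scored best.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over whitespace-split tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


# Hypothetical training messages.
train_docs = ["win free prize now", "free cash offer",
              "see you at lunch", "meeting moved to monday"]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict("free prize offer"))
```

The accuracy figures in the abstract would come from applying a function like `accuracy` to held-out messages for each of the three classifiers.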

