Improving the accuracy of text classification using stemming method, a case of non-formal Indonesian conversation

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Background: Stemming has long been used in data pre-processing for information retrieval to reduce affixed words to their roots. Existing stemming methods for Indonesian have been evaluated and shown to achieve high accuracy. However, there are not many stemming methods for non-formal Indonesian text processing. This study introduces a new stemming method to address problems in pre-processing non-formal Indonesian text data. Furthermore, it aims to improve the accuracy of text classifier models by strengthening the stemming method. A text classifier model is developed using the Support Vector Machine algorithm, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods. Findings: The results show that the text classifier model built with the proposed stemming method is more accurate than the one built with the existing method, with scores of 0.85 and 0.73, respectively. These results indicate that the proposed stemming method produces a classifier model with a small error rate, so it predicts the class of an object more accurately. Conclusion: Existing Indonesian stemming methods are still oriented towards formal Indonesian sentences and are therefore of limited use for non-formal Indonesian sentences. This limitation motivates developing a corpus that normalizes non-formal Indonesian into formal Indonesian and using it as a better stemming method. Using the corpus as a stemming method can improve the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications in Indonesian.


2020 ◽  
Author(s):  
Rianto Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Stemming has long been used in data pre-processing for information retrieval to reduce affixed words to their root words. However, there are not many stemming methods for non-formal Indonesian text processing. The existing stemming method has high accuracy for formal Indonesian but low accuracy for non-formal Indonesian. Thus, a stemming method that yields a highly accurate classifier model for non-formal Indonesian remains an open challenge. This study introduces a new stemming method to address problems in pre-processing non-formal Indonesian text data. Furthermore, this study aims to provide comprehensive research on improving the accuracy of text classifier models by strengthening the stemming method. A text classifier model is developed using the Support Vector Machine algorithm, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods. The results show that the text classifier model built with the proposed stemming method is more accurate than the one built with the existing method, with scores of 0.85 and 0.73, respectively. In the future, the proposed stemming method can be used to develop Indonesian text classifier models for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications.


Water ◽  
2019 ◽  
Vol 11 (6) ◽  
pp. 1226 ◽  
Author(s):  
Mohammed Falah Allawi ◽  
Faridah Binti Othman ◽  
Haitham Abdulmohsin Afan ◽  
Ali Najah Ahmed ◽  
Md. Shabbir Hossain ◽  
...  

The current study explored the impact of climatic conditions on predicting evaporation from a reservoir. Several models have been developed for evaporation prediction under different scenarios, with artificial intelligence (AI) methods being the most popular. However, the existing models rely on several climatic parameters as inputs to achieve an acceptable accuracy level, some of which have been unavailable in certain case studies. In addition, the existing AI-based models for evaporation prediction have paid less attention to the influence of the time increment rate on the prediction accuracy level. This study investigated the ability of the radial basis function neural network (RBF-NN) and support vector regression (SVR) methods to develop an evaporation rate prediction model for a tropical area at the Layang Reservoir, Johor River, Malaysia. Two scenarios for input architecture were explored in order to examine the effectiveness of different input variable patterns on the model prediction accuracy. For the first scenario, the input architecture considered only the historical evaporation rate time series, while the mean temperature and evaporation rate were used as input variables for the second scenario. For both scenarios, three time-increment series (daily, weekly, and monthly) were considered.
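The first input scenario (historical evaporation only) and the three time increments can be illustrated with a minimal sketch. The number of lagged values, the block-averaging rule, and the sample values below are assumptions made for illustration; the abstract does not state how the authors built their inputs.

```python
from typing import List, Tuple


def make_lagged_samples(series: List[float], n_lags: int) -> Tuple[List[List[float]], List[float]]:
    """Build (inputs, targets): each target is predicted from the previous n_lags values."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t])
    return inputs, targets


def aggregate_mean(series: List[float], step: int) -> List[float]:
    """Average consecutive complete blocks of `step` values (e.g. 7 daily values per week)."""
    return [sum(series[i:i + step]) / step
            for i in range(0, len(series) - step + 1, step)]


# Made-up daily evaporation values, two weeks long (not data from the study).
daily = [4.0, 4.2, 3.9, 4.5, 4.1, 4.3, 4.4, 4.0, 3.8, 4.6, 4.2, 4.1, 4.3, 4.5]

X, y = make_lagged_samples(daily, n_lags=3)   # scenario 1 style inputs (assumed 3 lags)
weekly = aggregate_mean(daily, step=7)        # coarser time increment
```

The resulting `X` and `y` could then be passed to any regressor, such as an RBF network or an SVR model; for the second scenario a temperature value would be appended to each input row.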


Over the last two decades, the amount of Arabic text data available on the World Wide Web has grown dramatically, making Arabic the fourth most used language on the web. Accordingly, the demand for efficient Arabic text classification is increasing, especially for web page content filtering, information retrieval, and e-mail spam detection. Several Machine Learning algorithms have been implemented to classify Arabic documents. However, the results achieved are not comparable with those obtained in other languages such as English, particularly when the preprocessing techniques used do not take the features of the Arabic language into account. This paper investigates the impact of carefully selected preprocessing techniques on the efficiency of different text classification algorithms. The effects of stop word removal, stemming, lemmatization, and all possible combinations of them are examined. The reported improvements (+10.75% to +28.73%) demonstrate the effectiveness of using these techniques either individually or in combination.


2020 ◽  
Author(s):  
Rianto Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract As social beings, humans always interact with one another using either verbal or non-verbal language. Language is an arbitrary sound-symbol system used by members of a community to cooperate, interact, and identify themselves. The Indonesian language is classified into two categories, namely formal and non-formal. The former meets the grammatical standard prescribed by the linguistic rules of the language, while the latter tends to deviate from it. In daily communication, however, non-formal language is used more intensively because it is more practical and easier to understand. This tendency causes problems in linguistic computation, because most linguistic computation relies on formal languages that already have standardized rules. This research aims to develop a dynamic Indonesian closed corpus related to airline ticket reservation, namely "Incorbiz", which will be used as a stemming tool for formal and non-formal Indonesian. Text processing, text normalization, and auto-update of data are proposed in this research. This research also compares two stemming techniques, "Sastrawi" and "Incorbiz", on a 30-sample dataset. The classification algorithm used is Support Vector Machine (SVM). The data used to develop "Incorbiz" were taken from conversations between customer service staff and consumers in airline ticket reservations. The results show that "Incorbiz" achieved higher accuracy than "Sastrawi", with scores of 0.89 and 0.67, respectively.
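The idea of a closed corpus used as a stemming tool can be sketched as a simple lookup from non-formal tokens to formal root words. The dictionary entries and the function name below are hypothetical and are not taken from "Incorbiz"; the abstract does not describe the corpus format.

```python
import re

# Hypothetical mini-corpus: non-formal token -> formal root word.
# Illustrative entries only; they are not taken from "Incorbiz".
NORMALIZATION_CORPUS = {
    "gak": "tidak",
    "nggak": "tidak",
    "udah": "sudah",
    "gimana": "bagaimana",
    "tiketnya": "tiket",
    "pesen": "pesan",
}


def normalize(text: str, corpus: dict) -> list:
    """Lowercase, tokenize on letters, and replace tokens found in the corpus."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Tokens missing from the closed corpus are passed through unchanged.
    return [corpus.get(tok, tok) for tok in tokens]


result = normalize("Gimana, tiketnya udah bisa dipesen?", NORMALIZATION_CORPUS)
print(result)
```

In this sketch the token "dipesen" passes through unchanged because it has no entry, which illustrates why a closed corpus of this kind would need some update mechanism to keep up with new non-formal forms.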


Author(s):  
Muhammad Zulqarnain ◽  
Rozaida Ghazali ◽  
Yana Mazwin Mohmad Hassim ◽  
Muhammad Rehan

As the amount of unstructured text data produced by humanity and published on the Internet grows, intelligent techniques are required to process it and extract different types of knowledge from it. Gated recurrent units (GRU) and support vector machines (SVM) have been applied successfully in Natural Language Processing (NLP) systems, with remarkable comparative results. GRU networks perform well in sequential learning tasks and overcome the vanishing and exploding gradient problems that standard recurrent neural networks (RNNs) face when capturing long-term dependencies. In this paper, we propose a text classification model that improves on this norm by using a linear support vector machine (SVM) in place of Softmax in the final output layer of a GRU model. Furthermore, the cross-entropy function is replaced with a margin-based function. Empirical results show that the proposed GRU-SVM model achieves comparatively better results than the baseline approaches BLSTM-C and DABN.
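The difference between a Softmax cross-entropy objective and a margin-based one can be illustrated on a single vector of output scores. The abstract does not say which margin-based function the authors use, so the multi-class hinge loss below (summed over the wrong classes, with a margin of 1) is an assumption chosen for illustration.

```python
import math


def softmax_cross_entropy(scores, target):
    """Negative log-probability of the target class under a softmax over the scores."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def multiclass_hinge(scores, target, margin=1.0):
    """Sum of margin violations of every wrong class against the target class."""
    return sum(max(0.0, margin + s - scores[target])
               for j, s in enumerate(scores) if j != target)


scores = [2.0, 0.5, -1.0]  # made-up output-layer scores for three classes

ce_correct = softmax_cross_entropy(scores, target=0)
hinge_correct = multiclass_hinge(scores, target=0)
hinge_wrong = multiclass_hinge(scores, target=1)
```

When the target class already beats every other class by at least the margin, the hinge loss is exactly zero, whereas the cross-entropy stays positive and keeps pushing the scores apart.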


2021 ◽  
Vol 5 (1) ◽  
pp. 19-25
Author(s):  
Frizka Fitriana ◽  
Ema Utami ◽  
Hanif Al Fatta

The coronavirus outbreak, commonly referred to as COVID-19, has been officially designated a global pandemic by the World Health Organization (WHO). One appropriate step to minimize the impact of the virus is to develop a vaccine. However, vaccination of the Indonesian people is controversial and has prompted many people to voice their opinions; because the space for expressing opinions is limited, people choose social media as the place to channel public opinion. The Support Vector Machine algorithm performs better in terms of accuracy, precision and recall, with values of 90.47%, 90.23% and 90.78%, compared with 88.64%, 87.32% and 88.13% for the Naive Bayes algorithm, a difference of 1.83 percentage points in accuracy, 2.91 in precision and 2.65 in recall. In terms of time, the Naive Bayes algorithm performs better, taking 8.1 seconds against 11 seconds for the Support Vector Machine algorithm, a difference of 2.9 seconds. The sentiment analysis results are 8.76% neutral, 42.92% negative and 48.32% positive for Naive Bayes, and 10.56% neutral, 41.28% negative and 48.16% positive for SVM.
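The reported gaps between the two classifiers can be re-derived from the figures in the abstract. The short check below uses only those figures, taking the Naive Bayes recall to be 88.13%; the variable names are illustrative.

```python
# Figures as reported in the abstract (percentages and seconds).
svm = {"accuracy": 90.47, "precision": 90.23, "recall": 90.78}
naive_bayes = {"accuracy": 88.64, "precision": 87.32, "recall": 88.13}

# Gap in percentage points for each metric, rounded to two decimals.
gaps = {metric: round(svm[metric] - naive_bayes[metric], 2) for metric in svm}

# Gap in running time, in seconds (SVM 11 s, Naive Bayes 8.1 s).
time_gap = round(11 - 8.1, 1)

# Sentiment shares (neutral, negative, positive) should add up to 100% per model.
nb_sentiment_total = round(8.76 + 42.92 + 48.32, 2)
svm_sentiment_total = round(10.56 + 41.28 + 48.16, 2)

print(gaps, time_gap, nb_sentiment_total, svm_sentiment_total)
```

The computed gaps match the differences stated in the abstract, and both sets of sentiment shares sum to 100%.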


2019 ◽  
Vol 8 (4) ◽  
pp. 1333-1338

Text classification is a vital process due to the large volume of electronic articles. One of the difficulties of text classification is the high dimensionality of the feature space. Scholars have developed several algorithms to choose relevant features from article text, such as Chi-square (χ²), Information Gain (IG), and Correlation (CFS). These algorithms have been investigated widely for English text, while studies for Arabic text are still limited. In this paper, we investigated four well-known algorithms, Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree, against benchmark Arabic textual datasets called Saudi Press Agency (SPA) to evaluate the impact of feature selection methods. Using the WEKA tool, we experimented with applying the four classification algorithms with and without feature selection algorithms. The results provided clear evidence that the three feature selection methods often improve classification accuracy by eliminating irrelevant features.
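As an illustration of how a Chi-square score separates relevant from irrelevant features, the sketch below scores a term against a class from a 2x2 table of document counts. It is written in plain Python rather than WEKA, and the counts and term names are made up for the example.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts for two hypothetical terms over 100 documents.
term_counts = {
    "informative_term": (40, 10, 10, 40),  # concentrated in the class
    "neutral_term": (25, 25, 25, 25),      # spread evenly across classes
}

scores = {term: chi_square(*counts) for term, counts in term_counts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A feature selector of this kind would keep the highest-scoring terms and drop those whose score is near zero, which reduces the dimensionality of the feature space before a classifier is trained.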

