Improving the Accuracy of Text Classification using Stemming Method, A Case of Informal Indonesian Conversation

Abstract As social beings, humans constantly interact with one another using verbal or non-verbal language. Language is an arbitrary sound-symbol system used by members of a community to cooperate, interact, and identify themselves. The Indonesian language falls into two categories, formal and non-formal. The former meets the grammatical standard prescribed by the linguistic rules of the language, while the latter tends to deviate from it. In daily communication, however, non-formal language is used more intensively because it is more practical and easier to understand. This tendency causes problems in linguistic computation, because most linguistic computation relies on formal languages with standardized rules. This research aims to develop a dynamic, closed Indonesian corpus for airline ticket reservation, named "Incorbiz". "Incorbiz" is intended as a stemming tool for formal and non-formal Indonesian. Text processing, text normalization, and automatic data updating are proposed in this research. The research also compares two stemming techniques, "Sastrawi" and "Incorbiz", on a 30-sample dataset. Classification is performed with a Support Vector Machine (SVM). The data used to develop "Incorbiz" were taken from conversations between customer service staff and customers during airline ticket reservations. The results show that "Incorbiz" achieved higher accuracy than "Sastrawi", with scores of 0.89 and 0.67, respectively.
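The abstract above does not show how the corpus is used, so the following is only a minimal sketch of corpus-based normalization and stemming, assuming the corpus can be modeled as a lookup table from non-formal tokens to formal roots. The table entries and the names `NORMALIZATION`, `normalize_and_stem`, and `add_entry` are hypothetical illustrations and are not taken from "Incorbiz".

```python
# Hypothetical lookup table: non-formal token -> formal root word.
# These entries are illustrative only; they are NOT from the Incorbiz corpus.
NORMALIZATION = {
    "gak": "tidak",
    "udah": "sudah",
    "pesen": "pesan",
    "tiketnya": "tiket",
}


def normalize_and_stem(text, table=NORMALIZATION):
    """Lowercase, split on whitespace, and map each token through the table.

    Tokens that are not in the table are returned unchanged.
    """
    tokens = text.lower().split()
    return [table.get(token, token) for token in tokens]


def add_entry(table, informal, root):
    """Add or overwrite one mapping, a stand-in for the 'auto-update' idea."""
    table[informal.lower()] = root.lower()


if __name__ == "__main__":
    print(normalize_and_stem("Udah pesen tiketnya"))
```

A table lookup like this only covers tokens it has already seen, which is presumably why the abstract pairs the corpus with an automatic update step.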


Improving the Accuracy of Text Classification using Stemming Method, A Case of Non-formal Indonesian Conversation

10.21203/rs.3.rs-41431/v2 ◽

2020 ◽

Author(s):

Rianto Rianto ◽

Achmad Benny Mutiara ◽

Eri Prasetyo Wibowo ◽

Paulus Insap Santosa

Keyword(s):

Support Vector Machine ◽

Information Retrieval ◽

Text Classification ◽

Experimental Evaluation ◽

Hate Speech ◽

Text Processing ◽

High Accuracy ◽

Support Vector ◽

Support Vector Machine Algorithm ◽

Text Data

Abstract Stemming has long been used in data pre-processing for information retrieval, where it reduces affixed words to their root words. However, there are few stemming methods for non-formal Indonesian text processing. The existing stemming method has high accuracy for formal Indonesian but low accuracy for non-formal Indonesian. Thus, a stemming method that yields a highly accurate classifier model for non-formal Indonesian is still an open challenge. This study introduces a new stemming method to solve problems in non-formal Indonesian text data pre-processing. Furthermore, this study aims to provide comprehensive research on improving the accuracy of text classifier models by strengthening the stemming method. Using the Support Vector Machine algorithm, a text classifier model is developed, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods. The results show that, using the proposed stemming method, the text classifier model has higher accuracy than with the existing method, with scores of 0.85 and 0.73, respectively. In the future, the proposed stemming method can be used to develop Indonesian text classifier models for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications.


Design of Intelligent Customer Service Report System Based on Automatic Speech Recognition and Text Classification

E3S Web of Conferences ◽

10.1051/e3sconf/202129501064 ◽

2021 ◽

Vol 295 ◽

pp. 01064

Author(s):

Yunlong Zou ◽

Xiangyu Liu ◽

Hongyan Xu ◽

Yingzhe Hou ◽

Jialiang Qi

Keyword(s):

Speech Recognition ◽

Knowledge Base ◽

Customer Service ◽

Automatic Speech Recognition ◽

Text Classification ◽

Difficult Problem ◽

Intelligent Network ◽

Dynamic Tracking ◽

Service Staff ◽

Report System

Given the labor-intensive and speech-heavy nature of customer service reporting, this paper discusses the design of a customer service report system based on automatic speech recognition and big-data text classification. The proposed system provides a flat IVR menu, quick transcription and entry of work orders, dynamic tracking of failure hotspots, automatic classification and accumulation of the knowledge base, speech emotion detection, and real-time supervision of service quality. It can improve the user experience and reduce the workload of customer service staff. The automatically accumulated knowledge base can further help address a difficult problem: emerging intelligent network Q&A systems and intelligent robots rely on a manually summarized knowledge base.


Improving the Accuracy of Text Classification using Stemming Method, A Case of Non-Formal Indonesian Conversation

10.21203/rs.3.rs-41431/v3 ◽

2021 ◽

Author(s):

Rianto Rianto ◽

Achmad Benny Mutiara ◽

Eri Prasetyo Wibowo ◽

Paulus Insap Santosa

Keyword(s):

Text Classification ◽

Hate Speech ◽

Text Processing ◽

High Accuracy ◽

Small Error ◽

Support Vector ◽

Support Vector Machine Algorithm ◽

Text Data ◽

Accuracy Level ◽

The Impact

Abstract Background: Stemming has long been used in data pre-processing for information retrieval to trace affixed words back to their roots. Existing stemming methods for Indonesian have been studied and shown to achieve a high accuracy level. However, there are few stemming methods for non-formal Indonesian text processing. This study introduces a new stemming method to solve problems in non-formal Indonesian text data pre-processing. Furthermore, this study aims to improve the accuracy of text classifier models by strengthening the stemming method. Using the Support Vector Machine algorithm, a text classifier model is developed, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods. Findings: The results show that, using the proposed stemming method, the text classifier model has higher accuracy than with the existing method, with scores of 0.85 and 0.73, respectively. These results indicate that the proposed stemming method produces a classifier model with a small error rate, so it predicts the class of an object more accurately. Conclusion: The existing Indonesian stemming methods are still oriented towards formal Indonesian sentences and are therefore of limited use for non-formal Indonesian sentences. This limitation motivates developing a corpus that normalizes non-formal Indonesian into formal Indonesian and serves as a better stemming method. Using the corpus as a stemming method can improve the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications in Indonesian.


Support Vector Machines and Kernel Functions for Text Processing

Revista de Informática Teórica e Aplicada ◽

10.22456/2175-2745.39702 ◽

2013 ◽

Vol 20 (3) ◽

pp. 130 ◽

Cited By ~ 2

Author(s):

Celso Antonio Alves Kaestner

Keyword(s):

Text Classification ◽

Learning Algorithm ◽

Text Processing ◽

Dimensional Space ◽

Kernel Functions ◽

Support Vector ◽

Svm Classifier ◽

Vector Machines ◽

Automatic Text Classification ◽

Automatic Text

This work presents kernel functions that can be used with the Support Vector Machine (SVM) learning algorithm to solve the automatic text classification task. First, the Vector Space Model for text processing is presented. In this model, text is seen as a set of vectors in a high-dimensional space. Extensions and alternative models are then derived, and some preprocessing procedures are discussed. The SVM learning algorithm, widely employed for text classification, is outlined: its decision procedure is obtained as the solution of an optimization problem. The “kernel trick”, which allows the algorithm to be applied in non-linearly separable cases, is presented, along with some kernel functions currently used in text applications. Finally, some text classification experiments employing the SVM classifier are conducted to illustrate several text preprocessing techniques and the presented kernel functions.
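The abstract does not list which kernel functions the paper covers, so the following is only a sketch of three commonly used kernels (linear, polynomial, and RBF) on plain Python lists, together with a small check of the kernel trick for the homogeneous quadratic kernel in two dimensions. All function names and default parameters here are assumptions chosen for illustration.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=2, coef0=0.0):
    """(x . y + coef0) ** degree"""
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def quadratic_feature_map(x):
    """Explicit feature map for (x . y) ** 2 when x has two components."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


if __name__ == "__main__":
    x, y = [1.0, 2.0], [3.0, 4.0]
    # Kernel trick: the kernel value equals an inner product in feature space.
    print(polynomial_kernel(x, y))
    print(dot(quadratic_feature_map(x), quadratic_feature_map(y)))
```

The two printed values agree up to floating-point rounding, which is the point of the kernel trick: the kernel gives the feature-space inner product without ever building the feature vectors.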


Improving the accuracy of text classification using stemming method, a case of non-formal Indonesian conversation

Journal Of Big Data ◽

10.1186/s40537-021-00413-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Rianto ◽

Achmad Benny Mutiara ◽

Eri Prasetyo Wibowo ◽

Paulus Insap Santosa

Keyword(s):

Text Classification ◽

Hate Speech ◽

Text Processing ◽

High Accuracy ◽

Small Error ◽

Support Vector ◽

Support Vector Machine Algorithm ◽

Text Data ◽

Accuracy Level ◽

The Impact

Abstract Background: Stemming has long been used in data pre-processing for information retrieval to trace affixed words back to their roots. Existing stemming methods for Indonesian have been studied and shown to achieve a high accuracy level. However, there are few stemming methods for non-formal Indonesian text processing. This study introduces a new stemming method to solve problems in non-formal Indonesian text data pre-processing. Furthermore, this study aims to improve the accuracy of text classifier models by strengthening the stemming method. Using the Support Vector Machine algorithm, a text classifier model is developed, and its accuracy is checked. The experimental evaluation was done by testing 550 datasets in Indonesian using two different stemming methods. Findings: The results show that, using the proposed stemming method, the text classifier model has higher accuracy than with the existing method, with scores of 0.85 and 0.73, respectively. These results indicate that the proposed stemming method produces a classifier model with a small error rate, so it predicts the class of an object more accurately. Conclusion: The existing Indonesian stemming methods are still oriented towards formal Indonesian sentences and are therefore of limited use for non-formal Indonesian sentences. This limitation motivates developing a corpus that normalizes non-formal Indonesian into formal Indonesian and serves as a better stemming method. Using the corpus as a stemming method can improve the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other text processing applications in Indonesian.


Author identification of short texts using dependency treebanks without vocabulary

Digital Scholarship in the Humanities ◽

10.1093/llc/fqz070 ◽

2019 ◽

Vol 35 (4) ◽

pp. 812-825 ◽

Cited By ~ 1

Author(s):

Robert Gorman

Keyword(s):

Text Classification ◽

Authorship Attribution ◽

Support Vector ◽

Ancient Greek ◽

Combinatorial Explosion ◽

Independent Variables ◽

Author Identification ◽

Important Addition ◽

Digital Methods ◽

And Control

Abstract How to classify short texts effectively remains an important question in computational stylometry. This study presents the results of an experiment involving authorship attribution of ancient Greek texts. These texts were chosen to explore the effectiveness of digital methods as a supplement to the author’s work on text classification based on traditional stylometry. Here it is crucial to avoid confounding effects of shared topic, etc. Therefore, this study attempts to identify authorship using only morpho-syntactic data without regard to specific vocabulary items. The data are taken from the dependency annotations published in the Ancient Greek and Latin Dependency Treebank. The independent variables for classification are combinations generated from the dependency label and the morphology of each word in the corpus and its dependency parent. To avoid the effects of the combinatorial explosion, only the most frequent combinations are retained as input features. The authorship classification (with thirteen classes) is done with standard algorithms—logistic regression and support vector classification. During classification, the corpus is partitioned into increasingly smaller ‘texts’. To explore and control for the possible confounding effects of, e.g. different genre and annotator, three corpora were tested: a mixed corpus of several genres of both prose and verse, a corpus of prose including oratory, history, and essay, and a corpus restricted to narrative history. Results are surprisingly good as compared to those previously published. Accuracy for fifty-word inputs is 84.2–89.6%. Thus, this approach may prove an important addition to the prevailing methods for small text classification.
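The feature construction described above can be sketched roughly as follows, assuming each token is a dict with a dependency label, a morphology tag, and the index of its parent. The abstract does not give the exact combination scheme, so the single four-part combination used here, the toy tag values, and the function names are all hypothetical.

```python
from collections import Counter


def pair_features(sentence):
    """One feature per token: (label, morph, parent label, parent morph).

    Tokens without a parent (the root) are skipped.
    """
    features = []
    for token in sentence:
        head = token["head"]
        if head is None:
            continue
        parent = sentence[head]
        features.append(
            (token["label"], token["morph"], parent["label"], parent["morph"])
        )
    return features


def most_frequent_features(sentences, k):
    """Keep only the k most frequent combinations, limiting feature growth."""
    counts = Counter()
    for sentence in sentences:
        counts.update(pair_features(sentence))
    return [feature for feature, _ in counts.most_common(k)]


def vectorize(sentence, vocabulary):
    """Count vector over the retained combinations; no word forms are used."""
    counts = Counter(pair_features(sentence))
    return [counts[feature] for feature in vocabulary]


if __name__ == "__main__":
    # Toy annotation with made-up tags, not real treebank data.
    sentence = [
        {"label": "PRED", "morph": "verb", "head": None},
        {"label": "SBJ", "morph": "noun-nom", "head": 0},
        {"label": "OBJ", "morph": "noun-acc", "head": 0},
    ]
    vocabulary = most_frequent_features([sentence, sentence[:2]], k=1)
    print(vocabulary, vectorize(sentence, vocabulary))
```

The resulting count vectors could then be passed to any standard classifier, such as the logistic regression or support vector classification mentioned in the abstract.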


Text classification on the Instagram caption using support vector machine

Journal of Physics Conference Series ◽

10.1088/1742-6596/1722/1/012023 ◽

2021 ◽

Vol 1722 ◽

pp. 012023

Author(s):

P P Ramadhani ◽

S Hadi

Keyword(s):

Support Vector Machine ◽

Text Classification ◽

Support Vector


Support Vector Machine VS Information Gain: Analisis Sentimen Cyberbullying di Twitter Indonesia

Jurnal ULTIMA InfoSys ◽

10.31937/si.v11i2.1740 ◽

2020 ◽

Vol 11 (2) ◽

pp. 107-111

Author(s):

Christevan Destitus ◽

Wella Wella ◽

Suryasari Suryasari

Keyword(s):

Support Vector Machine ◽

Feature Selection ◽

Text Mining ◽

Information Gain ◽

Text Processing ◽

Support Vector ◽

Term Weighting ◽

System Process ◽

Research Stage

This study aims to classify tweets on Twitter using the Support Vector Machine and Information Gain methods. The classification aims to find a hyperplane that separates the negative and positive classes. The research stage includes a system process consisting of text mining and text processing, with the steps of tokenizing, filtering, stemming, and term weighting. Feature selection is then performed with information gain, which calculates the entropy value of each word. Finally, tweets are classified based on the selected features, and the output identifies whether a tweet is bullying or not. The study found that the Support Vector Machine and Information Gain methods produce satisfactory results.
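The entropy-based feature selection mentioned above can be illustrated with the standard definition of information gain: the entropy of the class labels minus the weighted entropy after splitting documents by whether they contain a given word. This is a generic sketch rather than the study's implementation; the toy documents and labels are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(docs, labels, word):
    """Entropy reduction from splitting docs on presence of `word`.

    `docs` is a list of token sets, aligned with `labels`.
    """
    with_word = [y for doc, y in zip(docs, labels) if word in doc]
    without_word = [y for doc, y in zip(docs, labels) if word not in doc]
    n = len(labels)
    conditional = (
        len(with_word) / n * entropy(with_word)
        + len(without_word) / n * entropy(without_word)
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Invented toy data: "a" separates the classes perfectly, "x" not at all.
    docs = [{"a", "x"}, {"a", "y"}, {"b", "x"}, {"b", "y"}]
    labels = ["bully", "bully", "not", "not"]
    for word in ("a", "x"):
        print(word, information_gain(docs, labels, word))
```

Words with the highest information gain would be kept as features before training the SVM.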


Headnote Prediction Using Machine Learning

The International Arab Journal of Information Technology ◽

10.34028/iajit/18/5/7 ◽

2021 ◽

Vol 18 (5) ◽

Author(s):

Sarmad Mahar ◽

Sahar Zafar ◽

Kamran Nishat

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Active Learning ◽

Text Classification ◽

Extraction Methods ◽

Text Summarization ◽

Training Data ◽

Second Step ◽

Support Vector ◽

Classification Algorithms

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issue discussed in the case. A headnote comprises two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop, and evaluate headnote prediction using machine learning, without human involvement. We divide the task into a two-step process. In the first step, we predict the law points used in the judgment using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this, we created a Databank by extracting data from different law sources in Pakistan, and we labelled training data generated from Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without a stemmer. With active learning, our system can continuously improve its accuracy as users of the system provide more labelled examples.
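As a small illustration of the tri-gram features mentioned above, the sketch below extracts word-level n-grams from raw text with no stemming step. The abstract does not say whether the tri-grams are word-level or character-level, so word-level is an assumption here, and `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text, n=3):
    """Return all contiguous word n-grams of lowercased, whitespace-split text.

    No stemming is applied, matching the 'without stemmer' setting.
    Texts shorter than n words yield an empty list.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(word_ngrams("The court held that"))
```

Each distinct tri-gram would then become one dimension of the feature vector passed to the linear support vector classifier.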


A Kernel-Based Approach for Biomedical Named Entity Recognition

The Scientific World JOURNAL ◽

10.1155/2013/950796 ◽

2013 ◽

Vol 2013 ◽

pp. 1-7 ◽

Cited By ~ 8

Author(s):

Rakesh Patra ◽

Sujan Kumar Saha

Keyword(s):

Kernel Function ◽

Text Processing ◽

Named Entity Recognition ◽

Kernel Functions ◽

Entity Recognition ◽

Machine Learning Techniques ◽

Support Vector ◽

Svm Classifier ◽

Named Entity ◽

Tree Kernel

The support vector machine (SVM) is one of the popular machine learning techniques used in various text processing tasks, including named entity recognition (NER). The performance of an SVM classifier largely depends on the appropriateness of the kernel function. In the last few years, a number of task-specific kernel functions have been proposed and used in various text processing tasks, for example the string kernel, graph kernel, and tree kernel. So far, very little effort has been devoted to developing kernels specific to the NER task. In the literature, we found that the tree kernel has been used in the NER task only for entity boundary detection or reannotation. The conventional tree kernel is unable to execute the complete NER task on its own. In this paper, we propose a kernel function, motivated by the tree kernel, which is able to perform the complete NER task. To examine the effectiveness of the proposed kernel, we applied the kernel function to the openly available JNLPBA 2004 data. Our kernel executes the complete NER task and achieves reasonable accuracy.
