stop word
Recently Published Documents


TOTAL DOCUMENTS: 75 (FIVE YEARS: 39)

H-INDEX: 7 (FIVE YEARS: 1)

Author(s):  
Neha Garg ◽  
Kamlesh Sharma

Sentiment analysis (SA) is an enduring research area, especially in the field of text analysis, and text pre-processing is essential for performing SA accurately. This paper presents a text-processing model for SA of Twitter data using natural language processing techniques. The basic phases of the machine-learning pipeline are text collection, text cleaning, pre-processing, feature extraction, and categorization of the data according to the chosen SA technique. Keeping the focus on Twitter, the data are extracted in a domain-specific manner. The cleaning phase handles noisy data, missing data, punctuation, tags, and emoticons. Pre-processing then performs tokenization followed by stop word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset: after pre-processing, the accuracy of the classification techniques improved and dimensionality was reduced. The proposed corpus can be utilized for market analysis, customer behaviour, polling analysis, and brand monitoring, and the pre-processing pipeline can serve as a baseline for predictive analysis, machine learning, and deep learning algorithms, extensible according to the problem definition.
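
The cleaning, tokenization, and stop-word-removal steps described above can be sketched in a few lines of Python. The regular expressions and the tiny stop-word list are illustrative stand-ins, not the paper's actual resources:

```python
import re

# Small illustrative stop-word list; real pipelines use a fuller list
# (e.g. NLTK's English stop words).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "i"}

def clean_tweet(text):
    """Strip URLs, @-mentions, hashtags, punctuation, digits and emoticons."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # punctuation, emoticons, digits
    return text.lower()

def tokenize(text):
    return text.split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tweet = "I love the new phone! :) #tech https://t.co/x @shop"
tokens = remove_stop_words(tokenize(clean_tweet(tweet)))
print(tokens)  # only the content-bearing words survive
```

Stop word removal directly reduces dimensionality, which is the effect the abstract reports on classification accuracy.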


2021 ◽  
Vol 1 (1) ◽  
pp. 363-367
Author(s):  
Yuli Fauziah ◽  
Bambang Yuwono ◽  
Agus Sasmito Aribowo

This systematic literature review (SLR) examines trends in lexicon-based sentiment analysis research on Indonesian-language text over the last two years. The study focuses on the pre-processing used in these studies, the lexicons employed, and the classification accuracy achieved. The main question of the SLR is which lexicon-based sentiment analysis techniques provide the highest accuracy. The most widely used pre-processing methods in previous research are tokenization, case conversion, stemming, punctuation removal, stop word removal, removal or replacement of emoji and emoticons, and normalization or slang-word conversion. In previous studies, sentiment labeling is calculated by comparing the number of negative-sentiment keywords with the number of positive-sentiment keywords in a sentence. The maximum accuracy reported in previous studies is 90%. The most widely used lexicons are NRC and InSet, the latter a lexicon dictionary for Indonesian. This knowledge can be used to propose better models for lexicon-based sentiment analysis in Indonesian.
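
The labeling rule described above, comparing positive and negative keyword hits per sentence, can be sketched as follows; the miniature lexicons are hypothetical placeholders for dictionaries such as InSet or NRC:

```python
# Toy Indonesian sentiment lexicons (hypothetical stand-ins).
POSITIVE = {"bagus", "suka", "hebat", "senang"}    # "good", "like", "great", "happy"
NEGATIVE = {"buruk", "benci", "kecewa", "lambat"}  # "bad", "hate", "disappointed", "slow"

def label_sentence(tokens):
    """Label a tokenized sentence by comparing lexicon hit counts."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(label_sentence(["produk", "ini", "bagus", "dan", "hebat"]))
```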


2021 ◽  
Author(s):  
Sheela J ◽  
Janet B

Abstract This paper proposes a multi-document summarization model using an optimization algorithm named CAVIAR Sun Flower Optimization (CAV-SFO). In this method, two classifiers, a Generative Adversarial Network (GAN) and a Deep Recurrent Neural Network (Deep RNN), are utilized to generate scores for summarizing multiple documents. Initially, the SimHash method is applied to remove duplicate and near-duplicate content from the sentences. The result is given to the proposed CAV-SFO based GAN classifier to determine a score for each sentence. CAV-SFO is newly designed by incorporating CAVIAR into the Sun Flower Optimization (SFO) algorithm. In parallel, the duplicate-removed sentences from the input documents are pre-processed with stop word removal and stemming. Text-based features are then extracted from the pre-processed documents, and the CAV-SFO based Deep RNN generates a second score while its internal model parameters are optimally tuned. Finally, the scores generated by the CAV-SFO based GAN and the CAV-SFO based Deep RNN are hybridized, and the final score is obtained using a multi-document compression ratio. The proposed model showed improved results, with a maximal precision of 0.989, recall of 0.986, F-measure of 0.823, ROUGE precision of 0.930, and ROUGE recall of 0.870.
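
SimHash, used above for duplicate removal, maps each sentence to a fingerprint such that near-duplicates differ in only a few bits. A minimal sketch, using MD5 as an arbitrary per-token hash (the paper does not specify its hash function):

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash fingerprint: each token votes +1/-1 on every bit,
    and the fingerprint keeps the bits with a positive total."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing fingerprint bits."""
    return bin(a ^ b).count("1")

s1 = "the quick brown fox jumps over the lazy dog".split()
s2 = "the quick brown fox jumped over the lazy dog".split()  # near-duplicate
s3 = "completely unrelated text about stock markets".split()
print(hamming(simhash(s1), simhash(s2)), hamming(simhash(s1), simhash(s3)))
```

Sentences whose fingerprints fall within a small Hamming-distance threshold are treated as duplicates and dropped.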


2021 ◽  
Vol 9 (2) ◽  
pp. 244-252
Author(s):  
Rizka Safitri Lutfiyani ◽  
Niken Retnowati

Email is a popular digital communication medium because sending messages by email is easy. Unfortunately, many email messages are spam: unwanted messages, typically advertisements or scams. Ham, by contrast, denotes messages the recipient wants. One way to sort these messages is to classify emails as spam or ham. Naïve Bayes and the J48 decision tree are algorithms that can perform this classification, so this study compares their effectiveness for spam filtering. The method used is text mining: English-language email texts are pre-processed before classification with Naïve Bayes and the J48 decision tree. The pre-processing stage comprises tokenization, stop word removal, stemming, and attribute selection. The Naïve Bayes algorithm is a classifier based on Bayesian decision theory, while the J48 decision tree is a development of the ID3 decision tree algorithm. The result is that the J48 decision tree achieves higher accuracy than Naïve Bayes: 93.117% versus 88.5284%. The study concludes that, in terms of accuracy, the J48 decision tree outperforms Naïve Bayes for sorting spam email.
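
The spam/ham classification step can be illustrated with a toy multinomial Naïve Bayes in plain Python. This is a sketch of the general technique, not the study's implementation, and the four training emails are invented:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, vocab."""
    priors, counts, vocab = Counter(), {"spam": Counter(), "ham": Counter()}, set()
    for tokens, label in docs:
        priors[label] += 1
        counts[label].update(tokens)
        vocab.update(tokens)
    return priors, counts, vocab

def classify(tokens, priors, counts, vocab):
    """Pick the label with the highest log-posterior (Laplace smoothing)."""
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    ("win free prize now".split(), "spam"),
    ("free money click now".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch tomorrow at noon".split(), "ham"),
]
model = train(docs)
print(classify("free prize click".split(), *model))  # -> spam
```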


Author(s):  
Kanika Sharma

Abstract: Any story or other literary content is best understood and promoted with the help of pictures; images arouse the reader's interest in, and comprehension of, the content. The contextual image illustrator takes any content description and outputs ranked images related to that content. The text can be a blog, a newspaper article, a story, or any other content. The image retrieval process used for this purpose is Text-Based Image Retrieval (TBIR). Semantic keywords are extracted from the story, and images are looked up in an annotated database. An image-ranking scheme then determines the relevance of each image, and the user can choose among the displayed images. A score representing each image's relevance to the query is displayed alongside it. The document explains keyword stemming and stop word removal, as well as the algorithm designed to compute the score and hence each image's significance. Testing of the project, comprising both unit testing and module testing, is described. Keywords: Keyword Extraction, Image Search, Stemming, Stop word Removal, URL Score, URL Ranking
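
The ranking idea can be sketched as keyword overlap between the story's extracted keywords and each image's annotation tags. The paper's exact URL-score formula is not reproduced here, so the score function, the image names, and the annotations below are all hypothetical:

```python
def score(keywords, annotations):
    """Fraction of the story's keywords covered by an image's annotations."""
    if not keywords:
        return 0.0
    return len(set(keywords) & set(annotations)) / len(set(keywords))

def rank_images(keywords, image_db):
    """image_db: dict mapping image id -> annotation tags. Best match first."""
    return sorted(image_db, key=lambda img: score(keywords, image_db[img]),
                  reverse=True)

story_keywords = ["castle", "princess", "forest"]
image_db = {
    "img1.jpg": ["castle", "night", "moon"],
    "img2.jpg": ["castle", "princess", "garden"],
    "img3.jpg": ["car", "city"],
}
print(rank_images(story_keywords, image_db))  # img2 covers the most keywords
```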


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sireesha Jasti

Purpose The internet has undergone tremendous change with the advancement of new technologies, and this change has led internet users to post comments about services and products. Sentiment classification is the process of analyzing such reviews to help users decide whether to purchase a product. Design/methodology/approach A rider feedback artificial tree optimization-enabled deep recurrent neural network (RFATO-enabled deep RNN) is developed for the effective classification of sentiments into grades. The proposed RFATO algorithm is modeled by integrating the feedback artificial tree (FAT) algorithm into the rider optimization algorithm (ROA), and it is used for training the deep RNN classifier that categorizes the sentiments in review data. Pre-processing is performed by stemming and stop word removal, which remove redundancy and smooth subsequent processing. Features including SentiWordNet-based features, a variant of term frequency-inverse document frequency (TF-IDF) features, and spam-word-based features are extracted from the review data to form the feature vector. Feature fusion is then performed based on the entropy of the extracted features. The metrics employed for evaluation are accuracy, sensitivity, and specificity. Findings Using the proposed RFATO algorithm, accuracy, sensitivity, and specificity are maximized compared with the existing algorithms. Originality/value The proposed RFATO algorithm integrates the FAT algorithm into the ROA to train a deep RNN classifier for sentiment classification of review data, combining SentiWordNet-based, TF-IDF-variant, and spam-word-based features through entropy-based feature fusion.
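
The TF-IDF feature-extraction step can be sketched in plain Python. This computes standard TF-IDF; the paper uses an unspecified variant, and the two-review corpus is invented:

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = {}                                   # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [
    "good product works great".split(),
    "bad product stopped working".split(),
]
vecs = tf_idf(docs)
# "product" appears in every document, so its IDF (and hence weight) is zero;
# discriminative terms like "good" get positive weights.
print(vecs[0]["product"], vecs[0]["good"])
```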


Author(s):  
Jannatul Ferdousi Sohana ◽  
Ranak Jahan Rupa ◽  
Moqsadur Rahman

Author(s):  
Doo-San Kim ◽  
Byeong-Cheol Lee ◽  
Kwang-Hi Park

Despite the unique characteristics of urban forests, the motivating factors of urban forest visitors have not been clearly differentiated from those of visitors to other types of forest resources. This study aims to identify the motivating factors of urban forest visitors using latent Dirichlet allocation (LDA) topic modeling on social big data. A total of 57,449 social text posts containing the keyword "urban forest" were collected from blogs on Naver and Daum, the major search engines in South Korea. After 17,229 cases were excluded through morpheme analysis and stop word elimination, the remaining 40,110 cases were analyzed with LDA topic modeling to identify the motivating factors of urban forest visitors. Seven motivating factors were extracted, each containing five keywords: "Cafe-related Walk", "Healing Trip", "Daily Leisure", "Family Trip", "Wonderful View", "Clean Space", and "Exhibition and Photography". This study elucidates the role of urban forests as places for healing, leisure, and daily exercise. The results suggest that efforts should be made toward developing programs around the basic functionality of urban forests as a natural resource and as a unique setting that supports diverse leisure and cultural activities.
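
The LDA step can be illustrated with a minimal collapsed Gibbs sampler. Production studies use tuned implementations such as gensim's LdaModel; the four-document corpus below is a hypothetical miniature, and the hyperparameters are arbitrary:

```python
import random
from collections import defaultdict

def lda(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed-Gibbs LDA; returns the most probable word per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    nzw = defaultdict(int)   # topic-word counts
    ndz = defaultdict(int)   # document-topic counts
    nz = defaultdict(int)    # topic totals
    z = [[rng.randrange(k) for _ in d] for d in docs]   # random init
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]   # remove current assignment, then resample
                nzw[t, w] -= 1; ndz[d, t] -= 1; nz[t] -= 1
                weights = [(ndz[d, t2] + alpha) * (nzw[t2, w] + beta)
                           / (nz[t2] + beta * len(vocab)) for t2 in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][i] = t
                nzw[t, w] += 1; ndz[d, t] += 1; nz[t] += 1
    return [max(vocab, key=lambda w: nzw[t, w]) for t in range(k)]

docs = [
    "forest walk trees healing".split(), "healing walk forest air".split(),
    "cafe coffee photo exhibition".split(), "coffee cafe photo view".split(),
]
print(lda(docs, k=2))
```

Each topic's highest-probability words are then read off and interpreted as a motivating factor, as in the seven factors listed above.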


Author(s):  
Gokul Yenduri ◽  
B. R. Rajakumar ◽  
K. Praghash ◽  
D. Binu

The identification of opinions and sentiments in tweets is termed Twitter Sentiment Analysis (TSA). The major task of TSA is to determine the sentiment or polarity of a tweet and classify it as negative or positive. Several methods have been introduced for TSA; however, it remains challenging because of slang words, modern accents, and grammatical and spelling mistakes that existing techniques cannot resolve. This work develops a novel customized BERT-oriented sentiment classifier encompassing two main phases: pre-processing and tokenization, followed by classification with a customized Bidirectional Encoder Representations from Transformers (BERT) model. First, the gathered raw tweets are pre-processed with stop word removal, stemming, and blank-space removal. From the resulting semantic words, meaningful tokens are extracted in the tokenization phase. These tokens are then classified by an optimized BERT whose weights and biases are tuned by Particle-Assisted Circle Updating Position (PA-CUP); the maximal sequence length of the BERT encoder is also updated using standard PA-CUP. Finally, a performance analysis is carried out to substantiate the enhancement of the proposed model.
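
BERT tokenizers split words into subword tokens by greedy longest-match over a WordPiece vocabulary, which is how the tokenization phase above produces tokens even for unseen words. A sketch, with a hypothetical five-entry vocabulary in place of BERT's real ~30,000-entry one:

```python
# "##" marks a continuation piece, as in BERT's WordPiece vocabularies.
VOCAB = {"play", "##ing", "##ed", "great", "##ly", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match subword split; whole word -> [UNK] on failure."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                      # try longest piece first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece            # continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens

print(wordpiece("playing"))   # -> ['play', '##ing']
```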


2021 ◽  
Author(s):  
Perumal P ◽  
Mathivanan B

Abstract Automatic document clustering and topic extraction from a corpus are essential requirements in many real-time applications, since clustering and topic detection help locate data quickly. Hence, this paper develops Type-2 Intuitionistic Fuzzy Clustering with the Seagull Optimization Algorithm (Type 2 IFCSOA) for document clustering and topic detection. Type 2 IFCSOA is utilized to cluster the documents, and an ensemble approach identifies the topics in the clustered documents. In the proposed methodology, pre-processing (tokenization, stop word removal, and stemming) removes unwanted information from the documents before clustering, and the clustered documents are labeled on the basis of their clusters. Topic detection is then achieved by the ensemble approach with feature-extraction phases comprising term frequency-inverse document frequency (TF-IDF), mutual information (MI), the TextRank algorithm, and keyword extraction from co-occurrence statistical information (CSI). The proposed methodology is implemented in MATLAB and evaluated with statistical measurements such as precision, recall, accuracy, sensitivity, purity, and entropy. The proposed method is compared with conventional methods such as fuzzy C-means clustering (FCM), FCM with particle swarm optimization (FCM-PSO), FCM with a genetic algorithm (FCM-GA), and k-means clustering.
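
Of the feature-extraction phases listed above, TextRank ranks words by running PageRank over a word co-occurrence graph. A minimal unweighted sketch (the sliding-window size and damping factor are the usual defaults, and the sample sentence is invented):

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iters=50):
    """Rank words by PageRank over a co-occurrence graph (sliding window)."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                graph[w].add(tokens[j])
                graph[tokens[j]].add(w)
    rank = {w: 1.0 for w in graph}
    for _ in range(iters):                    # power iteration
        new = {}
        for w in graph:
            new[w] = (1 - damping) + damping * sum(
                rank[u] / len(graph[u]) for u in graph[w])
        rank = new
    return sorted(rank, key=rank.get, reverse=True)   # best keywords first

tokens = ("document clustering groups similar documents topic detection "
          "finds topics inside document clusters").split()
print(textrank(tokens)[:3])
```

The top-ranked words serve as candidate topic keywords for each cluster.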

