Multi Class Data Classification to Improve Accuracy in Sentiment Analysis using Machine Learning

Daram Vishnu

doi:10.22214/ijraset.2021.35291

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.35291 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 1457-1461

Author(s):

Daram Vishnu

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Confusion Matrix ◽

Training Data ◽

Natural Languages ◽

Parts Of Speech ◽

Testing Data ◽

Improve Accuracy ◽

Textual Form ◽

Speech Tagging

Sentiment analysis means classifying a text into different emotional classes. Most current sentiment analysis techniques classify text into two or three classes; in this paper we classify movie reviews into five classes. Multi-class sentiment analysis identifies the precise sentiment of a review, not just the polarity of a given textual statement from positive to negative. It has always been a challenging task because natural languages are difficult to represent mathematically. The number of features is also generally large, which requires huge computational power, so to reduce the number of features we use parts-of-speech tagging with TextBlob to extract the important features. Sentiment analysis is done using machine learning, which requires training data to fit a model and testing data to evaluate it. Various kinds of models are trained and tested, and finally one model is selected based on its accuracy and confusion matrix. It is important to analyze reviews in textual form because a large number of reviews is available all over the web. Analyzing textual reviews can help firms that are trying to find out how the market responds to their products. In this paper, sentiment analysis is demonstrated on movie reviews taken from the IMDB website.
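The model selection step described above relies on accuracy and a confusion matrix over five classes. The following is a minimal sketch of how those two quantities relate, assuming a hypothetical 1 to 5 rating scale (the abstract does not name its classes) and toy labels invented for illustration; it is not the author's code.

```python
# Hypothetical five-point rating scale; the abstract does not name its classes.
LABELS = [1, 2, 3, 4, 5]

def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix whose rows are true classes and columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correct predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

# Toy labels invented for illustration, not data from the paper.
y_true = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
y_pred = [1, 2, 3, 5, 5, 4, 4, 3, 1, 1]

cm = confusion_matrix(y_true, y_pred, LABELS)
acc = accuracy(cm)
print(cm)
print(acc)
```

Off-diagonal cells show which classes are confused with each other (here, true 4 predicted as 5 and the reverse), which a single accuracy figure hides.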

A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽

10.1186/s40649-020-00080-x ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Huu-Thanh Duong ◽

Tram-Anh Nguyen-Thi

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Supervised Learning ◽

Data Augmentation ◽

Original Data ◽

Training Data ◽

Unseen Data ◽

Augmentation Techniques ◽

User Intervention

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires sufficiently large pre-labeled datasets in particular domains. Such datasets are tedious, expensive, and time-consuming to build, and supervised models handle unseen data poorly. This paper applies semi-supervised learning to Vietnamese sentiment analysis, for which datasets are limited. We summarize many preprocessing techniques that were applied to clean and normalize data, together with negation handling and intensification handling, to improve performance. Moreover, data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention, are also presented. In experiments, we evaluated various aspects and obtained competitive results, which may motivate further proposals.
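Negation handling, one of the preprocessing steps mentioned above, is often implemented by marking the words that follow a negation cue. The sketch below shows that common scheme under stated assumptions: the cue list is a small English one chosen for illustration (the paper works on Vietnamese text and its cue list is not given in the abstract), and the `NOT_` prefix is a convention, not necessarily the authors' method.

```python
import re

# Illustrative English negation cues; the paper's Vietnamese cue list is not given.
NEGATIONS = {"not", "no", "never"}
PUNCTUATION = set(".,!?;")

def mark_negation(text):
    """Prefix tokens that follow a negation cue with NOT_ until the next punctuation mark."""
    tokens = re.findall(r"\w+|[.,!?;]", text.lower())
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation closes the negation scope
            out.append(tok)
        elif tok in NEGATIONS:
            negated = True           # open a negation scope
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out

result = mark_negation("The film is not good, but the music is great.")
print(result)
```

After this step, "good" and "NOT_good" become distinct features, so a bag-of-words classifier can tell "good" from "not good".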

Looking Under the Hood of Stochastic Machine Learning Algorithms for Parts of Speech Tagging

SSRN Electronic Journal ◽

10.2139/ssrn.2726830 ◽

2008 ◽

Author(s):

Jana Diesner ◽

Kathleen M. Carley

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Parts Of Speech ◽

Speech Tagging

A sentiment analysis system for social media using machine learning techniques: Social enablement

Digital Scholarship in the Humanities ◽

10.1093/llc/fqy037 ◽

2018 ◽

Vol 34 (3) ◽

pp. 569-581 ◽

Cited By ~ 1

Author(s):

Sujata Rani ◽

Parteek Kumar

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Media Analysis ◽

Training Data ◽

Machine Learning Techniques ◽

Support Vector ◽

Analysis Tool ◽

Data Set ◽

Learning Techniques

In this article, an innovative approach to sentiment analysis (SA) is presented. The proposed system handles Romanized or abbreviated text and spelling variations in the text when performing sentiment analysis. The training data set of 3,000 movie reviews and tweets was manually labeled by native speakers of Hindi in three classes: positive, negative, and neutral. The system uses the WEKA (Waikato Environment for Knowledge Analysis) tool to convert these string data into numerical matrices and applies three machine learning techniques: Naive Bayes (NB), J48, and support vector machine (SVM). The proposed system was tested on 100 movie reviews and tweets; SVM performed best among the classifiers, with an accuracy of 68% for movie reviews and 82% for tweets. The results are very promising, and the system can be used in emerging applications such as SA of product reviews and social media analysis. Additionally, the proposed system can be used for other cultural and social benefits, such as predicting or countering riots.

SENTIMENT ANALYSIS OF ELECTRIC CARS USING RECURRENT NEURAL NETWORK METHOD IN INDONESIAN TWEETS

Kursor ◽

10.21107/kursor.v10i4.233 ◽

2020 ◽

Vol 10 (4) ◽

Author(s):

Felisia Handayani ◽

Metty Mustikasari

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Sentiment Analysis ◽

Recurrent Neural Network ◽

Confusion Matrix ◽

Data Representation ◽

Sequential Data ◽

Communication Tool ◽

Electric Cars

Sentiment analysis is the computational study of opinions that people express in text about a particular topic. Twitter is the most popular communication tool among Internet users today for expressing opinions. Deep learning allows computers to learn from experience and understand the world in terms of a hierarchy of concepts. It aims to replace manual work with learning, and it comprises a set of algorithms that focus on learning data representations. The recurrent neural network (RNN) is a machine learning method that belongs to deep learning because the data is processed through multiple layers. An RNN can also recall earlier input through its internal memory, so it is suitable for machine learning problems involving sequential data. This study aims to test models built from tweets with positive, negative, and neutral sentiment in order to determine the accuracy of the models. The models were built using an RNN and applied to classify the sentiment of Indonesian-language tweets. In the experiments, the best test results for the RNN method, measured with a confusion matrix, were a precision of 0.618, a recall of 0.507, and an accuracy of 0.722 on 3,000 data items with an 80:20 ratio of training data to testing data.
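The evaluation setup above (an 80:20 split of 3,000 items, then precision and recall read from a confusion matrix) can be sketched as follows. The function names, the random seed, and the 3x3 matrix are assumptions made for illustration; the matrix is invented and does not reproduce the paper's reported figures.

```python
import random

def train_test_split(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of items and split it by train_ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def precision_recall(matrix, k):
    """Precision and recall for class k; rows are true classes, columns are predictions."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)   # column sum: everything predicted as k
    actual_k = sum(matrix[k])                     # row sum: everything that truly is k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    return precision, recall

# 3,000 items as in the abstract; the items themselves are placeholders.
train, test = train_test_split(range(3000))
print(len(train), len(test))

# Invented 3x3 matrix (positive, negative, neutral); not the paper's results.
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 3, 7]]
p, r = precision_recall(toy, 0)
print(p, r)
```

With three classes, a single precision or recall figure such as those reported above is usually an average of the per-class values; the abstract does not say which averaging was used.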

A Novel Sentiment Analysis for Amazon Data with TSA based Feature Selection

Scalable Computing Practice and Experience ◽

10.12694/scpe.v22i1.1839 ◽

2021 ◽

Vol 22 (1) ◽

pp. 53-66

Author(s):

D. Anand Joseph Daniel ◽

M. Janaki Meena

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

User Satisfaction ◽

Performance Metrics ◽

Computation Time ◽

Feature Reduction ◽

Training Data ◽

Product Reviews ◽

Online Product Reviews

Sentiment analysis of online product reviews has become a mainstream way for businesses on e-commerce platforms to promote their products and improve user satisfaction. Hence, an automatic sentiment analyser is needed to identify the sentiment polarity of online product reviews. Traditional lexicon-based approaches to sentiment analysis suffer from several accuracy issues, while machine learning techniques require labelled training data. This paper introduces a hybrid sentiment analysis framework to bridge the gap between machine learning and lexicon-based approaches. A novel tunicate swarm algorithm (TSA) based feature reduction is integrated with the proposed hybrid method to solve the scalability issue that arises from a large feature set. It reduces the feature set size to 43% without changing the accuracy (93%). In addition, it improves scalability, reduces computation time, and enhances the overall performance of the proposed framework. Experimental analysis shows that TSA outperforms existing feature selection techniques such as particle swarm optimization and the genetic algorithm. The proposed approach is evaluated with performance metrics such as recall, precision, F1-score, feature size, and computation time.

A machine learning framework to determine geolocations from metagenomic profiling

Biology Direct ◽

10.1186/s13062-020-00278-z ◽

2020 ◽

Vol 15 (1) ◽

Cited By ~ 1

Author(s):

Lihong Huang ◽

Canqiang Xu ◽

Wenxian Yang ◽

Rongshan Yu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Geographic Origin ◽

Training Data ◽

Metagenomic Data ◽

Training Dataset ◽

Kriging Interpolation ◽

Learning Framework ◽

Testing Data ◽

Microbial Samples

Background: Studies on metagenomic data of environmental microbial samples have found that microbial communities seem to be geolocation-specific, and that the microbiome abundance profile can be a differentiating feature for identifying samples’ geolocations. In this paper, we present a machine learning framework to determine geolocations from metagenomic profiling of microbial samples.

Results: Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets. The testing data consist of samples from cities that do not occur in the training data. To predict the “mystery” cities, which were not sampled before, for the testing data, we first defined biological coordinates for sampled cities based on the similarity of their microbial samples. Then we performed an affine transform on the map such that the distance between cities measures their biological difference rather than their geographical distance. After that, we derived the probabilities that a given testing sample comes from unsampled cities, based on its predicted probabilities for sampled cities, using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples.

Conclusion: Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.
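The evaluation protocol in the Results (accuracy averaged over 100 random training/validation splits) can be sketched as below. To keep the example dependency-free, a nearest-centroid classifier stands in for the paper's logistic regression; the 20% validation fraction, the seed, and the synthetic two-city data are all assumptions, since the abstract does not give them.

```python
import random

def nearest_centroid_fit(X, y):
    """Return one mean feature vector per class label (stand-in for logistic regression)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def nearest_centroid_predict(centroids, x):
    """Pick the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def mean_accuracy(X, y, n_splits=100, val_fraction=0.2, seed=0):
    """Average validation accuracy over n_splits random train/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_val = max(1, round(len(X) * val_fraction))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        val, train = idx[:n_val], idx[n_val:]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i] for i in val)
        scores.append(hits / n_val)
    return sum(scores) / len(scores)

# Two well-separated synthetic "cities"; invented data, not MetaSUB profiles.
X = [[0.01 * i, 1.0] for i in range(10)] + [[5.0 + 0.01 * i, 0.0] for i in range(10)]
y = ["city_a"] * 10 + ["city_b"] * 10
score = mean_accuracy(X, y)
print(score)
```

Averaging over many random splits reduces the dependence of the reported accuracy on any single lucky or unlucky partition of the training data.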

Reviewing Sentiment Analysis at the Shallow End

Transactions on Machine Learning and Artificial Intelligence ◽

10.14738/tmlai.84.8274 ◽

2020 ◽

Vol 8 (4) ◽

pp. 47-62

Author(s):

Francisca Oladipo ◽

Ogunsanya, F. B ◽

Musa, A. E. ◽

Ogbuju, E. E ◽

Ariwa, E.

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Information Exchange ◽

Training Data ◽

Data Set ◽

The Social ◽

Machine Learning Approach ◽

Media Space ◽

Social Media Platforms

The social media space has evolved into a large labyrinth of information exchange platforms, and with the growing adoption of different social media platforms there has been an increasing wave of interest in sentiment analysis as a paradigm for mining and analysing users’ opinions and sentiments based on their posts. In this paper, we present a review of contextual sentiment analysis on social media entries, with a specific focus on Twitter. Sentiment analysis consists of two broad approaches: the machine learning approach, which uses classification techniques to classify text and is further categorized into supervised and unsupervised learning; and the lexicon-based approach, which uses a dictionary and, unlike the machine learning approach, needs no training or test data set.

Graduation Prediction System On Students Using C4.5 Algorithm

Matrik Jurnal Manajemen Teknik Informatika dan Rekayasa Komputer ◽

10.30812/matrik.v19i2.685 ◽

2020 ◽

Vol 19 (2) ◽

pp. 358-365

Author(s):

Donny Kurniawan ◽

Anthony Anggrawan ◽

Hairani Hairani

Keyword(s):

Student Development ◽

Confusion Matrix ◽

Analysis Data ◽

Training Data ◽

Problem Analysis ◽

Large Numbers ◽

Testing Data ◽

C4.5 Algorithm ◽

Collection Data ◽

Student Graduation

At Bumigora University College, the number of students entering and the number of students completing their studies are out of balance. Students enter in large numbers, but the number who graduate on time is below the specified standard. As a result, there is a large accumulation of students in each graduation period. One solution to this problem is a data mining based system that monitors student development and predicts graduation using the C4.5 algorithm. The stages of this research were problem analysis, data collection, data requirement analysis, data design, coding, and testing. The result of this study is an implementation of the C4.5 algorithm for predicting whether a student will graduate on time. The data used are those of students who graduated from 2010 to 2012. The accuracy measured with the confusion matrix is 93.103%, using 163 training data and 29 testing data, or 85% training data and 15% testing data. Based on the research and testing carried out, the C4.5 algorithm is very suitable for student graduation prediction.
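The figures in this abstract can be cross-checked with a few lines of arithmetic. The counts 163 and 29 come from the abstract; the count of 27 correct predictions is an inference (it is the value that reproduces 93.103% on 29 test records) and is not stated by the authors.

```python
train_n, test_n = 163, 29          # counts stated in the abstract
total = train_n + test_n           # 192 student records in all

# Shares of the data set, which the abstract rounds to 85% and 15%.
train_share = round(100 * train_n / total, 1)
test_share = round(100 * test_n / total, 1)

# Inference, not stated in the abstract: 27 correct out of 29 gives 93.103%.
correct = 27
accuracy_pct = round(100 * correct / test_n, 3)

print(total, train_share, test_share, accuracy_pct)
```

With only 29 test records, each misclassified student changes the accuracy by about 3.4 percentage points, which is worth keeping in mind when reading the three-decimal figure.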

DCT Untuk Ekstraksi Fitur Berbasis GLCM Pada Identifikasi Batik Menggunakan K-NN

Jambura Journal of Electrical and Electronics Engineering ◽

10.37905/jjeee.v3i1.7113 ◽

2021 ◽

Vol 3 (1) ◽

pp. 1-6

Author(s):

Zulfrianto Yusrin Lamasigi

Keyword(s):

Feature Extraction ◽

Discrete Cosine Transform ◽

Confusion Matrix ◽

Training Data ◽

Gray Level ◽

K Nearest Neighbor ◽

Cosine Transform ◽

A Value ◽

Testing Data ◽

Occurrence Matrix

Batik is a specially made cloth that is unique because its motifs are based on cultural elements from the area where the batik was made. Each batik motif and color is different, so it is difficult to identify the origin of a batik motif. This study aims to improve the feature extraction results in the identification of batik motifs. The method used is the Discrete Cosine Transform (DCT), which aims to improve the Gray Level Co-Occurrence Matrix (GLCM) feature extraction in order to obtain better accuracy in identifying batik motifs, while K-Nearest Neighbor (KNN) is used to determine the closeness between training and testing batik images based on the extracted feature values. Four experiments were carried out, one for each of the angles 0°, 45°, 90°, and 135°, at values of k = 1, 3, 5, 7, and 9. The accuracy of the KNN classification is calculated with a confusion matrix. In trials using 602 training images and 344 testing images across all classes, at angles of 0°, 45°, 90°, and 135° and values of k = 1, 3, 5, 7, and 9, the highest accuracy obtained by DCT-GLCM was 84.88% at an angle of 135° with k = 3, and the lowest was 41.86% at an angle of 0° with k = 7 and 9. Using GLCM alone, the highest accuracy was 77.90% at an angle of 135° with k = 1, and the lowest was 40.69% at an angle of 90° with k = 7. The trial results show that the DCT works well to improve the results of GLCM feature extraction, as evidenced by the average accuracy obtained.
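The four angles in these experiments correspond to the four neighbor directions used when building a gray level co-occurrence matrix. The sketch below computes a non-symmetric GLCM at pixel distance 1 and one texture feature (contrast) on a tiny invented image; the distance, the feature choice, and the image are assumptions, and the DCT step that the study adds is not shown.

```python
# (row, column) offsets for a pixel distance of 1 at each angle.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(image, levels, angle):
    """Count how often gray level i has gray level j as its neighbor at the given angle."""
    dr, dc = OFFSETS[angle]
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:   # skip neighbors outside the image
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix

def contrast(matrix):
    """GLCM contrast: (i - j)^2 weighted by the normalized co-occurrence counts."""
    total = sum(sum(row) for row in matrix)
    if not total:
        return 0.0
    return sum((i - j) ** 2 * v for i, row in enumerate(matrix) for j, v in enumerate(row)) / total

# Tiny 4x4 image with 4 gray levels, invented for illustration.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]

glcm_0 = glcm(image, levels=4, angle=0)
print(glcm_0)
print({angle: contrast(glcm(image, 4, angle)) for angle in OFFSETS})
```

Features such as contrast, computed per angle, form the vector that a K-Nearest Neighbor classifier would then compare between training and testing images.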

Bi-LSTM Sentiment Classifier for Climate Change Issues in South Korea

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1056.0782s619 ◽

2019 ◽

Vol 8 (2S6) ◽

pp. 295-299

Keyword(s):

Climate Change ◽

Machine Learning ◽

Big Data ◽

South Korea ◽

Sentiment Analysis ◽

Training Data ◽

Learning Models ◽

Wide Range ◽

Machine Learning Models ◽

Big Data Technology

Sentiment analysis of SNS data can reveal the thoughts of a wide variety of people. Such analysis can therefore predict social problems and more accurately identify their complex causes. In addition, big data technology can identify SNS information as it is generated in real time, allowing a wide range of people’s opinions to be understood without delay, so it can supplement traditional opinion surveys. The incumbent government mainly uses SNS to promote its policies. However, measures are needed to actively reflect SNS opinion in the process of carrying out policy. This paper therefore developed a sentiment classifier that can identify public feelings about climate change on SNS. To that end, based on a dictionary formulated on the theme of climate change, we collected climate change SNS data for learning and tagged it with seven sentiments. Using this training data, sentiment classifier models were developed with machine learning models. The analysis showed that the Bi-LSTM model performed better than the shallow models. It achieved the highest accuracy (85.10%) across the seven sentiment classes, outperforming traditional machine learning (Naive Bayes and SVM) by approximately 34.53%p and 7.14%p, respectively. These findings substantiate the applicability of the proposed Bi-LSTM-based sentiment classifier to the analysis of sentiments relevant to diverse climate change issues.
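The gaps above are given in percentage points ("%p"), so the baseline accuracies follow by subtraction. The sketch below makes that arithmetic explicit; it assumes that "respectively" pairs Naive Bayes with 34.53 and SVM with 7.14, and the resulting baseline figures are derived here, not quoted from the paper.

```python
bi_lstm_acc = 85.10                               # accuracy stated in the abstract, in percent
gaps_pp = {"Naive Bayes": 34.53, "SVM": 7.14}     # stated gaps, in percentage points

# Derived values (an inference from the abstract, not figures it reports).
baseline_acc = {name: round(bi_lstm_acc - gap, 2) for name, gap in gaps_pp.items()}
print(baseline_acc)
```

A percentage-point gap is an absolute difference between two accuracies, which is why it can be subtracted directly; a relative improvement of 34.53% would imply a different baseline.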
