Improving of Imbalanced Data in Multiclass Classification for Sentiment Analysis using Supervised Term Weighting

Author(s):  
Jantima Polpinij ◽  
Khanista Namee
Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 317 ◽  
Author(s):  
Vincenzo Dentamaro ◽  
Donato Impedovo ◽  
Giuseppe Pirlo

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. This study presents a new oversampling technique, LICIC, as a valuable instrument for countering both class imbalance and the well-known “curse of dimensionality” problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. It is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
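To make the idea of oversampling concrete, the following is a minimal sketch of SMOTE-style interpolation, one of the baseline methods named in the abstract. It is not LICIC, whose details are not given here. The function name `smote_like_oversample`, the parameter `k`, and the toy data are assumptions chosen for illustration only.

```python
import random
from math import dist


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Illustrative SMOTE-style sketch; not the LICIC method from the abstract.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` within the minority class, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-dimensional minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new instances stay inside the region the minority class already occupies. Random Oversampling, by contrast, would simply duplicate existing samples.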


2021 ◽  
Vol 1 (1) ◽  
pp. 1-12
Author(s):  
Aytuğ Onan ◽  

With the advancement of information and communication technology, social networking and microblogging sites have become a vital source of information. Individuals can express their opinions, grievances, feelings, and attitudes about a variety of topics. Through microblogging platforms, they can express their opinions on current events and products. Sentiment analysis is a significant area of research in natural language processing because it aims to define the orientation of the sentiment contained in source materials. Twitter is one of the most popular microblogging sites on the internet, with millions of users daily publishing over one hundred million text messages (referred to as tweets). Choosing an appropriate term representation scheme for short text messages is critical. Term weighting schemes are critical representation schemes for text documents in the vector space model. We present a comprehensive analysis of Turkish sentiment analysis using nine supervised and unsupervised term weighting schemes in this paper. The predictive efficiency of term weighting schemes is investigated using four supervised learning algorithms (Naive Bayes, support vector machines, the k-nearest neighbor algorithm, and logistic regression) and three ensemble learning methods (AdaBoost, Bagging, and Random Subspace). The empirical evidence suggests that supervised term weighting models can outperform unsupervised term weighting models.
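The difference between unsupervised and supervised term weighting can be illustrated with a small sketch. It contrasts inverse document frequency (which ignores class labels) with relevance frequency, a supervised weight commonly written as rf = log2(2 + a / max(1, c)), where a and c count the positive-class and other-class documents containing the term. Whether these two schemes are among the nine evaluated in the paper is not stated in the abstract; they are assumed here purely as examples, and the toy documents are hypothetical.

```python
import math


def idf(term, docs):
    """Unsupervised weight: uses document frequency only, not class labels."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def relevance_frequency(term, docs, positive_label):
    """Supervised weight: rf = log2(2 + a / max(1, c)).

    a = documents of the positive class containing the term
    c = documents of any other class containing the term
    """
    a = sum(1 for tokens, label in docs if term in tokens and label == positive_label)
    c = sum(1 for tokens, label in docs if term in tokens and label != positive_label)
    return math.log2(2 + a / max(1, c))


# Hypothetical labelled corpus: (tokens, sentiment label).
docs = [
    (["great", "phone"], "pos"),
    (["great", "battery"], "pos"),
    (["bad", "phone"], "neg"),
    (["bad", "battery"], "neg"),
]

idf_great = idf("great", docs)
idf_phone = idf("phone", docs)
rf_great = relevance_frequency("great", docs, "pos")
rf_phone = relevance_frequency("phone", docs, "pos")
```

In this toy corpus "great" and "phone" each occur in two of four documents, so their idf values are identical. The supervised weight separates them, because "great" occurs only in positive documents while "phone" is spread evenly across both classes.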


Author(s):  
Syaifulloh Amien Pandega Perdana ◽  
Teguh Bharata Aji ◽  
Ridi Ferdiana

Customer reviews are opinions about the quality of goods or services as experienced by consumers. They contain information that is useful to both consumers and providers of goods or services. The large volume of customer reviews available on websites calls for a framework that extracts sentiment automatically. A single customer review often covers many aspects, so Aspect Based Sentiment Analysis (ABSA) must be used to determine the polarity of each aspect. One important task in ABSA is Aspect Category Detection. Machine learning methods for Aspect Category Detection have been widely applied in English-language domains, but only rarely in Indonesian-language domains. This paper compares the performance of three machine learning algorithms, namely Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), on Indonesian-language customer reviews, using Term Frequency–Inverse Document Frequency (TF-IDF) as the term weighting. The results show that RF outperforms NB and SVM in three different domains, namely restaurant, hotel, and e-commerce, with f1-scores of 84.3%, 85.7%, and 89.3%, respectively.
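As a rough illustration of TF-IDF term weighting, the sketch below computes weights with the textbook definition: term frequency normalised by document length, multiplied by log(N / df). The paper's exact variant and preprocessing are not given in the abstract, and libraries often use smoothed variants, so the formula, the function name `tf_idf`, and the toy reviews are all assumptions.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    weight = (count / document length) * log(N / document frequency)
    """
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical tokenised reviews from three domains.
reviews = [
    ["food", "delicious", "service", "slow"],
    ["room", "clean", "service", "friendly"],
    ["delivery", "fast", "service", "good"],
]
weights = tf_idf(reviews)
```

Here "service" appears in every review, so its weight is zero, while domain-specific terms such as "food" or "room" receive positive weights. This is the property that makes TF-IDF useful as input to classifiers such as NB, SVM, or RF.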


Sentiment analysis, also known as opinion mining, of content on social networking sites is one of the hottest research topics today. Here, we use Twitter, one of the biggest web destinations for people to communicate with each other, to perform sentiment analysis and opinion mining on tweets extracted from various users. Users can post only brief text updates on Twitter, as it allows 140 characters per message. Hashtags help to search for tweets dealing with a specified subject. In previous research, binary classification usually relied on sentiment polarity (Positive, Negative, and Neutral). The advantage is that different meanings of the same word may carry different polarities, which can then be easily identified. In multiclass classification, many tweets of one class are classified as if they belonged to the others. The Neutral class showed the lowest precision in all the research conducted in this area. The set of tweets containing text and emoticon data will be classified into 13 classes. From each tweet, we extract different sets of features using a one-hot encoding algorithm and use machine learning algorithms to perform classification. The tweets will be divided into training and testing datasets. The training dataset will be pre-processed and classified using various artificial neural network algorithms, such as Recurrent Neural Networks and Convolutional Neural Networks. The same procedure will be followed for the text and emoticon data. The developed model will be tested using the testing dataset. Higher accuracy can be obtained using this multiclass classification of text and emoticons. Four key performance indicators will be used to evaluate the effectiveness of the approach.
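The one-hot encoding step mentioned in the abstract can be sketched as follows. The abstract does not describe the tokenisation or vocabulary construction, so whitespace splitting, the helper names `build_vocab` and `one_hot`, and the example tweets are assumptions made only for illustration.

```python
def build_vocab(tweets):
    """Assign each distinct token (word or emoticon) an index in order of first appearance."""
    vocab = {}
    for tweet in tweets:
        for token in tweet.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Hypothetical tweets mixing text and emoticons.
tweets = ["good day :)", "bad day :("]
vocab = build_vocab(tweets)

# Each tweet becomes a sequence of one-hot vectors, one per token.
encoded = [[one_hot(tok, vocab) for tok in tweet.split()] for tweet in tweets]
```

A sequence of such vectors is one possible input format for recurrent or convolutional networks. Text and emoticons share a single vocabulary in this sketch, which is a design choice of the example rather than something stated in the abstract.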

