Performance comparison of TF-IDF and Word2Vec models for emotion text classification

2021 ◽  
Vol 10 (5) ◽  
pp. 2780-2788
Author(s):  
Denis Eka Cahyani ◽  
Irene Patasik

Emotion is the human feeling that arises when communicating with other people or reacting to everyday events. Emotion classification is needed to recognize human emotions from text. This study compares the performance of the TF-IDF and Word2Vec models for representing features in emotion text classification. We use the support vector machine (SVM) and Multinomial Naïve Bayes (MNB) methods to classify emotional text in Commuter Line and Transjakarta tweet data. The emotion classification in this study has two steps. The first step classifies data as containing emotion or no emotion. The second step classifies the data that contain emotion into five types: happy, angry, sad, scared, and surprised. The study uses three scenarios: SVM with TF-IDF, SVM with Word2Vec, and MNB with TF-IDF. SVM with TF-IDF yields the highest accuracy in both the first and second classification steps, followed by MNB with TF-IDF and then SVM with Word2Vec. Evaluation using precision, recall, and F1-measure also shows that SVM with TF-IDF is the best method overall. This study shows that TF-IDF modeling performs better than Word2Vec modeling, and it improves classification performance compared to previous studies.
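To make the TF-IDF feature representation concrete, here is a minimal pure-Python sketch. It assumes one common textbook variant (raw term frequency multiplied by log(N/df)); the abstract does not state the exact weighting the study used, and the toy tweets and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw term frequency and idf = log(N / df), a common textbook
    variant; the study's actual weighting scheme is not stated.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy "tweets", already lowercased and tokenized.
docs = [
    ["train", "late", "again", "angry"],
    ["train", "on", "time", "happy"],
    ["bus", "late", "sad"],
]
vectors = tf_idf(docs)
```

Terms that appear in few documents (such as "angry") receive a higher weight than terms shared across documents (such as "train"), which is the property a downstream classifier like SVM or MNB relies on.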

2018 ◽  
Vol 21 (62) ◽  
pp. 1
Author(s):  
Jorge E. Camargo ◽  
Vladimir Vargas-Calderon ◽  
Nelson Vargas ◽  
Liliana Calderón-Benavides

With the purpose of classifying text based on its sentiment polarity (positive or negative), we proposed an extension of a 68,000-tweet corpus through the inclusion of word definitions from a dictionary of the Real Academia Española de la Lengua (RAE). A set of 28,000 combinations of 6 Word2Vec and support vector machine parameters was considered in order to evaluate how much the inclusion of the RAE's dictionary definitions would improve classification performance. We found that such a corpus extension significantly improves the classification accuracy. Therefore, we conclude that the inclusion of the RAE's dictionary increases the semantic relations learned by Word2Vec, allowing better classification accuracy.


Author(s):  
Desi Ramayanti

In digital business, managers commonly need to process text so that it can be used to support decision-making. The number of text documents containing ideas and opinions keeps growing, and it is challenging to understand them one by one. If the data are processed and correctly rendered using machine learning, however, they can quickly present a general overview of a particular case, organization, or object. Numerous studies have been carried out in this area; nevertheless, most of them concentrated on English text classification. Every language has various techniques or methods to classify text depending on the characteristics of its grammar. Classification results may differ among languages even when the same algorithm is used. Given the importance of text classification, the algorithms that can be applied include the support vector machine (SVM) and random forest (RF). Based on the background above, this research aims to determine the performance of the support vector machine and random forest algorithms in classifying Indonesian text. The SVM classifier with 10-fold cross-validation achieved the best accuracy, 0.9648, with a computation time of 40.118 seconds. The RF classifier with the parameter values 'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 10, 'min_samples_split': 3, 'criterion': 'entropy', 'max_features': 3, 'max_depth': None achieved an accuracy of 0.9561 with a computation time of 109.399 seconds.
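The 10-fold cross-validation mentioned above can be illustrated with a short index-splitting sketch. This is an assumption-laden illustration, not the paper's procedure: it uses contiguous, unshuffled, unstratified folds, and the helper name `k_fold_indices` and the sample count are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) index lists for k contiguous folds.

    Assumes no shuffling and no stratification; the abstract does not
    say how the folds were built.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical dataset of 25 documents split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

Each document lands in exactly one test fold, so the reported accuracy would be the average over the 10 held-out folds.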


Sensors ◽  
2020 ◽  
Vol 20 (6) ◽  
pp. 1704 ◽  
Author(s):  
Alghannai Aghnaiya ◽  
Yaser Dalveren ◽  
Ali Kara

Radio frequency fingerprinting (RFF) is one of the communication network’s security techniques based on the identification of the unique features of RF transient signals. However, extracting these features could be burdensome, due to the nonstationary nature of transient signals. This may then adversely affect the accuracy of the identification of devices. Recently, it has been shown that the use of variational mode decomposition (VMD) in extracting features from Bluetooth (BT) transient signals offers an efficient way to improve the classification accuracy. To do this, VMD has been used to decompose transient signals into a series of band-limited modes, and higher order statistical (HOS) features are extracted from reconstructed transient signals. In this study, the performance bounds of VMD in RFF implementation are scrutinized. Firstly, HOS features are extracted from the band-limited modes, and then from the reconstructed transient signals directly. Performance comparison due to both HOS feature sets is presented. Moreover, the lower SNR bound within which the VMD can achieve acceptable accuracy in the classification of BT devices is determined. The approach has been tested experimentally with BT devices by employing a Linear Support Vector Machine (LSVM) classifier. According to the classification results, a higher classification performance is achieved (~4% higher) at lower SNR levels (−5–5 dB) when HOS features are extracted from band-limited modes in the implementation of VMD in RFF of BT devices.
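As a rough illustration of higher order statistical features, the sketch below computes variance, skewness, and kurtosis for each mode of a signal. It assumes these are among the HOS features of interest (the abstract does not list them), it does not implement VMD itself, and the `modes` data and the function name `hos_features` are hypothetical.

```python
def hos_features(signal):
    """Return variance, skewness, and (non-excess) kurtosis of a sequence.

    Computed from population central moments; which HOS features the
    study extracted is an assumption here.
    """
    n = len(signal)
    mean = sum(signal) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    if m2 == 0:
        raise ValueError("constant signal: skewness and kurtosis are undefined")
    return {
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical band-limited modes, standing in for the output of VMD.
modes = [
    [-2.0, -1.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 1.0, 4.0],
]
features = [hos_features(mode) for mode in modes]
```

Extracting the features per mode, rather than from the reconstructed signal, yields one feature set per mode that can be concatenated into the classifier input.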


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0243907
Author(s):  
Kevin Teh ◽  
Paul Armitage ◽  
Solomon Tesfaye ◽  
Dinesh Selvarajah ◽  
Iain D. Wilkinson

One of the fundamental challenges when dealing with medical imaging datasets is class imbalance. Class imbalance occurs when the number of instances in the class of interest is relatively low compared to the rest of the data. This study aims to apply oversampling strategies in an attempt to balance the classes and improve classification performance. We evaluated four different classifiers, k-nearest neighbors (k-NN), support vector machine (SVM), multilayer perceptron (MLP) and decision trees (DT), with 73 oversampling strategies. In this work, we used imbalanced-learning oversampling techniques to improve classification in datasets that are distinctively sparser and clustered. This work reports the best oversampling and classifier combinations and concludes that using oversampling methods always outperforms using no oversampling strategy, hence improving the classification results.
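The simplest form of oversampling, random duplication of minority-class samples, can be sketched as follows. This is only an illustrative baseline and not necessarily one of the 73 strategies the study evaluated; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match
    the size of the largest class. Returns new (samples, labels) lists."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Hypothetical imbalanced data: four samples of class 0, one of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

Oversampling should be applied to the training split only; duplicating samples before splitting lets copies of the same instance leak into the test set and inflates the measured performance.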


Author(s):  
Anang Anggono Lutfi ◽  
Adhistya Erna Permanasari ◽  
Silmi Fauziati

In the version of this article initially published, there were some errors in Section III, Methods and Section VI, Conclusions. In Preprocessing of Methods, there is a sentence “The informal words may be in the form of slang words or abbreviations that are often used in daily life like cp at (from “cepat” or fast), blum (from “belum” or not yet), and gak (from “tidak” or no).”. The correct sentence is “The informal words may be in the form of slang words or abbreviations that are often used in daily life like cpat (from “cepat” or fast), blum (from “belum” or not yet), and gak (from “tidak” or no).”. In Text Classification of Methods, there is a sentence “Where P(B|A) is the probability of B appearance when A is known? The value P(A|B) is the probability of an appearance if B is known. P(A) is the probability of an appearance, while P(B) is the probability of B appearance.”. The correct sentence is “Where P(B|A) is the probability of the appearance of B when A is known. The value of P(A|B) is the probability of the appearance of A if B is known. P(A) is the probability of the appearance of A, while P(B) is the probability of the appearance of B.”. In Conclusions, a sentence “The accuracy reaches 93.42%; using 25% features with highest TF-IDF” should be changed to “The accuracy reaches 93.65%; using 25% features with highest TF-IDF” based on the results in Fig. 3. These errors have been corrected in the PDF versions of the article.
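The four quantities defined in the corrected sentence are the terms of Bayes' theorem. The erratum does not reproduce the equation itself, so the standard form is given here for reference, on the assumption that it is the formula the corrected sentence accompanies:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

In text classification, A commonly stands for a class and B for the observed features of a document, so P(A|B) is the probability of the class given the document.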

