scholarly journals Evolutionary Optimization of Ensemble Learning to Determine Sentiment Polarity in an Unbalanced Multiclass Corpus

Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 1020
Author(s):  
Consuelo V. García-Mendoza ◽  
Omar J. Gambino ◽  
Miguel G. Villarreal-Cervantes ◽  
Hiram Calvo

Sentiment polarity classification in social media is a very important task, as it enables gathering trends on particular subjects given a set of opinions. Currently, a great advance has been made by using deep learning techniques, such as word embeddings, recurrent neural networks, and encoders, such as BERT. Unfortunately, these techniques require large amounts of data, which, in some cases, is not available. In order to model this situation, challenges, such as the Spanish TASS organized by the Spanish Society for Natural Language Processing (SEPLN), have been proposed, which pose particular difficulties: First, an unwieldy balance in the training and the test set, being this latter more than eight times the size of the training set. Another difficulty is the marked unbalance in the distribution of classes, which is also different between both sets. Finally, there are four different labels, which create the need to adapt current classifications methods for multiclass handling. Traditional machine learning methods, such as Naïve Bayes, Logistic Regression, and Support Vector Machines, achieve modest performance in these conditions, but used as an ensemble it is possible to attain competitive execution. Several strategies to build classifier ensembles have been proposed; this paper proposes estimating an optimal weighting scheme using a Differential Evolution algorithm focused on dealing with particular issues that multiclass classification and unbalanced corpora pose. The ensemble with the proposed optimized weighting scheme is able to improve the classification results on the full test set of the TASS challenge (General corpus), achieving state of the art performance when compared with other works on this task, which make no use of NLP techniques.

Author(s):  
Alaa Altheneyan ◽  
Mohamed El Bachir Menai

Paraphrase identification is a natural language processing (NLP) problem that involves the determination of whether two text segments have the same meaning. Various NLP applications rely on a solution to this problem, including automatic plagiarism detection, text summarization, machine translation (MT), and question answering. The methods for identifying paraphrases found in the literature fall into two main classes: similarity-based methods and classification methods. This paper presents a critical study and an evaluation of existing methods for paraphrase identification and its application to automatic plagiarism detection. It presents the classes of paraphrase phenomena, the main methods, and the sets of features used by each particular method. All the methods and features used are discussed and enumerated in a table for easy comparison. Their performances on benchmark corpora are also discussed and compared via tables. Automatic plagiarism detection is presented as an application of paraphrase identification. The performances on benchmark corpora of existing plagiarism detection systems able to detect paraphrases are compared and discussed. The main outcome of this study is the identification of word overlap, structural representations, and MT measures as feature subsets that lead to the best performance results for support vector machines in both paraphrase identification and plagiarism detection on corpora. The performance results achieved by deep learning techniques highlight that these techniques are the most promising research direction in this field.


2021 ◽  
Vol 21 ◽  
pp. 103-114
Author(s):  
Azza Abugharsa

Over the recent decades, there has been a significant increase and development of resources for Arabic natural language processing. This includes the task of exploring Arabic Language Sentiment Analysis (ALSA) from Arabic utterances in both Modern Standard Arabic (MSA) and different Arabic dialects. This study focuses on detecting sentiment in poems written in Misurata Arabic sub-dialect spoken in Misurata, Libya. The tools used to detect sentiment from the dataset are Sklearn as well as Mazajak sentiment tool1. Logistic Regression, Random Forest, Naive Bayes (NB), and Support Vector Machines (SVM) classifiers are used with Sklearn, while the Convolutional Neural Network (CNN) is implemented with Mazajak. The results show that the traditional classifiers score a higher level of accuracy as compared to Mazajak which is built on an algorithm that includes deep learning techniques. More research is suggested to analyze Arabic sub-dialect poetry in order to investigate the aspects that contribute to sentiments in these multi-line texts; for example, the use of figurative language such as metaphors.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Tomoaki Mameno ◽  
Masahiro Wada ◽  
Kazunori Nozaki ◽  
Toshihito Takahashi ◽  
Yoshitaka Tsujioka ◽  
...  

AbstractThe purpose of this retrospective cohort study was to create a model for predicting the onset of peri-implantitis by using machine learning methods and to clarify interactions between risk indicators. This study evaluated 254 implants, 127 with and 127 without peri-implantitis, from among 1408 implants with at least 4 years in function. Demographic data and parameters known to be risk factors for the development of peri-implantitis were analyzed with three models: logistic regression, support vector machines, and random forests (RF). As the results, RF had the highest performance in predicting the onset of peri-implantitis (AUC: 0.71, accuracy: 0.70, precision: 0.72, recall: 0.66, and f1-score: 0.69). The factor that had the most influence on prediction was implant functional time, followed by oral hygiene. In addition, PCR of more than 50% to 60%, smoking more than 3 cigarettes/day, KMW less than 2 mm, and the presence of less than two occlusal supports tended to be associated with an increased risk of peri-implantitis. Moreover, these risk indicators were not independent and had complex effects on each other. The results of this study suggest that peri-implantitis onset was predicted in 70% of cases, by RF which allows consideration of nonlinear relational data with complex interactions.


2014 ◽  
Vol 28 (2) ◽  
pp. 3-28 ◽  
Author(s):  
Hal R. Varian

Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships. In this essay, I will describe a few of these tools for manipulating and analyzing big data. I believe that these methods have a lot to offer and should be more widely known and used by economists.


2018 ◽  
Vol 7 (2.8) ◽  
pp. 684 ◽  
Author(s):  
V V. Ramalingam ◽  
Ayantan Dandapath ◽  
M Karthik Raja

Heart related diseases or Cardiovascular Diseases (CVDs) are the main reason for a huge number of death in the world over the last few decades and has emerged as the most life-threatening disease, not only in India but in the whole world. So, there is a need of reliable, accurate and feasible system to diagnose such diseases in time for proper treatment. Machine Learning algorithms and techniques have been applied to various medical datasets to automate the analysis of large and complex data. Many researchers, in recent times, have been using several machine learning techniques to help the health care industry and the professionals in the diagnosis of heart related diseases. This paper presents a survey of various models based on such algorithms and techniques andanalyze their performance. Models based on supervised learning algorithms such as Support Vector Machines (SVM), K-Nearest Neighbour (KNN), NaïveBayes, Decision Trees (DT), Random Forest (RF) and ensemble models are found very popular among the researchers.


Author(s):  
Hesham M. Al-Ammal

Detection of anomalies in a given data set is a vital step in several applications in cybersecurity; including intrusion detection, fraud, and social network analysis. Many of these techniques detect anomalies by examining graph-based data. Analyzing graphs makes it possible to capture relationships, communities, as well as anomalies. The advantage of using graphs is that many real-life situations can be easily modeled by a graph that captures their structure and inter-dependencies. Although anomaly detection in graphs dates back to the 1990s, recent advances in research utilized machine learning methods for anomaly detection over graphs. This chapter will concentrate on static graphs (both labeled and unlabeled), and the chapter summarizes some of these recent studies in machine learning for anomaly detection in graphs. This includes methods such as support vector machines, neural networks, generative neural networks, and deep learning methods. The chapter will reflect the success and challenges of using these methods in the context of graph-based anomaly detection.


2020 ◽  
Vol 24 (5) ◽  
pp. 1141-1160
Author(s):  
Tomás Alegre Sepúlveda ◽  
Brian Keith Norambuena

In this paper, we apply sentiment analysis methods in the context of the first round of the 2017 Chilean elections. The purpose of this work is to estimate the voting intention associated with each candidate in order to contrast this with the results from classical methods (e.g., polls and surveys). The data are collected from Twitter, because of its high usage in Chile and in the sentiment analysis literature. We obtained tweets associated with the three main candidates: Sebastián Piñera (SP), Alejandro Guillier (AG) and Beatriz Sánchez (BS). For each candidate, we estimated the voting intention and compared it to the traditional methods. To do this, we first acquired the data and labeled the tweets as positive or negative. Afterward, we built a model using machine learning techniques. The classification model had an accuracy of 76.45% using support vector machines, which yielded the best model for our case. Finally, we use a formula to estimate the voting intention from the number of positive and negative tweets for each candidate. For the last period, we obtained a voting intention of 35.84% for SP, compared to a range of 34–44% according to traditional polls and 36% in the actual elections. For AG we obtained an estimate of 37%, compared with a range of 15.40% to 30.00% for traditional polls and 20.27% in the elections. For BS we obtained an estimate of 27.77%, compared with the range of 8.50% to 11.00% given by traditional polls and an actual result of 22.70% in the elections. These results are promising, in some cases providing an estimate closer to reality than traditional polls. Some differences can be explained due to the fact that some candidates have been omitted, even though they held a significant number of votes.


2020 ◽  
Vol 17 (8) ◽  
pp. 3598-3604
Author(s):  
M. S. Roobini ◽  
M. Lakshmi

Alzheimer’s Disease (AD) is a standout amongst the most familiar types of memory loss influencing a huge number of senior individuals around the world which is the main source of dementia and memory misfortune. AD causes shrinkage in hippocampus and cerebral cortex and it grows the ventricles in the mind Enhancing home and network based composed consideration is basic to alleviating Alzheimer’s impacts on people and families and to decreasing mounting medicinal services costs. Distinguishing early morphological changes in the mind and making early determination are vital for Alzheimer’s ailment (AD). A few machine learning techniques, for example, Support vector machines have been utilized and a portion of these strategies have been appeared to be extremely compelling in diagnosing AD from neuroimages, some of the time significantly more viable than human radiologists. MRI uncover the data of AD however decay districts are diverse for various individuals which makes the finding somewhat trickier. By utilizing Convolutional Neural Networks, the issue can be settled with insignificant mistake rate. This paper proposes a profound Convolutional Neural Network (CNN) for Alzheimer’s Disease finding utilizing mind MRI information examination. The calculation was prepared and tried utilizing the MRI information from Alzheimer’s Disease Neuroimaging Initiative.


2021 ◽  
Vol 23 (08) ◽  
pp. 532-537
Author(s):  
Cherlakola Abhinav Reddy ◽  
◽  
Sai Nitesh Gadiraju ◽  
Dr. Samala Nagaraj ◽  
◽  
...  

Online media has progressively obtained integral to the route billions of individuals experience news and occasions, frequently bypassing writers—the conventional guardians of breaking news. Occasions,in reality, make a relating spike of posts (tweets) on Twitter. This projects a great deal of significance on the validity of data found via online media stages like Twitter. We have utilized different managed learning techniques like Naïve Bayes, Decision Trees, and Support Vector Machines on the information to separate tweets among genuine and counterfeit news. For our AI models, we have utilized tweet and client highlights as our indicators. We accomplished a precision of 88% utilizing the Random Forest classifier and 88% utilizing the Decision tree. Notwithstanding, we accept that breaking down client records would build the accuracy of our models.


2019 ◽  
Vol 2019 ◽  
pp. 1-7 ◽  
Author(s):  
Jorge D. Mello-Román ◽  
Julio C. Mello-Román ◽  
Santiago Gómez-Guerrero ◽  
Miguel García-Torres

Early diagnosis of dengue continues to be a concern for public health in countries with a high incidence of this disease. In this work, we compared two machine learning techniques: artificial neural networks (ANN) and support vector machines (SVM) as assistance tools for medical diagnosis. The performance of classification models was evaluated in a real dataset of patients with a previous diagnosis of dengue extracted from the public health system of Paraguay during the period 2012–2016. The ANN multilayer perceptron achieved better results with an average of 96% accuracy, 96% sensitivity, and 97% specificity, with low variation in thirty different partitions of the dataset. In comparison, SVM polynomial obtained results above 90% for accuracy, sensitivity, and specificity.


Sign in / Sign up

Export Citation Format

Share Document