Authorship Attribution of Social Media and Literary Russian-Language Texts Using Machine Learning Methods and Feature Selection

Authorship attribution is one of the important fields of natural language processing (NLP). Its popularity is due to the relevance of implementing solutions for information security, as well as copyright protection, various linguistic studies, in particular, researches of social networks. The article is a continuation of the series of studies aimed at the identification of the Russian-language text’s author and reducing the required text volume. The focus of the study was aimed at the attribution of textual data created as a product of human online activity. The effectiveness of the models was evaluated on the two Russian-language datasets: literary texts and short comments from users of social networks. Classical machine learning (ML) algorithms, popular neural networks (NN) architectures, and their hybrids, including convolutional neural network (CNN), networks with long short-term memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and fastText, that have not been used in previous studies, were applied to solve the problem. A particular experiment was devoted to the selection of informative features using genetic algorithms (GA) and evaluation of the classifier trained on the optimal feature space. Using fastText or a combination of support vector machine (SVM) with GA reduced the time costs by half in comparison with deep NNs with comparable accuracy. The average accuracy for literary texts was 80.4% using SVM combined with GA, 82.3% using deep NNs, and 82.1% using fastText. For social media comments, results were 66.3%, 73.2%, and 68.1%, respectively.

Download Full-text

Applying Machine Learning to Identify Anti-Vaccination Tweets during the COVID-19 Pandemic

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18084069 ◽

2021 ◽

Vol 18 (8) ◽

pp. 4069

Author(s):

Quyen G. To ◽

Kien G. To ◽

Van-Anh N. Huynh ◽

Nhung TQ Nguyen ◽

Diep TN Ngo ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Language Processing ◽

Short Term Memory ◽

Model Performance ◽

Vaccine Hesitancy ◽

Support Vector ◽

Short Term ◽

Long Short Term Memory ◽

Use Of Social Media

Anti-vaccination attitudes have been an issue since the development of the first vaccines. The increasing use of social media as a source of health information may contribute to vaccine hesitancy due to anti-vaccination content widely available on social media, including Twitter. Being able to identify anti-vaccination tweets could provide useful information for formulating strategies to reduce anti-vaccination sentiments among different groups. This study aims to evaluate the performance of different natural language processing models to identify anti-vaccination tweets that were published during the COVID-19 pandemic. We compared the performance of the bidirectional encoder representations from transformers (BERT) and the bidirectional long short-term memory networks with pre-trained GLoVe embeddings (Bi-LSTM) with classic machine learning methods including support vector machine (SVM) and naïve Bayes (NB). The results show that performance on the test set of the BERT model was: accuracy = 91.6%, precision = 93.4%, recall = 97.6%, F1 score = 95.5%, and AUC = 84.7%. Bi-LSTM model performance showed: accuracy = 89.8%, precision = 44.0%, recall = 47.2%, F1 score = 45.5%, and AUC = 85.8%. SVM with linear kernel performed at: accuracy = 92.3%, Precision = 19.5%, Recall = 78.6%, F1 score = 31.2%, and AUC = 85.6%. Complement NB demonstrated: accuracy = 88.8%, precision = 23.0%, recall = 32.8%, F1 score = 27.1%, and AUC = 62.7%. In conclusion, the BERT models outperformed the Bi-LSTM, SVM, and NB models in this task. Moreover, the BERT model achieved excellent performance and can be used to identify anti-vaccination tweets in future studies.

Download Full-text

Quantitative Methods for Analyzing Intimate Partner Violence in Microblogs: Observational Study

Journal of Medical Internet Research ◽

10.2196/15347 ◽

2020 ◽

Vol 22 (11) ◽

pp. e15347

Author(s):

Christopher Michael Homan ◽

J Nicolas Schrading ◽

Raymond W Ptucha ◽

Catherine Cerulli ◽

Cecilia Ovesdotter Alm

Keyword(s):

Machine Learning ◽

Social Media ◽

Intimate Partner Violence ◽

Language Processing ◽

Partner Violence ◽

Quantitative Methods ◽

Intimate Partner ◽

Support Vector ◽

Data Set ◽

Part Of Speech

Background Social media is a rich, virtually untapped source of data on the dynamics of intimate partner violence, one that is both global in scale and intimate in detail. Objective The aim of this study is to use machine learning and other computational methods to analyze social media data for the reasons victims give for staying in or leaving abusive relationships. Methods Human annotation, part-of-speech tagging, and machine learning predictive models, including support vector machines, were used on a Twitter data set of 8767 #WhyIStayed and #WhyILeft tweets each. Results Our methods explored whether we can analyze micronarratives that include details about victims, abusers, and other stakeholders, the actions that constitute abuse, and how the stakeholders respond. Conclusions Our findings are consistent across various machine learning methods, which correspond to observations in the clinical literature, and affirm the relevance of natural language processing and machine learning for exploring issues of societal importance in social media.

Download Full-text

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid (Preprint)

10.2196/preprints.26616 ◽

2020 ◽

Author(s):

Yuan-Chi Yang ◽

Mohammed Ali Al-Garadi ◽

Whitney Bremer ◽

Jane M Zhu ◽

David Grande ◽

...

Keyword(s):

Social Media ◽

Health Services ◽

Language Processing ◽

Automatic System ◽

Short Term Memory ◽

The United States ◽

Support Vector ◽

K Nearest Neighbor ◽

Automatic Categorization ◽

Consumer Feedback

BACKGROUND The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. OBJECTIVE This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example. METHODS We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. RESULTS We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. CONCLUSIONS The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies. CLINICALTRIAL

Download Full-text

Sentiment Analysis on Social Media Big Data With Multiple Tweet Words

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j9684.0881019 ◽

2019 ◽

Vol 8 (10) ◽

pp. 3429-3434 ◽

Cited By ~ 2

Keyword(s):

Machine Learning ◽

Social Media ◽

Big Data ◽

Sentiment Analysis ◽

Language Processing ◽

Sentiment Classification ◽

Support Vector ◽

Decision Tree Classifier ◽

Machine Learning Classification ◽

Tree Classifier

The main objective of this paper is Analyze the reviews of Social Media Big Data of E-Commerce product’s. And provides helpful result to online shopping customers about the product quality and also provides helpful decision making idea to the business about the customer’s mostly liking and buying products. This covers all features or opinion words, like capitalized words, sequence of repeated letters, emoji, slang words, exclamatory words, intensifiers, modifiers, conjunction words and negation words etc available in tweets. The existing work has considered only two or three features to perform Sentiment Analysis with the machine learning technique Natural Language Processing (NLP). In this proposed work familiar Machine Learning classification models namely Multinomial Naïve Bayes, Support Vector Machine, Decision Tree Classifier, and, Random Forest Classifier are used for sentiment classification. The sentiment classification is used as a decision support system for the customers and also for the business.

Download Full-text

Twitter Sentiment Analysis Model

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.36309 ◽

2021 ◽

Vol 9 (VII) ◽

pp. 296-302

Author(s):

Pushkar Dubey

Keyword(s):

Machine Learning ◽

Social Networks ◽

Social Media ◽

Deep Learning ◽

Sentiment Analysis ◽

Language Processing ◽

Learning Approach ◽

Analysis Model ◽

Market Place ◽

The Social

Social networks are the main resources to gather information about people’s opinion towards different topics as they spend hours daily on social media and share their opinion. Twitter is one of the social media that is gaining popularity. Twitter offers organizations a fast and effective way to analyze customers’ perspectives toward the critical to success in the market place. Developing a program for sentiment analysis is an approach to be used to computationally measure customers’ perceptions. .We use natural language processing and machine learning concepts to create a model for analysis . In this paper we are discussing how we can create a model for analysis of twittes which is trained by various nlp , machine learning and Deep learning Approach.

Download Full-text

Spam text classification using LSTM Recurrent Neural Network

International Journal of Emerging Trends in Engineering Research ◽

10.30534/ijeter/2021/11992021 ◽

2021 ◽

Vol 9 (9) ◽

pp. 1271-1275

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Language Processing ◽

Text Classification ◽

Short Term Memory ◽

Experimental Studies ◽

The Other ◽

Support Vector ◽

Data Points ◽

Class Labels

Sequence Classification is one of the on-demand research projects in the field of Natural Language Processing (NLP). Classifying a set of images or text into an appropriate category or class is a complex task that a lot of Machine Learning (ML) models fail to accomplish accurately and end up under-fitting the given dataset. Some of the ML algorithms used in text classification are KNN, Naïve Bayes, Support Vector Machines, Convolutional Neural Networks (CNNs), Recursive CNNs, Recurrent Neural Networks (RNNs), Long Short Term Memory (LSTM), etc. For this experimental study, LSTM and a few other algorithms were chosen for a more comparative study. The dataset used is the SMS Spam Collection Dataset from Kaggle and 150 more entries were additionally added from different sources. Two possible class labels for the data points are spam and ham. Each entry consists of the class label, a few sentences of text followed by a few useless features that are eliminated. After converting the text to the required format, the models are run and then evaluated using various metrics. In experimental studies, the LSTM gives much better classification accuracy than the other machine learning models. F1-Scores in the high nineties were achieved using LSTM for classifying the text. The other models showed very low F1-Scores and Cosine Similarities indicating that they had underperformed on the dataset. Another interesting observation is that the LSTM had reduced the number of false positives and false negatives than any other model.

Download Full-text

Ternion: An Autonomous Model for Fake News Detection

Applied Sciences ◽

10.3390/app11199292 ◽

2021 ◽

Vol 11 (19) ◽

pp. 9292

Author(s):

Noman Islam ◽

Asadullah Shaikh ◽

Asma Qaiser ◽

Yousef Asiri ◽

Sultan Almakdi ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Support Vector Machine ◽

Logistic Regression ◽

Language Processing ◽

Negative Impact ◽

Machine Learning Techniques ◽

Support Vector ◽

Fake News ◽

Processing Techniques

In recent years, the consumption of social media content to keep up with global news and to verify its authenticity has become a considerable challenge. Social media enables us to easily access news anywhere, anytime, but it also gives rise to the spread of fake news, thereby delivering false information. This also has a negative impact on society. Therefore, it is necessary to determine whether or not news spreading over social media is real. This will allow for confusion among social media users to be avoided, and it is important in ensuring positive social development. This paper proposes a novel solution by detecting the authenticity of news through natural language processing techniques. Specifically, this paper proposes a novel scheme comprising three steps, namely, stance detection, author credibility verification, and machine learning-based classification, to verify the authenticity of news. In the last stage of the proposed pipeline, several machine learning techniques are applied, such as decision trees, random forest, logistic regression, and support vector machine (SVM) algorithms. For this study, the fake news dataset was taken from Kaggle. The experimental results show an accuracy of 93.15%, precision of 92.65%, recall of 95.71%, and F1-score of 94.15% for the support vector machine algorithm. The SVM is better than the second best classifier, i.e., logistic regression, by 6.82%.

Download Full-text

Sentiment Analysis for Social Media using SVM Classifier of Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i1107.0789s419 ◽

2019 ◽

Vol 8 (9S4) ◽

pp. 39-47

Keyword(s):

Machine Learning ◽

Social Media ◽

Feature Selection ◽

Sentiment Analysis ◽

Language Processing ◽

Cuckoo Search ◽

Support Vector ◽

Svm Classifier ◽

Feature Selection Technique ◽

Performance Factors

Sentiment analysis is an area of natural language processing (NLP) and machine learning where the text is to be categorized into predefined classes i.e. positive and negative. As the field of internet and social media, both are increasing day by day, the product of these two nowadays is having many more feedbacks from the customer than before. Text generated through social media, blogs, post, review on any product, etc. has become the bested suited cases for consumer sentiment, providing a best-suited idea for that particular product. Features are an important source for the classification task as more the features are optimized, the more accurate are results. Therefore, this research paper proposes a hybrid feature selection which is a combination of Particle swarm optimization (PSO) and cuckoo search. Due to the subjective nature of social media reviews, hybrid feature selection technique outperforms the traditional technique. The performance factors like f-measure, recall, precision, and accuracy tested on twitter dataset using Support Vector Machine (SVM) classifier and compared with convolution neural network. Experimental results of this paper on the basis of different parameters show that the proposed work outperforms the existing work

Download Full-text

Sentiment Analysis on Social Media using Machine Learning Approach

10.22541/au.163620143.37655829/v1 ◽

2021 ◽

Author(s):

Erick Omuya ◽

George Okeyo ◽

Michael Kimwele

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Language Processing ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Learning Approach ◽

K Nearest Neighbor ◽

Machine Learning Approach

Social media has been embraced by different people as a convenient and official medium of communication. People write messages and attach images and videos on Twitter, Facebook and other social media which they share. Social media therefore generates a lot of data that is rich in sentiments from these updates. Sentiment analysis has been used to determine opinions of clients, for instance, relating to a particular product or company. Knowledge based approach and Machine learning approach are among the strategies that have been used to analyze these sentiments. The performance of sentiment analysis is however distorted by noise, the curse of dimensionality, the data domains and size of data used for training and testing. This research aims at developing a model for sentiment analysis in which dimensionality reduction and the use of different parts of speech improves sentiment analysis performance. It uses natural language processing for filtering, storing and performing sentiment analysis on the data from social media. The model is tested using Naïve Bayes, Support Vector Machines and K-Nearest neighbor machine learning algorithms and its performance compared with that of two other Sentiment Analysis models. Experimental results show that the model improves sentiment analysis performance using machine learning techniques.

Download Full-text

Named Entity Recognition for Code Mixed Social Media Sentences

International Journal of Software Science and Computational Intelligence ◽

10.4018/ijssci.2021040102 ◽

2021 ◽

Vol 13 (2) ◽

pp. 23-36

Author(s):

Yashvardhan Sharma ◽

Rupal Bhargava ◽

Bapiraju Vamsi Tadikonda

Keyword(s):

Social Media ◽

Language Processing ◽

Short Term Memory ◽

Named Entity Recognition ◽

Entity Recognition ◽

Machine Learning Techniques ◽

Support Vector ◽

Internet Applications ◽

Named Entity ◽

Code Mixing

With the increase of internet applications and social media platforms there has been an increase in the informal way of text communication. People belonging to different regions tend to mix their regional language with English on social media text. This has been the trend with many multilingual nations now and is commonly known as code mixing. In code mixing, multiple languages are used within a statement. The problem of named entity recognition (NER) is a well-researched topic in natural language processing (NLP), but the present NER systems tend to perform inefficiently on code-mixed text. This paper proposes three approaches to improve named entity recognizers for handling code-mixing. The first approach is based on machine learning techniques such as support vector machines and other tree-based classifiers. The second approach is based on neural networks and the third approach uses long short-term memory (LSTM) architecture to solve the problem.

Download Full-text