Statistical Analysis for Twitter Spam Detection

Author(s):  
Ganesh Udge ◽  
Mahesh Mohite ◽  
Shubhankar Bendre ◽  
Yogeshwar Birnagal ◽  
Disha Wankhede

Online social networks make new discoveries and information easy to spread and learn from. Increasingly, however, posted content is irrelevant to the actual topic; in lay terms such postings are attacks, and on Twitter the accounts performing them are called Twitter spammers. Spammers compromise data quality by injecting malicious and harmful content via URLs, bios, emoticons, audio, images/videos and hashtags across multiple accounts, exchanged through tweets, personal messages (Direct Messages) and retweets. Malicious links may lead to misleading sites that harm users and interfere with their decision-making. To protect the user experience from spammer attacks, a Twitter training dataset is used, from which 12 lightweight features, such as the account's age, number of followers, and counts of tweets and retweets, are extracted to distinguish spam from non-spam. To enhance performance, feature discretization is important for transferring spam detection across tweets. Our system builds a binary classification model for spam detection using automatic learning algorithms, namely a Naïve Bayes classifier or a Support Vector Machine classifier, which learns the behaviour of the data. The system categorizes tweets into spam and non-spam classes and supplies the user's feed with only relevant information. It also reports the impact of data-related factors, such as the ratio of spam to non-spam tweets, the size of the training dataset, and data sampling, on detection performance. The proposed system's function is the detection and analysis of both simple and evolving Twitter spam over time.
Spam detection is a major challenge for such a system. The approach focuses primarily on data, features and patterns to identify real users, inform them about spam tweets, and report performance statistics. The goal is to detect spam tweets in real time, since new tweets may show new patterns, which in turn helps with training and updating the dataset and the knowledge base.
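A minimal sketch of the kind of classifier comparison the abstract describes, assuming synthetic data: four made-up "lightweight" account features (stand-ins for the paper's 12) and the two classifiers it names, Naïve Bayes and SVM. The feature values and their distributions are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
# Hypothetical lightweight features: account age (days), followers,
# tweet count, retweet count -- stand-ins for the paper's 12 features.
spam = rng.normal([30, 50, 900, 400], [10, 20, 200, 100], size=(n // 2, 4))
ham = rng.normal([800, 600, 300, 60], [200, 150, 100, 30], size=(n // 2, 4))
X = np.vstack([spam, ham])
y = np.array([1] * (n // 2) + [0] * (n // 2))  # 1 = spam, 0 = non-spam

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
results = {}
for clf in (GaussianNB(), SVC(kernel="linear")):
    # Binary classification: fit on the training split, score on the held-out split.
    results[type(clf).__name__] = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(results)
```

With real tweet data the interesting part is the feature engineering; the classification step itself stays this small.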

Author(s):  
S. Boeke ◽  
M. J. C. van den Homberg ◽  
A. Teklesadik ◽  
J. L. D. Fabila ◽  
D. Riquet ◽  
...  

Abstract. Reliable predictions of the impact of natural hazards turning into a disaster are important for better targeting humanitarian response as well as for triggering early action. Open data and machine learning can be used to predict loss and damage to the houses and livelihoods of affected people. This research focuses on agricultural loss, more specifically rice loss in the Philippines due to typhoons. Regression and binary classification algorithms are trained using feature selection methods to find the most important explanatory features. Both geographical data from every province and typhoon-specific features of 11 historical typhoons are used as input. The percentage of lost rice area is the output, with an average value of 7.1%. On the regression task, the support vector regressor performed best, with a Mean Absolute Error of 6.83 percentage points. For the classification model, thresholds of 20%, 30% and 40% were tested to find the best-performing model. These thresholds represent different levels of lost rice fields for triggering anticipatory action towards farmers. The binary classifiers are trained to increase their ability to correctly predict the positive samples. In all three cases, the support vector classifier performed best, with recall scores of 88%, 75% and 81.82%, respectively. However, the precision score for each of these models was low: 17.05%, 14.46% and 10.84%, respectively. For both the support vector regressor and classifier, of all 14 available input features, only wind speed was selected as an explanatory feature. For the other algorithms trained in this study, other sets of features were selected, depending also on the hyperparameter settings. This variation in selected feature sets, as well as the imprecise predictions, is a consequence of the small dataset used for this study.
It is therefore important to gather data for more typhoons, as well as data on other explanatory variables, in order to make more robust and accurate predictions. Also, if loss data becomes available at municipality level rather than province level, the models will become more accurate and more valuable for operationalization.


Entropy ◽  
2021 ◽  
Vol 23 (11) ◽  
pp. 1502
Author(s):  
Ben Wilkes ◽  
Igor Vatolkin ◽  
Heinrich Müller

We present a multi-modal genre recognition framework that considers the modalities audio, text, and image through features extracted from audio signals, album cover images, and lyrics of music tracks. In contrast to pure learning of features by a neural network, as done in related work, handcrafted features designed for each modality are also integrated, allowing for higher interpretability of the created models and further theoretical analysis of the impact of individual features on genre prediction. Genre recognition is performed by binary classification of a music track with respect to each genre, based on combinations of elementary features. For feature combination, a two-level technique is used, which combines aggregation into fixed-length feature vectors with confidence-based fusion of classification results. Extensive experiments were conducted for three classifier models (Naïve Bayes, Support Vector Machine, and Random Forest) and numerous feature combinations. The results are presented visually, with data reduction for improved perceptibility achieved by multi-objective analysis and restriction to non-dominated data. Feature- and classifier-related hypotheses are formulated based on the data, and their statistical significance is formally analyzed. The statistical analysis shows that the combination of two modalities almost always leads to a significant increase in performance, and the combination of three modalities does so in several cases.
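The second level of the combination technique, confidence-based (late) fusion, can be sketched as below: one classifier per modality, with class probabilities averaged before the final decision. The "audio" and "lyrics" feature matrices are synthetic stand-ins, not extracted from real tracks, and the classifiers are two of the three named in the abstract.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)            # binary genre membership per track
# Synthetic fixed-length feature vectors for two modalities.
audio = y[:, None] + rng.normal(0, 0.8, (200, 5))
text = y[:, None] + rng.normal(0, 0.8, (200, 5))

clf_a = GaussianNB().fit(audio, y)                         # audio-modality model
clf_t = RandomForestClassifier(random_state=0).fit(text, y)  # lyrics-modality model

# Confidence-based fusion: average the per-class probabilities, then decide.
proba = (clf_a.predict_proba(audio) + clf_t.predict_proba(text)) / 2
fused = proba.argmax(axis=1)
acc = (fused == y).mean()
print(round(acc, 3))
```

Averaging confidences rather than hard votes is what lets a confident modality outweigh an uncertain one.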


2022 ◽  
Vol 9 (1) ◽  
pp. 0-0

This article investigates the impact of data-complexity and team-specific characteristics on machine learning competition scores. Data from five real-world binary classification competitions hosted on Kaggle.com were analyzed. The data-complexity characteristics were measured in four aspects: standard measures, sparsity measures, class imbalance measures, and feature-based measures. The results showed that the higher the level of data complexity, the lower the predictive ability of the machine learning model. Our empirical evidence revealed that the imbalance ratio of the target variable was the most important factor and exhibited a nonlinear relationship with the model's predictive ability: the imbalance ratio adversely affected predictive performance once it reached a certain level. However, mixed results were found for the impact of team-specific characteristics, measured by team size, team expertise, and the number of submissions, on team performance. For high-performing teams, these factors had no impact on team score.
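The imbalance ratio discussed above is commonly computed as the majority-class count divided by the minority-class count; a sketch of that measure (the function name is ours, not the article's):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (>= 1.0)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# A target with 90 negatives and 10 positives has an imbalance ratio of 9.
y = [0] * 90 + [1] * 10
print(imbalance_ratio(y))  # → 9.0
```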


Author(s):  
Chen Xia ◽  
Yuqing Hu

Natural disasters are increasing in magnitude, frequency, and geographic distribution. Studies have shown that individuals' self-sufficiency, which largely depends on household preparedness, is very important for hazard mitigation in at least the first 72 hours after a disaster. However, although many studies have tried to identify, from different aspects, the factors that influence a household's disaster preparedness, we still lack an integrative analysis of how these factors contribute to a household's preparation. This paper aims to build a classification model that predicts whether a household has prepared for a potential disaster, based on personal characteristics and the environment in which the household is located. We collected data from the Federal Emergency Management Agency's National Household Survey in 2018 and trained four classification models - logistic regression, decision tree, support vector machine, and multi-layer perceptron - to predict household preparation for a potential natural disaster. Results show that the multi-layer perceptron classifier outperforms the others, with the highest recall (0.8531) and F1 score (0.7386). In addition, feature selection results show that, among other factors, a household's access to disaster-related information is the most critical factor in household disaster preparation. Though there is still room for further parameter optimization, the model suggests that disaster management could be supported by gathering publicly accessible data.
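A hedged sketch of the best-performing setup above: a multi-layer perceptron scored on recall and F1. The two features (information access and income) and the survey-like data are synthetic placeholders; the real study uses FEMA's 2018 National Household Survey.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, f1_score

rng = np.random.default_rng(3)
n = 600
info_access = rng.integers(0, 2, n)   # hypothetical: access to disaster information
income = rng.normal(0, 1, n)          # hypothetical: standardized income
# Preparedness depends mostly on information access (the study's key factor).
p = 1 / (1 + np.exp(-(2.5 * info_access + 0.5 * income - 1)))
prepared = (rng.uniform(size=n) < p).astype(int)

X = np.column_stack([info_access, income])
X_tr, X_te, y_tr, y_te = train_test_split(X, prepared, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
rec = recall_score(y_te, pred)
f1 = f1_score(y_te, pred)
print(round(rec, 3), round(f1, 3))
```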


Author(s):  
Jie Xu ◽  
Xianglong Liu ◽  
Zhouyuan Huo ◽  
Cheng Deng ◽  
Feiping Nie ◽  
...  

Support Vector Machine (SVM) was originally proposed as a binary classification model and has achieved great success in many applications. In practice, however, problems often involve more than two classes, so it is natural to extend SVM to a multi-class classifier. Many works construct a multi-class classifier from binary SVMs, such as the one-versus-all strategy, the one-versus-one strategy, and Weston's multi-class SVM. The one-versus-all and one-versus-one strategies split the multi-class problem into multiple binary classification subproblems and require training multiple binary classifiers. Weston's multi-class SVM is formed by enforcing risk constraints and imposing a specific regularization, such as the Frobenius norm; it is not derived by maximizing the margin between the hyperplane and the training data, which is the original motivation of SVM. In this paper, we propose a multi-class SVM model from the perspective of maximizing the margin between training points and hyperplanes, and analyze the relation between our model and other related methods. Experiments show that our model obtains better or comparable results compared with related methods.
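The two decomposition strategies mentioned above can be sketched with scikit-learn's wrappers (this illustrates the baselines, not the authors' proposed margin-based model). One-versus-all trains one binary SVM per class; one-versus-one trains one per class pair, i.e. k(k-1)/2 of them.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One-versus-all: 3 binary classifiers, one per class.
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
# One-versus-one: 3 * (3 - 1) / 2 = 3 pairwise binary classifiers.
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))
```

For 3 classes the two strategies happen to train the same number of models; for 10 classes one-versus-one would need 45 while one-versus-all still needs 10.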


Telematika ◽  
2018 ◽  
Vol 15 (1) ◽  
pp. 77
Author(s):  
Resky Rayvano Moningka ◽  
Djoko Budiyanto Setyohadi ◽  
Khaerunnisa Khaerunnisa ◽  
Pranowo Pranowo

Abstract: The 2010 eruption of Mount Merapi was the biggest since 1872. Its impact was felt by the people living in the areas affected by the eruption, so disaster management was carried out; one part of it was the fulfillment of basic needs. This research aims to collect public opinion on the fulfillment of basic needs in the shelters after the Merapi eruption, based on Twitter data. The algorithm used in this research is the Support Vector Machine, which builds a classification model over the collected data. The expected result of this study is to identify the basic needs of a shelter. The accuracy obtained with 10-fold cross-validation is 87.96% for the Support Vector Machine and 87.45% for Maximum Entropy. Keywords: twitter, sentiment analysis, merapi eruption, support vector machine
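The evaluation protocol above (an SVM text classifier scored with 10-fold cross-validation) can be sketched as follows. The toy "tweets" are invented English stand-ins for the Indonesian Twitter data, and TF-IDF is an assumed featurization; the paper does not specify its exact text features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Invented examples: tweets reporting unmet basic needs vs. tweets reporting none.
need = ["we need clean water here", "food supplies are running out"] * 10
ok = ["everything is fine at the shelter", "enough blankets for tonight"] * 10
texts = need + ok
labels = [1] * len(need) + [0] * len(ok)

model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, texts, labels, cv=10)  # 10-fold cross-validation
print(round(scores.mean(), 2))
```

On this trivially repetitive toy data the score is near perfect; the paper's 87.96% reflects real, noisy tweets.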


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0237412
Author(s):  
Louisa-Marie Krützfeldt ◽  
Max Schubach ◽  
Martin Kircher

Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches to negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing the best overall results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs matched and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
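A minimal sketch of the "shuffled positives" negative-set strategy discussed above: a plain per-sequence character shuffle that preserves base composition. Note this is the simplest variant; the study also considers k-mer-frequency-preserving shuffles and matched genomic background sequences, which this sketch does not implement.

```python
import random

def shuffle_negative(seq, seed=0):
    """Return a negative training sequence with the same base
    composition (mono-nucleotide frequencies) as the positive `seq`."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)

positive = "ACGTACGTTTGACA"       # a putative regulatory sequence (toy)
negative = shuffle_negative(positive)
print(negative)
```

Because only composition is preserved, higher-order structure (dinucleotide frequencies, motifs) is destroyed, which is exactly why models can end up learning "artificial sequence" features, as the abstract reports.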




Agriculture ◽  
2021 ◽  
Vol 11 (11) ◽  
pp. 1106
Author(s):  
Yan Hu ◽  
Lijia Xu ◽  
Peng Huang ◽  
Xiong Luo ◽  
Peng Wang ◽  
...  

A rapid and nondestructive tea classification method is of great research significance today. This study uses fluorescence hyperspectral technology and machine learning to distinguish Oolong tea by analyzing the spectral features of tea in the wavelength range from 475 to 1100 nm. The spectral data are preprocessed by multivariate scattering correction (MSC) and standard normal variate (SNV), which effectively reduce the impact of baseline drift and tilt. Then principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are adopted for feature dimensionality reduction and visual display. Random Forest-Recursive Feature Elimination (RF-RFE) is used for feature selection. Decision Tree (DT), Random Forest Classification (RFC), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) are used to establish the classification models. The results show that MSC-RF-RFE-SVM is the best model for the classification of Oolong tea, with accuracies of 100% on the training set and 98.73% on the test set. It can be concluded that fluorescence hyperspectral technology and machine learning are feasible for classifying Oolong tea.
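Two steps of the pipeline above can be sketched with synthetic spectra: SNV preprocessing (row-wise centering and scaling, which removes per-sample baseline offsets) followed by PCA and an SVM. The spectra, class-specific bands, and drift model are invented, and the RF-RFE feature-selection step is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row-wise)."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

rng = np.random.default_rng(4)
y = rng.integers(0, 3, 120)                  # three hypothetical tea grades
spectra = rng.normal(0, 0.1, (120, 50))      # 50 synthetic wavelength channels
for i, cls in enumerate(y):
    spectra[i, 10 + 10 * cls : 18 + 10 * cls] += 1.0   # class-specific band
spectra += rng.normal(0, 1, (120, 1))        # simulated per-sample baseline drift

X = snv(spectra)                             # drift removed, band shape preserved
model = make_pipeline(PCA(n_components=5), SVC()).fit(X, y)
acc = model.score(X, y)
print(round(acc, 2))
```

SNV kills the constant per-spectrum offset because it is absorbed into each row's mean, while the class-discriminative band survives as relative shape.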

