A Semi-Supervised Learning Approach for Tackling Twitter Spam Drift

Author(s):  
Niddal Imam ◽  
Biju Issac ◽  
Seibu Mary Jacob

Twitter has changed the way people get information by allowing them to express their opinions and comment on daily tweets. Unfortunately, its high popularity has made Twitter very attractive to spammers, and Twitter spam has become a serious issue in the last few years. The large number of users and the high volume of information shared on Twitter play an important role in accelerating the spread of spam. To protect users, Twitter and the research community have been developing spam detection systems based on different machine-learning techniques. However, a recent study showed that current machine learning-based detection systems cannot detect spam accurately because the characteristics of spam tweets vary over time. This issue is called “Twitter Spam Drift”. In this paper, a semi-supervised learning approach (SSLA) is proposed to tackle this issue. The new approach uses unlabeled data to learn the structure of the domain. Experiments were performed on English and Arabic datasets to test and evaluate the proposed approach, and the results show that the proposed SSLA can reduce the effect of Twitter spam drift and outperform existing techniques.

2016 ◽  
Author(s):  
Philippe Desjardins-Proulx ◽  
Idaline Laigle ◽  
Timothée Poisot ◽  
Dominique Gravel

Abstract: Species interactions are a key component of ecosystems, but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with other machine learning techniques. Recommenders are algorithms developed for companies like Netflix to predict if a customer would like a product given the preferences of similar customers. These machine learning techniques are well-suited to study binary ecological interactions since they focus on positive-only data. We also explore how the K nearest neighbour approach can be used with both positive and negative information, in which case the goal of the algorithm is to fill missing entries from a matrix (imputation). By removing a prey from a predator, we find that recommenders can guess the missing prey around 50% of the time on the first try, out of up to 881 possibilities. Traits do not significantly improve the results for the K nearest neighbour, although a simple test with a supervised learning approach (random forests) shows we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species’ phylogeny. These techniques are complementary, as recommenders can predict interactions in the absence of traits, using only information about other species’ interactions, while supervised learning algorithms such as random forests base their predictions on traits only but do not exploit other species’ interactions. Further work should focus on developing custom similarity measures specialized to ecology to improve the KNN algorithms and using richer data to capture indirect relationships between species.
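The leave-one-out idea described in this abstract (remove one prey from a predator, then ask a K nearest neighbour recommender to guess it) can be illustrated with a minimal sketch. This is not the authors' implementation: the Jaccard similarity, the simple vote counting, the function names `jaccard` and `recommend_prey`, and the toy food web are all assumptions made for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two prey sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_prey(diets, predator, k=2):
    """Rank prey missing from `predator`'s diet by votes from its k most similar predators."""
    target = diets[predator]
    # Sort the other predators by diet similarity, most similar first.
    neighbours = sorted(
        (other for other in diets if other != predator),
        key=lambda other: jaccard(target, diets[other]),
        reverse=True,
    )[:k]
    votes = Counter()
    for other in neighbours:
        # Each neighbour votes once for every prey the target does not already eat.
        votes.update(diets[other] - target)
    return [prey for prey, _ in votes.most_common()]


# Toy, made-up food web (hypothetical data, positive-only interactions).
diets = {
    "fox": {"rabbit", "mouse", "vole"},
    "owl": {"mouse", "vole", "rabbit", "shrew"},
    "hawk": {"rabbit", "mouse", "snake"},
    "heron": {"frog", "fish"},
}

# Hold out one known prey of the fox and check whether it is recommended first.
held_out = "rabbit"
diets_test = dict(diets, fox=diets["fox"] - {held_out})
ranking = recommend_prey(diets_test, "fox", k=2)
print(ranking[0] == held_out)
```

In this toy web the held-out prey comes back as the first recommendation because both of the fox's nearest neighbours eat it. Repeating the procedure over every predator-prey pair would give a first-try success rate of the kind the abstract reports.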


Author(s):  
Rashida Ali ◽  
Ibrahim Rampurawala ◽  
Mayuri Wandhe ◽  
Ruchika Shrikhande ◽  
Arpita Bhatkar

The Internet provides a medium for connecting with individuals of similar or different interests, creating a hub. Because a huge number of people participate on these platforms, a user can receive a high volume of messages from different individuals, many of them unwanted, which creates chaos. These messages sometimes contain true information and sometimes false information, which confuses users and is a first step towards spam messaging. A spam message is an irrelevant and unsolicited message sent by a known or unknown user, which may lead to a sense of insecurity among users. In this paper, different machine learning algorithms were trained and tested with natural language processing (NLP) to classify messages as spam or ham.


2020 ◽  
Vol 10 (15) ◽  
pp. 5208
Author(s):  
Mohammed Nasser Al-Mhiqani ◽  
Rabiah Ahmad ◽  
Z. Zainal Abidin ◽  
Warusia Yassin ◽  
Aslinda Hassan ◽  
...  

Insider threat has become a widely accepted issue and one of the major challenges in cybersecurity. This phenomenon indicates that threats require special detection systems, methods, and tools, which entail the ability to facilitate accurate and fast detection of a malicious insider. Several studies on insider threat detection and related areas in dealing with this issue have been proposed. Various studies aimed to deepen the conceptual understanding of insider threats. However, there are many limitations, such as a lack of real cases, biases in making conclusions, which are a major concern and remain unclear, and the lack of a study that surveys insider threats from many different perspectives and focuses on the theoretical, technical, and statistical aspects of insider threats. The survey aims to present a taxonomy of contemporary insider types, access, level, motivation, insider profiling, effect security property, and methods used by attackers to conduct attacks and a review of notable recent works on insider threat detection, which covers the analyzed behaviors, machine-learning techniques, dataset, detection methodology, and evaluation metrics. Several real cases of insider threats have been analyzed to provide statistical information about insiders. In addition, this survey highlights the challenges faced by other researchers and provides recommendations to minimize obstacles.


2019 ◽  
pp. 030573561987160 ◽  
Author(s):  
Manuel Anglada-Tort ◽  
Amanda E Krause ◽  
Adrian C North

The present study investigated how the gender distribution of the United Kingdom’s most popular artists has changed over time and the extent to which these changes might relate to popular music lyrics. Using data mining and machine learning techniques, we analyzed all songs that reached the UK weekly top 5 sales charts from 1960 to 2015 (4,222 songs). DICTION software facilitated a computerized analysis of the lyrics, measuring a total of 36 lyrical variables per song. Results showed a significant inequality in gender representation on the charts. However, the presence of female musicians increased significantly over the time span. The most critical inflection points leading to changes in the prevalence of female musicians were in 1968, 1976, and 1984. Linear mixed-effect models showed that the total number of words and the use of self-reference in popular music lyrics changed significantly as a function of musicians’ gender distribution over time, and particularly around the three critical inflection points identified. Irrespective of gender, there was a significant trend toward increasing repetition in the lyrics over time. Results are discussed in terms of the potential advantages of using machine learning techniques to study naturalistic singles sales charts data.


Author(s):  
Zhao Zhang ◽  
Yun Yuan ◽  
Xianfeng (Terry) Yang

Accurate and timely estimation of freeway traffic speeds by short segments plays an important role in traffic monitoring systems. In the literature, the ability of machine learning techniques to capture the stochastic characteristics of traffic has been demonstrated. Also, the deployment of intelligent transportation systems (ITSs) has provided enriched traffic data, which enables the adoption of a variety of machine learning methods to estimate freeway traffic speeds. However, limited data quality and coverage remain a major challenge in current traffic monitoring systems. To overcome this problem, this study aims to develop a hybrid machine learning approach, by creating a new training variable based on the second-order traffic flow model, to improve the accuracy of traffic speed estimation. Grounded on a novel integrated framework, the estimation is performed using three machine learning techniques, that is, Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Artificial Neural Network (ANN). All three models are trained with the integrated dataset including the traffic flow model estimates and the iPeMS and PeMS data from the Utah Department of Transportation (DOT). Further, using the PeMS data as the ground truth for model evaluation, the comparisons between the hybrid approach and pure machine learning models show that the hybrid approach can effectively capture the time-varying pattern of the traffic and help improve the estimation accuracy.

