Detection of Spam Bots on Twitter using Machine Learning

Twitter is a popular microblogging website used to share views, opinions, and updates. Recently, however, an epidemic of spammer accounts has spread across the website, causing disorder among ordinary users. These spammers aim either to promote a commercial agenda or to disturb the peace of the online environment. Our project analyzes the tweets made by users and predicts whether they might be spammers so that appropriate action can be taken against them. This is done using machine learning. The random forest algorithm has been modified by giving weighted importance to certain variables; the weights are assigned using domain knowledge obtained from exploratory analysis of various Twitter data sets and from scientific research papers. A bag of words has also been added to the algorithm in order to quickly identify the key phrases used by spam bots. By identifying the spammers, we can systematically report them and create a more peaceful online environment.
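
The bag-of-words step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary, the tokenization rule, and the function name `bag_of_words` are assumptions made here for illustration only.

```python
import re
from collections import Counter

# Hypothetical vocabulary; the abstract does not list the phrases actually used.
VOCABULARY = ["free", "click", "win", "follow"]

def bag_of_words(tweet, vocabulary=VOCABULARY):
    """Return a count vector over `vocabulary` for a single tweet."""
    # Lowercase the text and keep alphabetic tokens only (an assumed tokenization).
    tokens = re.findall(r"[a-z']+", tweet.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

features = bag_of_words("Click here to WIN a free phone, click now!")
print(features)  # [1, 2, 1, 0]
```

A vector like this could be appended to the other account-level variables before training a random forest classifier.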


Regional Mapping of Vineyards Using Machine Learning and LiDAR Data

International Journal of Applied Geospatial Research ◽

10.4018/ijagr.2020100101 ◽

2020 ◽

Vol 11 (4) ◽

pp. 1-22

Author(s):

Adriaan Jacobus Prins ◽

Adriaan van Niekerk

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithms ◽

Window Size ◽

Machine Learning Algorithms ◽

Surface Model ◽

Lidar Data ◽

Data Sets ◽

Random Forest Algorithm ◽

Spectral Mixing

This study evaluates the use of LiDAR data and machine learning algorithms for mapping vineyards. Vineyards are planted in rows spaced at various distances, which can cause spectral mixing within individual pixels and complicate image classification. Four resolutions were used for generating normalized digital surface model and intensity derivatives from the LiDAR data. In addition, texture measures with window sizes of 3x3 and 5x5 were generated from the LiDAR derivatives. The different combinations of the resolutions and window sizes resulted in eight data sets that were used as input to 11 machine learning algorithms. A larger window size was found to improve the overall accuracy for all the classifier–resolution combinations. The results showed that random forest with texture measures generated at a 5x5 window size outperformed the other experiments, regardless of the resolution used. The authors conclude that the random forest algorithm used on LiDAR derivatives with a resolution of 1.5 m and a window size of 5x5 is the recommended configuration for vineyard mapping using LiDAR data.
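
To make the idea of a windowed texture measure concrete, the sketch below computes the local variance of a raster over a square moving window. The abstract does not say which texture measures were used, so local variance, the edge handling, and the function name `window_variance` are assumptions.

```python
from statistics import pvariance

def window_variance(grid, size):
    """Population variance of each size x size neighbourhood.

    Cells near the edge use only the part of the window that fits
    inside the grid (an assumed edge-handling rule).
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        row_out = []
        for c in range(cols):
            values = [
                grid[i][j]
                for i in range(max(0, r - half), min(rows, r + half + 1))
                for j in range(max(0, c - half), min(cols, c + half + 1))
            ]
            row_out.append(pvariance(values))
        out.append(row_out)
    return out

# Toy 3x3 height raster (made-up values), processed with a 3x3 window.
grid = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
texture = window_variance(grid, 3)
```

Calling the same function with `size=5` gives the larger window; each output layer can then be stacked with the original derivative as an extra input feature.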


A Systematic Review of Defensive and Offensive Cybersecurity with Machine Learning

Applied Sciences ◽

10.3390/app10175811 ◽

2020 ◽

Vol 10 (17) ◽

pp. 5811

Author(s):

Imatitikua D. Aiyanyo ◽

Hamman Samuel ◽

Heuiseok Lim

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Domain Knowledge ◽

Supervised Machine Learning ◽

Detection Methods ◽

Data Sets ◽

Research Topics ◽

Research Papers ◽

Learning Methods ◽

Machine Learning Methods

This is a systematic review of over one hundred research papers about machine learning methods applied to defensive and offensive cybersecurity. In contrast to previous reviews, which focused on several fragments of research topics in this area, this paper systematically and comprehensively combines domain knowledge into a single review. Ultimately, this paper seeks to provide a base for researchers who wish to delve into the field of machine learning for cybersecurity. Our findings identify the frequently used machine learning methods within supervised, unsupervised, and semi-supervised machine learning, the most useful data sets for evaluating intrusion detection methods within supervised learning, and methods from machine learning that have shown promise in tackling various threats in defensive and offensive cybersecurity.


Classification of iron oxide aerosols by a single particle soot photometer using supervised machine learning

Atmospheric Measurement Techniques ◽

10.5194/amt-12-3885-2019 ◽

2019 ◽

Vol 12 (7) ◽

pp. 3885-3906 ◽

Cited By ~ 2

Author(s):

Kara D. Lamb

Keyword(s):

Machine Learning ◽

Random Forest ◽

Test Data ◽

Single Particle ◽

Broad Band ◽

Supervised Machine Learning ◽

Data Sets ◽

Specific Class ◽

Random Forest Algorithm

Abstract. Single particle soot photometers (SP2) use laser-induced incandescence to detect aerosols on a single particle basis. SP2s that have been modified to provide greater spectral contrast between their narrow and broad-band incandescent detectors have previously been used to characterize both refractory black carbon (rBC) and light-absorbing metallic aerosols, including iron oxides (FeOx). However, single particles cannot be unambiguously identified from their incandescent peak height (a function of particle mass) and color ratio (a measure of blackbody temperature) alone. Machine learning offers a promising approach for improving the classification of these aerosols. Here we explore the advantages and limitations of classifying single particle signals obtained with a modified SP2 using a supervised machine learning algorithm. Laboratory samples of different aerosols that incandesce in the SP2 (fullerene soot, mineral dust, volcanic ash, coal fly ash, Fe2O3, and Fe3O4) were used to train a random forest algorithm. The trained algorithm was then applied to test data sets of laboratory samples and atmospheric aerosols. This method provides a systematic approach for classifying incandescent aerosols by providing a score, or conditional probability, that a particle is likely to belong to a particular aerosol class (rBC, FeOx, etc.) given its observed single particle features. We consider two alternative approaches for identifying aerosols in mixed populations based on their single particle SP2 response: one with specific class labels for each species sampled, and one with three broader classes (rBC, anthropogenic FeOx, and dust-like) for particles with similar SP2 responses. 
Predictions of the most likely particle class (the one with the highest mean probability) based on applying the trained random forest algorithm to the single particle features for test data sets comprising examples of each class are compared with the true class for those particles to estimate generalization performance. While the specific class approach performed well for rBC and Fe3O4 (≥99 % of these aerosols are correctly identified), its classification of other aerosol types is significantly worse (only 47 %–66 % of other particles are correctly identified). Using the broader class approach, we find a classification accuracy of 99 % for FeOx samples measured in the laboratory. The method allows for classification of FeOx as anthropogenic or dust-like for aerosols with effective spherical diameters from 170 to >1200 nm. The misidentification of both dust-like aerosols and rBC as anthropogenic FeOx is small, with <3 % of the dust-like aerosols and <0.1 % of rBC misidentified as FeOx for the broader class case. When applying this method to atmospheric observations taken in Boulder, CO, a clear mode consistent with FeOx was observed, distinct from dust-like aerosols.
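
The rule "most likely class = the one with the highest mean probability" can be sketched as follows. The per-tree probabilities are made-up numbers, and the function name `predict_class` is hypothetical; the sketch only shows the averaging and arg-max step, not the training of the forest.

```python
def predict_class(tree_probabilities):
    """Average per-tree class probabilities and return the top class.

    tree_probabilities: list of dicts, one per tree, mapping
    class label -> probability assigned by that tree.
    """
    labels = tree_probabilities[0].keys()
    n_trees = len(tree_probabilities)
    mean = {
        label: sum(p[label] for p in tree_probabilities) / n_trees
        for label in labels
    }
    best = max(mean, key=mean.get)
    return best, mean

# Illustrative output of three trees for one particle (invented values).
trees = [
    {"rBC": 0.2, "FeOx": 0.7, "dust-like": 0.1},
    {"rBC": 0.1, "FeOx": 0.6, "dust-like": 0.3},
    {"rBC": 0.3, "FeOx": 0.5, "dust-like": 0.2},
]
best, mean = predict_class(trees)
```

Here `mean` plays the role of the score, or conditional probability, described in the abstract, and `best` is the predicted class.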


Classification and photometric redshift estimation of quasars in photometric surveys

Proceedings of the International Astronomical Union ◽

10.1017/s1743921320001829 ◽

2020 ◽

Vol 15 (S359) ◽

pp. 40-41

Author(s):

L. M. Izuti Nakazono ◽

C. Mendes de Oliveira ◽

N. S. T. Hirata ◽

S. Jeram ◽

A. Gonzalez ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Nearest Neighbour ◽

Random Forest Algorithm ◽

Photometric Redshift ◽

Using Data

Abstract. We present a machine learning methodology to separate quasars from galaxies and stars using data from S-PLUS in the Stripe-82 region. In terms of quasar classification, we achieved 95.49% for precision and 95.26% for recall using a Random Forest algorithm. For photometric redshift estimation, we obtained a precision of 6% using k-Nearest Neighbour.
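
For readers unfamiliar with the two classification metrics quoted above, here is a small sketch of how precision and recall are computed for one class in a multi-class setting. The label strings and the toy data are assumptions, not values from the study.

```python
def precision_recall(true_labels, predicted_labels, positive="QSO"):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: quasar (QSO), galaxy (GAL), star (STAR).
truth = ["QSO", "QSO", "QSO", "GAL", "STAR", "GAL"]
preds = ["QSO", "QSO", "GAL", "QSO", "STAR", "GAL"]
precision, recall = precision_recall(truth, preds)
```

Precision is the fraction of predicted quasars that are real quasars; recall is the fraction of real quasars that were recovered.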


A novel framework for designing a multi-DoF prosthetic wrist control using machine learning

Scientific Reports ◽

10.1038/s41598-021-94449-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chinmay P. Swami ◽

Nicholas Lenhard ◽

Jiyeon Kang

Keyword(s):

Machine Learning ◽

Random Forest ◽

Upper Limb ◽

Daily Living ◽

Machine Learning Algorithms ◽

Data Sets ◽

Random Forest Regression ◽

Prosthetic Devices ◽

Upper Limb Function ◽

The Neural Network

Abstract. Prosthetic arms can significantly increase the upper limb function of individuals with upper limb loss; however, despite the development of various multi-DoF prosthetic arms, the rate of prosthesis abandonment is still high. One of the major challenges is to design a multi-DoF controller that has high precision, robustness, and intuitiveness for daily use. The present study demonstrates a novel framework for developing a controller leveraging machine learning algorithms and movement synergies to implement natural control of a 2-DoF prosthetic wrist for activities of daily living (ADL). The data were collected from ten individuals performing ADL tasks while wearing a wrist brace that emulated the absence of wrist function. Using this data, the neural network classifies the movement and then random forest regression computes the desired velocity of the prosthetic wrist. The models were trained and tested on ADL data, and their robustness was assessed using cross-validation and holdout data sets. The proposed framework demonstrated high accuracy (F1 score of 99% for the classifier and Pearson’s correlation of 0.98 for the regression). Additionally, the interpretable nature of random forest regression was used to verify the targeted movement synergies. The present work provides a novel and effective framework to develop an intuitive control for multi-DoF prosthetic devices.
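
The regression result above is reported as a Pearson correlation between predicted and measured values. A minimal sketch of that statistic follows; the function name `pearson_r` and the sample velocities are illustrative assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Invented wrist velocities: measured vs. predicted by a regressor.
measured = [0.0, 0.5, 1.0, 1.5, 2.0]
predicted = [0.1, 0.4, 1.1, 1.4, 2.1]
r = pearson_r(measured, predicted)
```

A value close to 1 means the predicted velocity rises and falls with the measured one, which is what a score of 0.98 indicates.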


Effects of air quality on the health of Mediterranean forests

10.5194/egusphere-egu21-16171 ◽

2021 ◽

Author(s):

Adrián García Bruzón ◽

Patricia Arrogante Funes ◽

Laura Muñoz Moral

Keyword(s):

Climate Change ◽

Machine Learning ◽

Random Forest ◽

Aridity Index ◽

Plant Health ◽

Mediterranean Forests ◽

Random Forest Algorithm ◽

The Mediterranean ◽

Heterogeneous Variables ◽

Peninsular Spain

Climate change has become a determining factor in the development of forests in Spain. Production systems have emitted polluting gases and other particles into the atmosphere, for which some plants have not yet developed adaptation systems. Among the pollutants most harmful to the environment are nitrous oxides, ozone, and particulate matter. However, this condition is not the same across Peninsular Spain and the Balearic Islands, since plant composition and the bioclimatic, topographic, and anthropic characteristics differ across the territory. Monitoring the vegetation with sufficient spatial and temporal resolution and studying the variables that condition plant health is a challenge because of the nature of the variables and the amount of data to be handled. The Mediterranean forest is one of the ecosystems most affected by climate change because it usually experiences long periods of drought that, in combination with increased temperatures, can drastically reduce the photosynthetic activity of trees and therefore the biomass of forests. This motivates the application of environmental technologies based on Remote Sensing (which provides plant health indices from passive sensors on satellite platforms, along with other variables of interest), Geographic Information Systems (to integrate, process, and analyze spatial and temporal data), and machine learning models (which facilitate the extraction of relationships between variables and conditioning factors and the prediction of patterns). In this regard, the objective of this work is to evaluate the possible effect that different pollutants have on the health of the vegetation, measured from the annual values of the Normalized Difference Vegetation Index (NDVI), in the Mediterranean forests of Peninsular Spain. To achieve this, we used machine learning techniques based on the Random Forest algorithm.
The study also included various climatic, topographic, and anthropic variables that characterize the forest. The results showed that certain variables, such as the aridity index, drive the NDVI values and therefore plant development, while others, such as the concentration of certain pollutants, are limiting factors; the results also showed a direct relationship between particulates and NOx. This study shows that the Random Forest algorithm offers reliable results, even when working with heterogeneous variables.
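
NDVI, the response variable in this study, is conventionally defined from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red). The sketch below applies that standard formula; the reflectance values and the handling of a zero denominator are assumptions, and the abstract does not describe how the authors obtained their NDVI values.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        # Assumption: report 0.0 for pixels with no signal in either band.
        return 0.0
    return (nir - red) / total

# Illustrative reflectances: dense vegetation reflects strongly in NIR.
vegetated = ndvi(0.5, 0.1)
bare_soil = ndvi(0.3, 0.25)
```

Values near 1 indicate dense, healthy vegetation, while values near 0 indicate sparse or stressed cover.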


Sentiment Analysis on Twitter Data of World Cup Soccer Tournament Using Machine Learning

IoT ◽

10.3390/iot1020014 ◽

2020 ◽

Vol 1 (2) ◽

pp. 218-239 ◽

Cited By ~ 2

Author(s):

Ravikumar Patel ◽

Kalpdrum Passi

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Random Forest ◽

Natural Language ◽

Language Processing ◽

Machine Learning Algorithms ◽

World Cup ◽

Part Of Speech ◽

Twitter Data ◽

Processing Techniques

In the derived approach, an analysis is performed on Twitter data for the 2014 World Cup soccer tournament held in Brazil to detect the sentiment of people throughout the world using machine learning techniques. By filtering and analyzing the data using natural language processing techniques, sentiment polarity was calculated based on the emotion words detected in the user tweets. The dataset is normalized to be used by machine learning algorithms and prepared using natural language processing techniques such as word tokenization, stemming and lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), and parsing to extract emotions from the textual data of each tweet. This approach is implemented using the Python programming language and the Natural Language Toolkit (NLTK). A derived algorithm uses WordNet and the POS of each word to extract emotional words that carry meaning in the current context of the sentence, and assigns them a sentiment polarity using the SentiWordNet dictionary or a lexicon-based method. The resultant polarity is further analyzed using naïve Bayes, support vector machine (SVM), K-nearest neighbor (KNN), and random forest machine learning algorithms and visualized on the Weka platform. Naïve Bayes gives the best accuracy of 88.17%, whereas random forest gives the best area under the receiver operating characteristic curve (AUC) of 0.97.
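
Accuracy and AUC can rank classifiers differently, as they do here. As a reminder of what AUC measures, the sketch below computes it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The scores and labels are invented, and this pairwise formulation is a standard equivalent of the area under the ROC curve, not the Weka implementation used in the paper.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy classifier scores for four tweets (1 = positive sentiment).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
print(value)  # 0.75
```

Because AUC depends only on how examples are ranked, a model can have the best AUC without having the best accuracy at one fixed threshold.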


Prediction of novel mouse TLR9 agonists using a random forest approach

BMC Molecular and Cell Biology ◽

10.1186/s12860-019-0241-0 ◽

2019 ◽

Vol 20 (S2) ◽

Author(s):

Varun Khanna ◽

Lei Li ◽

Johnson Fung ◽

Shoba Ranganathan ◽

Nikolai Petrovsky

Keyword(s):

Machine Learning ◽

Random Forest ◽

Correlation Coefficient ◽

Matthews Correlation Coefficient ◽

Learning Algorithms ◽

Ensemble Classifier ◽

Innate Immune ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Algorithm

Abstract. Background: Toll-like receptor 9 is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Due to the considerable number of rotatable bonds in ODNs, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening approaches is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including the count and position of motifs, the distance between the motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house, experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling.
Results: Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed the other algorithms on our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including ‘CC’, ‘GG’, ‘AG’, ‘CCCG’ and ‘CGGC’ were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked, and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay, with 91 of the 100 selected ODNs showing high activity, confirming the accuracy of the model in predicting mTLR9 activity.
Conclusion: We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
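
Both metrics reported here are designed for imbalanced data. The sketch below computes them from a binary confusion matrix using their standard definitions; the counts are made up, and the function name `mcc_and_balanced_accuracy` is hypothetical.

```python
from math import sqrt

def mcc_and_balanced_accuracy(tp, tn, fp, fn):
    """Matthews correlation coefficient and balanced accuracy.

    Assumes both classes are present, so tp + fn > 0 and tn + fp > 0.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return mcc, (sensitivity + specificity) / 2

# Invented imbalanced test set: 10 agonists, 100 non-agonists.
mcc, balanced = mcc_and_balanced_accuracy(tp=8, tn=80, fp=20, fn=2)
```

In this toy case plain accuracy would be 88/110 = 0.8, the same as the balanced accuracy, yet the MCC is only about 0.40, which shows how the false positives on the large class weigh on the correlation measure.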


Research on machine learning framework based on random forest algorithm

10.1063/1.4977376 ◽

2017 ◽

Cited By ~ 5

Author(s):

Qiong Ren ◽

Hui Cheng ◽

Hai Han

Keyword(s):

Machine Learning ◽

Random Forest ◽

Random Forest Algorithm ◽

Learning Framework


An Analytical Model for Prediction of Heart Disease using Machine Learning Classifiers

10.36227/techrxiv.14867175 ◽

2021 ◽

Author(s):

Diti Roy ◽

Md. Ashiq Mahmood ◽

Tamal Joyti Roy

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Random Forest ◽

Learning Algorithm ◽

Modern Technology ◽

Learning Approach ◽

Data Sets ◽

Machine Learning Classifiers ◽

Machine Learning Approach ◽

Day By Day

Heart disease is the leading disease in terms of mortality, causing a large number of deaths every year. A 2016 WHO report stated that at least 17 million people die of heart disease every year. This number is steadily increasing, and the WHO estimated that the death toll will reach 75 million by 2030. Despite modern technology and health care systems, predicting heart disease remains difficult. Because machine learning algorithms are a vital tool for making predictions from available data sets, we have used a machine learning approach to predict heart disease. We collected data from the UCI repository. In our study, we used the Random Forest, Zero R, Voted Perceptron, and K star classifiers. We obtained the best result with the Random Forest classifier, with an accuracy of 97.69%.
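
Of the classifiers listed, Zero R is the simplest: it ignores all input features and always predicts the most frequent class in the training data, which makes it a useful baseline for the other results. A minimal sketch follows; the function name `zero_r` and the labels are illustrative assumptions, not the Weka implementation.

```python
from collections import Counter

def zero_r(train_labels):
    """Return a predictor that always outputs the majority training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority

# Toy training labels (invented); features are ignored by design.
model = zero_r(["healthy", "disease", "healthy", "healthy"])
prediction = model({"age": 54, "cholesterol": 230})
```

Any classifier that cannot beat this baseline has learned nothing from the features.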
