Efficient Detection of Phishing Hyperlinks using Machine Learning

2021 ◽  
Vol 10 (02) ◽  
pp. 23-33
Author(s):  
Anshumaan Mishra ◽  
Fancy Fancy

Phishing is a type of social engineering cyber-attack that hackers use to gain access to confidential credentials such as bank account details, debit card details, social media credentials, and other personal information. Phishing website links look just like genuine ones, and differentiating between them is a tedious and troublesome task. In this paper, features are extracted from a dataset of phishing and benign website URLs, and machine learning methods are then used to identify the phishing websites. We also rank the features by how much each contributes to the classification of a URL, using built-in Python libraries. Most phishing URLs used in attacks have a large URL length. Hence, we propose three machine learning models, Random Forest, Support Vector Machine (SVM), and Decision Tree, for the efficient detection of phishing via fake URLs. The performance of the models is compared using a confusion matrix to determine which performs best. The implemented models achieved accuracies of 84.81% (Random Forest and SVM) and 83.96% (Decision Tree).
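As an illustration of the kind of pipeline this abstract describes, the following is a minimal Python sketch, not the authors' code: the input file "urls.csv", its columns, and the lexical features chosen here are all assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

def url_features(url: str) -> dict:
    # Simple hand-crafted lexical features; phishing URLs tend to be long
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "has_at": int("@" in url),
        "has_https": int(url.startswith("https")),
    }

df = pd.read_csv("urls.csv")                      # hypothetical columns: url, label (0=benign, 1=phishing)
X = pd.DataFrame(df["url"].map(url_features).tolist())
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))

# Feature ranking from the Random Forest, analogous to the paper's ranking step
rf = models["Random Forest"]
for feat, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1]):
    print(feat, round(imp, 3))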

Author(s):  
Nelson Yego ◽  
Juma Kasozi ◽  
Joseph Nkrunziza

The role of insurance in financial inclusion, as well as in economic growth, is immense. However, low uptake seems to impede the growth of the sector, hence the need for a model that robustly predicts uptake of insurance among potential clients. In this research, we compared the performance of eight machine learning models in predicting the uptake of insurance. The classifiers considered were Logistic Regression, Gaussian Naive Bayes, Support Vector Machines, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines, and Extreme Gradient Boosting. The data used in the classification were from the 2016 Kenya FinAccess Household Survey. Performance was compared on both upsampled and downsampled data because of class imbalance. For upsampled data, the Random Forest classifier showed the highest accuracy and precision compared to the other classifiers, while for downsampled data gradient boosting was optimal. It is noteworthy that for both upsampled and downsampled data, tree-based classifiers were more robust than the others in predicting insurance uptake. However, in spite of hyper-parameter optimization, the area under the receiver operating characteristic curve remained highest for Random Forest compared to the other tree-based models. The confusion matrix for Random Forest also showed the fewest false positives and the most true positives, so it could be construed as the most robust model for predicting insurance uptake. Finally, the most important feature in predicting uptake was having a bank product, so bancassurance could be said to be a plausible channel for distributing insurance products.
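A rough sketch of this kind of comparison under class imbalance is shown below; the file name "finaccess_2016.csv" and the "uptake" label column are assumptions, and only two of the eight classifiers are shown for brevity.

import pandas as pd
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

df = pd.read_csv("finaccess_2016.csv")            # hypothetical numeric extract of the survey data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="uptake"), df["uptake"], test_size=0.3, stratify=df["uptake"], random_state=42)

# Upsample the minority class in the training split only, to avoid leakage into the test set
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["uptake"] == 0]
minority = train[train["uptake"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
X_train_bal, y_train_bal = balanced.drop(columns="uptake"), balanced["uptake"]

for name, clf in {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}.items():
    clf.fit(X_train_bal, y_train_bal)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)[:, 1]
    print(name,
          "accuracy", round(accuracy_score(y_test, pred), 3),
          "precision", round(precision_score(y_test, pred), 3),
          "AUC", round(roc_auc_score(y_test, proba), 3))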


2020 ◽  
Author(s):  
Olena Piskunova ◽  
Rostyslav Klochko

Due to the rapid development of e-commerce and increased competition in the retail market of Ukraine, companies are forced to look for new ways to grow their business. One option is to optimize business processes, in particular to increase the efficiency of marketing activities. Predicting consumer behavior is one of the most effective ways of optimizing marketing budgets, by building processes around the individual characteristics of each client. The aim of the study was to predict the behavior of online store customers, namely the time until the next order, using machine learning methods, and to carry out a comparative analysis of the effectiveness of different modeling algorithms. Five classification algorithms were implemented: linear discriminant analysis, classification and regression trees, random forest, support vector machine, and k-nearest neighbors, and a comparative analysis of their efficiency was performed. Given the peculiarities of customer behavior, the following intervals for the time of the next order are proposed: up to two months, two to six months, six to fifteen months, and no order. Predicting such intervals allows us to identify customers who are more likely to make the next purchase and focus advertising budgets on them, or to build a customer experience management strategy: reactivate customers who have left, and offer discounts to customers who are about to leave. The assessment of classification model quality is based on the confusion matrix and the forecasting accuracy indicators "Accuracy", "F1", "Recall", and "Precision". The study led us to prefer the random forest classification model. Tenfold cross-validation was used to improve the quality of the modeling. The weighted "F1" score in the groups "up to two months" and "two to six months" reached 62.5% and 64.1%, respectively. The developed model should reduce the influence of the human factor on decision-making in the construction of marketing strategies.
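A minimal sketch of the interval-classification setup with the preferred model and tenfold cross-validation follows; the feature table "customers.csv" and the "interval" label column are assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("customers.csv")               # hypothetical per-customer feature table
X = df.drop(columns="interval")
y = df["interval"]                              # "up to 2 months", "2-6 months", "6-15 months", "no order"

rf = RandomForestClassifier(n_estimators=300, random_state=42)
pred = cross_val_predict(rf, X, y, cv=10)       # tenfold cross-validation, as in the study

print(confusion_matrix(y, pred))
print(classification_report(y, pred))           # per-class precision, recall, and F1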


2020 ◽  
Vol 8 (5) ◽  
pp. 1414-1417

In recent years, the use of smartphones has increased steadily, and the number of Android application users has grown with it. Due to this growth, some intruders create malicious Android applications as tools to steal sensitive data. An effective and efficient detection tool is needed to handle the new, complex malicious apps created by intruders or hackers. This project deals with the idea of using machine learning approaches to detect malicious Android applications. First, a dataset of past malicious apps is gathered as a training set; then, with the help of the Support Vector Machine and Decision Tree algorithms, comparing new samples against the trained dataset, we can detect up to 93.2% of unknown/new malware mobile applications. SIGPID (Significant Permission Identification) is also implemented; its goal is to analyze app permissions effectively and efficiently, which improves the accuracy and efficiency of malware detection. With the help of machine learning algorithms such as SVM, Random Forest, and Decision Tree, we compare the training dataset and the trained dataset to classify applications as malicious or benign.
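The sketch below illustrates one way such a permission-based classifier could look; it is not the project's code. The file "apk_permissions.csv", the "malicious" label, and the use of chi-squared selection as a stand-in for the significant-permission step are all assumptions.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("apk_permissions.csv")         # one 0/1 column per permission, plus "malicious"
X = df.drop(columns="malicious")
y = df["malicious"]

# Keep only the most significant permissions before training
X_sig = SelectKBest(chi2, k=25).fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_sig, y, test_size=0.2, random_state=42)

for name, clf in {
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))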


2019 ◽  
Vol 23 (1) ◽  
pp. 12-21 ◽  
Author(s):  
Shikha N. Khera ◽  
Divya

The information technology (IT) industry in India has been facing a systemic issue of high attrition in the past few years, resulting in monetary and knowledge-based losses to companies. The aim of this research is to develop a model to predict employee attrition and give organizations the opportunity to address issues and improve retention. A predictive model was developed based on a supervised machine learning algorithm, the support vector machine (SVM). Archival employee data (consisting of 22 input features) were collected from the Human Resource databases of three IT companies in India, including each employee's employment status (the response variable) at the time of collection. Accuracy results from the confusion matrix for the SVM model showed that the model has an accuracy of 85 per cent. The results also show that the model performs better at predicting who will leave the firm than at predicting who will not.
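A minimal sketch of an SVM attrition model of this kind is given below, assuming a numeric 22-feature employee table "employees.csv" with an "attrited" label; neither of these names comes from the paper.

import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

df = pd.read_csv("employees.csv")               # 22 input features + "attrited" (1 = left)
X = df.drop(columns="attrited")
y = df["attrited"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(confusion_matrix(y_test, pred))
# Per-class recall makes the asymmetry visible: the paper reports the model is
# better at identifying leavers than stayers.
print(classification_report(y_test, pred, target_names=["stays", "leaves"]))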


2021 ◽  
pp. 1-17
Author(s):  
Ahmed Al-Tarawneh ◽  
Ja’afer Al-Saraireh

Twitter is one of the most popular platforms used to share and post ideas. Hackers and anonymous attackers use these platforms maliciously, and their behavior can be used to predict the risk of future attacks by gathering and classifying hackers' tweets with machine-learning techniques. Previous approaches to detecting infected tweets are based on human effort or text analysis and are therefore limited in capturing the hidden meaning between tweet lines. The main aim of this research paper is to enhance the efficiency of hacker detection on the Twitter platform by combining the complex networks technique with adapted machine learning algorithms. This work presents a methodology that collects a list of users, together with their followers, who share posts with similar interests from a hackers' community on Twitter. The list is built from a set of suggested keywords that are commonly used by hackers in their tweets. A complex network is then generated for all users to find relations among them in terms of network centrality, closeness, and betweenness. After extracting these values, a dataset of the most influential users in the hacker community is assembled. Subsequently, tweets belonging to users in the extracted dataset are gathered and classified into positive and negative classes. The output of this process is fed into a machine learning stage in which different algorithms are applied. This research builds and investigates an accurate dataset containing real users who belong to a hackers' community. Correctly classified instances were measured for accuracy using the average values of the K-nearest neighbor, Naive Bayes, Random Tree, and support vector machine techniques, demonstrating about 90% and 88% accuracy for cross-validation and percentage split, respectively. Consequently, the proposed network cyber Twitter model is able to detect hackers and determine whether tweets pose a risk to institutions and individuals, providing early warning of possible attacks.
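The network step could be sketched as follows, assuming a follower edge list has already been collected from the keyword-filtered user list; the file name and column names are hypothetical, and networkx stands in for whatever graph tooling the authors used.

import networkx as nx
import pandas as pd

edges = pd.read_csv("follower_edges.csv")       # hypothetical columns: source, target
G = nx.from_pandas_edgelist(edges, source="source", target="target")

# Centrality measures mentioned in the abstract, computed per user
features = pd.DataFrame({
    "degree": nx.degree_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
})

# The most influential users (here, top betweenness) would seed the
# tweet-collection and classification stage that follows in the paper.
influential = features.sort_values("betweenness", ascending=False).head(50)
print(influential)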


2020 ◽  
Vol 10 (24) ◽  
pp. 9151
Author(s):  
Yun-Chia Liang ◽  
Yona Maimury ◽  
Angela Hsiang-Ling Chen ◽  
Josue Rodolfo Cuevas Juarez

Air, an essential natural resource, has been compromised in terms of quality by economic activities. Considerable research has been devoted to predicting instances of poor air quality, but most studies are limited by insufficient longitudinal data, making it difficult to account for seasonal and other factors. Several prediction models have been developed using an 11-year dataset collected by Taiwan's Environmental Protection Administration (EPA). Machine learning methods, including adaptive boosting (AdaBoost), artificial neural network (ANN), random forest, stacking ensemble, and support vector machine (SVM), produce promising results for air quality index (AQI) level predictions. A series of experiments using datasets from three different regions, carried out to obtain the best prediction performance from the stacking ensemble, AdaBoost, and random forest, found that the stacking ensemble delivers consistently superior performance in terms of R2 and RMSE, while AdaBoost provides the best results for MAE.
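A stacking ensemble of this sort, evaluated with the same three metrics, can be sketched as below; the input file "epa_aqi.csv", its columns, and the choice of base and final estimators are assumptions, not the paper's configuration.

import numpy as np
import pandas as pd
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

df = pd.read_csv("epa_aqi.csv")                 # hypothetical monitoring-station records
X = df.drop(columns="aqi")
y = df["aqi"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
        ("ada", AdaBoostRegressor(random_state=42)),
        ("svr", SVR()),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)
pred = stack.predict(X_test)

print("R2  ", r2_score(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("MAE ", mean_absolute_error(y_test, pred))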


2021 ◽  
Vol 15 (1) ◽  
pp. 151-160
Author(s):  
Hemant P. Kasturiwale ◽  
Sujata N. Kale

The Autonomic Nervous System (ANS) regulates involuntary body functions, and Heart Rate Variability (HRV) can be used as a diagnostic tool to diagnose heart defects. HRV can be divided into linear and nonlinear HRV indices, which are mostly used to measure the efficiency of the model. For the prediction of cardiac diseases, the selection and extraction of features for the machine learning model are effective. The models available to date are based on HRV indices to predict cardiac diseases accurately, but they shed little light on the specifics of the indices, the selection process, and the stability of the model. The proposed model is developed considering all facets: electrocardiogram (ECG) amplitude, frequency components, sampling frequency, extraction methods, and acquisition techniques. The machine learning based model and its performance are tested using the standard BioSignal method, both on available data and on data obtained by the author. This is a unique model, developed by considering a vast number of mixture sets and more than four complex cardiac classes. The statistical analysis is performed on a variety of databases, such as MIT/BIH Normal Sinus Rhythm (NSR), MIT/BIH Arrhythmia (AR), MIT/BIH Atrial Fibrillation (AF), and a Peripheral Pulse Analyser, using feature compatibility techniques. The classifiers are trained for prediction with approximately 40,000 sets of parameters. The proposed model reaches an average accuracy of 97.87 percent and is both sensitive and precise. The best features are chosen from the different HRV features and used for classification. The model was checked under all possible subject scenarios, such as the raw database and non-ECG signals. In this sense, robustness is defined not only by the specificity parameter but also by other output measures. Support Vector Machine (SVM), K-Nearest Neighbour (KNN), and Ensemble AdaBoost (EAB) with Random Forest (RF) are tested in a 5% higher precision band and a lower band configuration. The Random Forest produced better results, and its robustness has been established.
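A loose sketch of the classification stage only is shown below, assuming HRV features have already been extracted into a file "hrv_features.csv" with a "class" label; the acquisition and feature-extraction steps described in the abstract are omitted, and none of the names come from the paper.

import pandas as pd
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("hrv_features.csv")            # linear + nonlinear HRV indices, "class" label
X = df.drop(columns="class")
y = df["class"]

classifiers = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 4))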


2021 ◽  
Vol 13 (1) ◽  
pp. 133
Author(s):  
Hao Sun ◽  
Yajing Cui

Downscaling microwave remotely sensed soil moisture (SM) is an effective way to obtain spatially continuous SM at fine resolution for hydrological and agricultural applications on a regional scale. Downscaling factors and functions are the two basic components of SM downscaling, and the former is particularly important in the era of big data. Using a machine learning method, this study evaluated Land Surface Temperature (LST), Land Surface Evaporative Efficiency (LEE), and geographical factors from Moderate Resolution Imaging Spectroradiometer (MODIS) products for downscaling SMAP (Soil Moisture Active Passive) SM products. The study spans from 2015 to the end of 2018 and is located in the central United States. Original SMAP SM and in-situ SM from sparse networks and core validation sites were used as references. Experimental results indicated that (1) LEE presented performance comparable with LST as a downscaling factor; (2) adding geographical factors can significantly improve the performance of SM downscaling; (3) integrating LST, LEE, and geographical factors gave the best performance; and (4) using Z-score normalization or hyperbolic-tangent normalization did not change the above conclusions, nor did the choice between support vector regression and feed-forward neural network methods. This study demonstrates the potential of LEE as an alternative to LST for downscaling SM when LST is unavailable due to cloud contamination. It also provides experimental evidence for adding geographical factors to the downscaling process.
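The downscaling function itself can be sketched as a regression learned at coarse resolution and applied at fine resolution; the file names, the column names ("lst", "lee", "lon", "lat", "elevation", "sm"), and the use of support vector regression here are assumptions for illustration.

import pandas as pd
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

coarse = pd.read_csv("coarse_grid.csv")         # factors aggregated to SMAP resolution + "sm"
fine = pd.read_csv("fine_grid.csv")             # same factors at the fine (MODIS) resolution

factors = ["lst", "lee", "lon", "lat", "elevation"]
model = make_pipeline(StandardScaler(), SVR())
model.fit(coarse[factors], coarse["sm"])        # learn the downscaling function at coarse scale

fine["sm_downscaled"] = model.predict(fine[factors])   # apply it on the fine grid
fine.to_csv("sm_downscaled.csv", index=False)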


Animals ◽  
2020 ◽  
Vol 10 (5) ◽  
pp. 771
Author(s):  
Toshiya Arakawa

Mammalian behavior is typically monitored by observation. However, direct observation requires a substantial amount of effort and time if the number of mammals to be observed is sufficiently large or if the observation is conducted over a prolonged period. In this study, machine learning methods such as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks were applied to detect and estimate whether a goat is in estrus based on its behavior, and the adequacy of each method was verified. Goat tracking data were obtained using a video tracking system and used to estimate whether goats in "estrus" or "non-estrus" were in either of two states: "approaching the male" or "standing near the male". Overall, the percentage concordance (PC) of the random forest appears to be the highest. However, the PC for goats whose data were not used in the training sets is relatively low, suggesting that the random forest tends to over-fit the training data. Besides the random forest, the PC of the HMMs and SVMs is high. Considering the calculation time and the HMM's advantage of being a time series model, the HMM is the better method. The PC of the neural network is low overall; however, if more goat data were acquired, the neural network could become an adequate method for estimation.
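A sketch of the HMM idea follows, assuming per-goat (x, y) tracking coordinates stored as NumPy arrays and using the third-party hmmlearn package as a stand-in for the authors' implementation; file names and the two-state interpretation are assumptions.

import numpy as np
from hmmlearn import hmm

# Hypothetical input: one (T, 2) coordinate array per goat from the video tracker
tracks = [np.load(f"goat_{i}_track.npy") for i in range(10)]
X = np.concatenate(tracks)
lengths = [len(t) for t in tracks]

# Two hidden states, interpreted as "approaching the male" / "standing near the male"
model = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=100, random_state=42)
model.fit(X, lengths)

states = model.predict(tracks[0])               # decoded behavior-state sequence for one goat
print(states[:50])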

