ANALYSIS OF SINGLE AND ENSEMBLE MACHINE LEARNING CLASSIFIERS FOR PHISHING ATTACKS DETECTION

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, few studies have compared single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers in phishing classification. The focus is to investigate how these algorithms behave in the classification of phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as single learning classifiers, while simple voting techniques and Random Forest were used as the ensemble machine learning algorithms. Accuracy, precision, recall and F1-score were used as performance metrics. The Logistic Regression algorithm recorded 0.86 for accuracy, 0.89 for precision, 0.87 for recall and 0.81 for F1-score. Similarly, the Decision Trees classifier achieved an accuracy of 0.87, with 0.83 for precision, 0.88 for recall and 0.81 for F1-score. The voting ensemble achieved an accuracy of 0.92, with 0.90 for precision, 0.92 for recall and 0.92 for F1-score. The Random Forest algorithm recorded 0.98, 0.97, 0.98 and 0.97 for accuracy, precision, recall and F1-score respectively. In the experimental analyses, the Random Forest algorithm outperformed the simple averaging classifier and the two single algorithms used for phishing URL detection. The study established that the ensemble techniques used in the experiments are more effective for phishing URL identification than the single classifiers.
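The abstract does not say how the simple voting ensemble combines its base classifiers. As a minimal sketch, assuming hard (majority) voting over binary labels, the combination step might look like the following; the function name, the third base model, and all predictions are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by plurality."""
    combined = []
    # zip(*...) regroups the lists so each tuple holds every model's vote
    # for a single sample.
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns the most frequent label; with an odd number
        # of binary voters there can be no tie.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions (1 = phishing, 0 = legitimate), not data from the study.
logistic_regression = [1, 0, 1, 1, 0]
decision_tree = [1, 1, 0, 1, 0]
third_model = [0, 0, 1, 1, 1]  # assumed extra base model for illustration

print(majority_vote([logistic_regression, decision_tree, third_model]))
# prints [1, 0, 1, 1, 0]
```

An ensemble that averages predicted probabilities instead (soft voting) would replace the counting step with a mean over per-model scores.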

Development of an ensemble machine learning prognostic model to predict 60-day risk of major adverse cardiac events in adults with chest pain

10.1101/2021.03.08.21252615 ◽

2021 ◽

Author(s):

Chris J. Kennedy ◽

Dustin G. Mark ◽

Jie Huang ◽

Mark J. van der Laan ◽

Alan E. Hubbard ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Chest Pain ◽

Random Forest ◽

Decision Trees ◽

Low Risk ◽

Major Adverse Cardiac Events ◽

Risk Scores ◽

Cardiac Events ◽

Adverse Cardiac Events

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important to improve resource allocation and reduce over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients that are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS). 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling, and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance to key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients to be below a 0.5% 60-day MACE risk threshold. 
The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: Use of ML algorithms, combined with broad predictor sets, improved MACE risk prediction compared to simpler alternatives, while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information available in other characteristics and combined in nuanced ways via ML.
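The low-risk objective described above can be illustrated with a small sketch: given predicted risks and observed outcomes, compute the share of patients falling below the 0.5% threshold and the observed event rate within that group. The function name and all numbers below are hypothetical; this is not the authors' evaluation code.

```python
def low_risk_summary(predicted_risk, had_event, threshold=0.005):
    """Return (share of patients below threshold, observed event rate among them)."""
    # Keep the outcomes of patients whose predicted risk is under the threshold.
    low_risk_outcomes = [
        event for risk, event in zip(predicted_risk, had_event) if risk < threshold
    ]
    share = len(low_risk_outcomes) / len(predicted_risk)
    # The observed rate is undefined when nobody falls below the threshold.
    if low_risk_outcomes:
        observed_rate = sum(low_risk_outcomes) / len(low_risk_outcomes)
    else:
        observed_rate = float("nan")
    return share, observed_rate


# Hypothetical predicted 60-day risks and outcomes (1 = event occurred).
risks = [0.001, 0.004, 0.02, 0.003, 0.10, 0.006, 0.002, 0.30]
events = [0, 0, 0, 0, 1, 0, 0, 1]

share, observed_rate = low_risk_summary(risks, events)
print(share, observed_rate)  # prints 0.5 0.0
```

Comparing the observed rate in the low-risk group against the threshold is one simple way to check whether the group is as safe as the predictions suggest.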

Oropharyngeal squamous cell carcinoma: radiomic machine-learning classifiers from multiparametric MR images for determination of HPV infection status

Scientific Reports ◽

10.1038/s41598-020-74479-x ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Chong Hyun Suh ◽

Kyung Hwa Lee ◽

Young Jun Choi ◽

Sae Rom Chung ◽

Jung Hwan Baek ◽

...

Keyword(s):

Machine Learning ◽

Squamous Cell Carcinoma ◽

Logistic Regression ◽

Random Forest ◽

Cell Carcinoma ◽

Squamous Cell ◽

Oropharyngeal Squamous Cell Carcinoma ◽

Machine Learning Classifiers ◽

Learning Classifiers ◽

Hpv Status

We investigated the ability of machine-learning classifiers on radiomics from pre-treatment multiparametric magnetic resonance imaging (MRI) to accurately predict human papillomavirus (HPV) status in patients with oropharyngeal squamous cell carcinoma (OPSCC). This retrospective study collected data of 60 patients (48 HPV-positive and 12 HPV-negative) with newly diagnosed histopathologically proved OPSCC, who underwent head and neck MRIs consisting of axial T1WI, T2WI, CE-T1WI, and apparent diffusion coefficient (ADC) maps from diffusion-weighted imaging (DWI). The median age was 59 years (the range being 35 to 85 years), and 83.3% of patients were male. The imaging data were randomised into a training set (32 HPV-positive and 8 HPV-negative OPSCC) and a test set (16 HPV-positive and 4 HPV-negative OPSCC) in each fold. 1618 quantitative features were extracted from manually delineated regions-of-interest of primary tumour and one definite lymph node in each sequence. After feature selection by using the least absolute shrinkage and selection operator (LASSO), three different machine-learning classifiers (logistic regression, random forest, and XG boost) were trained and compared in the setting of various combinations between four sequences. The highest diagnostic accuracies were achieved when using all sequences, and the difference was significant only when the combination did not include the ADC map. Using all sequences, logistic regression and the random forest classifier yielded higher accuracy compared with that of the XG boost classifier, with mean area under curve (AUC) values of 0.77, 0.76, and 0.71, respectively. The machine-learning classifier of non-invasive and quantitative radiomics signature could guide the classification of the HPV status.

Statistical and machine learning models for classification of human wear and delivery days in accelerometry data

10.1101/2020.12.31.424867 ◽

2021 ◽

Author(s):

Ryan Moore ◽

Kristin R. Archer ◽

Leena Choi

Keyword(s):

Neural Network ◽

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Human Activity ◽

Recurrent Neural Network ◽

Learning Models ◽

Learning Context ◽

Machine Learning Models

Purpose: Accelerometers are increasingly utilized in healthcare research to assess human activity. Accelerometry data are often collected by mailing accelerometers to participants, who wear the accelerometers to collect data on their activity. The devices are then mailed back to the laboratory for analysis. We develop models to classify days in accelerometry data as activity from actual human wear or the delivery process. These models can be used to automate the cleaning of accelerometry datasets that are adulterated with activity from delivery. Methods: For the classification of delivery days in accelerometry data, we developed statistical and machine learning models in a supervised learning context using a large human activity and delivery labeled accelerometry dataset. We extracted several features, which were included to develop random forest, logistic regression, mixed effects regression, and multilayer perceptron models, while convolutional neural network, recurrent neural network, and hybrid convolutional recurrent neural network models were developed without feature extraction. Model performances were assessed using Monte Carlo cross-validation. Results: We found that a hybrid convolutional recurrent neural network performed best in the classification task with an F1 score of 0.960 but simpler models such as logistic regression and random forest also had excellent performance with F1 scores of 0.951 and 0.957, respectively. Conclusion: The models developed in this study can be used to classify days in accelerometry data as either human or delivery activity. An analyst can weigh the larger computational cost and greater performance of the convolutional recurrent neural network against the faster but slightly less powerful random forest or logistic regression. The best performing models for classification of delivery data are publicly available on the open source R package, PhysicalActivity.
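Monte Carlo cross-validation, the assessment method named above, repeatedly draws a random train/test split and averages the resulting scores. The sketch below shows the general idea only; the split fraction, repeat count, toy threshold model, and data are assumptions for illustration and do not reproduce the study's setup.

```python
import random
from statistics import mean


def monte_carlo_cv(samples, fit_and_score, n_repeats=100, test_fraction=0.2, seed=0):
    """Average a score over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, round(len(samples) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        shuffled = samples[:]  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(fit_and_score(train, test))
    return mean(scores)


def threshold_accuracy(train, test):
    """Toy stand-in for a real model: cut at the midpoint of the two class means."""
    mean_0 = mean(x for x, label in train if label == 0)
    mean_1 = mean(x for x, label in train if label == 1)
    cut = (mean_0 + mean_1) / 2
    return mean(1.0 if (x > cut) == (label == 1) else 0.0 for x, label in test)


# Hypothetical, well-separated data: (feature, label) pairs with 1 = delivery day.
data = [(i / 10, 0) for i in range(10)] + [(5 + i / 10, 1) for i in range(10)]
print(monte_carlo_cv(data, threshold_accuracy))  # prints 1.0
```

Unlike k-fold cross-validation, the random splits may overlap between repeats, so a sample can appear in several test sets or in none.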

A Study of the Classification of Motor Imagery Signals using Machine Learning Tools

10.5121/csit.2021.112104 ◽

2021 ◽

Author(s):

Anam Hashmi ◽

Bilal Alam Khan ◽

Omar Farooq

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Wavelet Transform ◽

Random Forest ◽

Random Forest Algorithm ◽

Eeg Signals ◽

Relaxation State ◽

Wavelet Transform Analysis ◽

Imagined Movement

In this paper, we propose a system for classifying Electroencephalography (EEG) signals associated with imagined movement of the right hand and a relaxation state using a machine learning algorithm, namely the Random Forest algorithm. The EEG dataset used in this research was created by the University of Tubingen, Germany. EEG signals associated with the imagined movement of the right hand and the relaxation state were processed using wavelet transform analysis with the Daubechies orthogonal wavelet as the mother wavelet. After the wavelet transform analysis, eight features were extracted. Subsequently, a feature selection method based on the Random Forest algorithm was employed to identify the best of the eight proposed features. The feature selection stage was followed by a classification stage in which eight different models combining the features according to their importance were constructed. The optimum classification performance of 85.41% was achieved with the Random Forest classifier. This research shows that this system for classifying motor movements can be used in a Brain Computer Interface (BCI) system to mentally control a robotic device or an exoskeleton.

Classification of iron oxide aerosols by a single particle soot photometer using supervised machine learning

Atmospheric Measurement Techniques ◽

10.5194/amt-12-3885-2019 ◽

2019 ◽

Vol 12 (7) ◽

pp. 3885-3906 ◽

Cited By ~ 2

Author(s):

Kara D. Lamb

Keyword(s):

Machine Learning ◽

Random Forest ◽

Test Data ◽

Single Particle ◽

Broad Band ◽

Supervised Machine Learning ◽

Data Sets ◽

Specific Class ◽

Random Forest Algorithm

Abstract. Single particle soot photometers (SP2) use laser-induced incandescence to detect aerosols on a single particle basis. SP2s that have been modified to provide greater spectral contrast between their narrow and broad-band incandescent detectors have previously been used to characterize both refractory black carbon (rBC) and light-absorbing metallic aerosols, including iron oxides (FeOx). However, single particles cannot be unambiguously identified from their incandescent peak height (a function of particle mass) and color ratio (a measure of blackbody temperature) alone. Machine learning offers a promising approach for improving the classification of these aerosols. Here we explore the advantages and limitations of classifying single particle signals obtained with a modified SP2 using a supervised machine learning algorithm. Laboratory samples of different aerosols that incandesce in the SP2 (fullerene soot, mineral dust, volcanic ash, coal fly ash, Fe2O3, and Fe3O4) were used to train a random forest algorithm. The trained algorithm was then applied to test data sets of laboratory samples and atmospheric aerosols. This method provides a systematic approach for classifying incandescent aerosols by providing a score, or conditional probability, that a particle is likely to belong to a particular aerosol class (rBC, FeOx, etc.) given its observed single particle features. We consider two alternative approaches for identifying aerosols in mixed populations based on their single particle SP2 response: one with specific class labels for each species sampled, and one with three broader classes (rBC, anthropogenic FeOx, and dust-like) for particles with similar SP2 responses. 
Predictions of the most likely particle class (the one with the highest mean probability) based on applying the trained random forest algorithm to the single particle features for test data sets comprising examples of each class are compared with the true class for those particles to estimate generalization performance. While the specific class approach performed well for rBC and Fe3O4 (≥99 % of these aerosols are correctly identified), its classification of other aerosol types is significantly worse (only 47 %–66 % of other particles are correctly identified). Using the broader class approach, we find a classification accuracy of 99 % for FeOx samples measured in the laboratory. The method allows for classification of FeOx as anthropogenic or dust-like for aerosols with effective spherical diameters from 170 to >1200 nm. The misidentification of both dust-like aerosols and rBC as anthropogenic FeOx is small, with <3 % of the dust-like aerosols and <0.1 % of rBC misidentified as FeOx for the broader class case. When applying this method to atmospheric observations taken in Boulder, CO, a clear mode consistent with FeOx was observed, distinct from dust-like aerosols.
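The prediction rule described above, choosing the class with the highest mean probability, can be sketched as follows. This assumes each tree in the forest yields a probability per class; the class names follow the three broader classes in the abstract, while the function name and probability values are hypothetical.

```python
def most_likely_class(tree_probabilities):
    """Average per-tree class probabilities and return (best class, mean probabilities)."""
    n_trees = len(tree_probabilities)
    classes = tree_probabilities[0].keys()
    # Mean probability of each class across all trees.
    mean_probability = {
        name: sum(tree[name] for tree in tree_probabilities) / n_trees
        for name in classes
    }
    best = max(mean_probability, key=mean_probability.get)
    return best, mean_probability


# Hypothetical per-tree probabilities for one particle (not measured data).
trees = [
    {"rBC": 0.1, "anthropogenic FeOx": 0.7, "dust-like": 0.2},
    {"rBC": 0.2, "anthropogenic FeOx": 0.5, "dust-like": 0.3},
    {"rBC": 0.0, "anthropogenic FeOx": 0.9, "dust-like": 0.1},
]

label, scores = most_likely_class(trees)
print(label)  # prints anthropogenic FeOx
```

The mean probabilities double as the per-class score mentioned in the abstract, so a low maximum can flag particles whose class is ambiguous.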

Classification of Phishing Email Using Random Forest Machine Learning Technique

Journal of Applied Mathematics ◽

10.1155/2014/425731 ◽

2014 ◽

Vol 2014 ◽

pp. 1-6 ◽

Cited By ~ 40

Author(s):

Andronicus A. Akinyelu ◽

Aderemi O. Adewumi

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithm ◽

False Negative ◽

Machine Learning Algorithm ◽

Detection Techniques ◽

Phishing Attacks ◽

Learning Technique ◽

Phishing Detection

Phishing is one of the major challenges faced by the world of e-commerce today. Owing to phishing attacks, billions of dollars have been lost by many companies and individuals. In 2012, an online report put the loss due to phishing attacks at about $1.5 billion. The global impact of phishing attacks will continue to increase, and more efficient phishing detection techniques are therefore required to curb the menace. This paper investigates and reports the use of the random forest machine learning algorithm in the classification of phishing attacks, with the major objective of developing an improved phishing email classifier with better prediction accuracy and fewer features. From a dataset consisting of 2000 phishing and ham emails, a set of prominent phishing email features (identified from the literature) were extracted and used by the machine learning algorithm, with a resulting classification accuracy of 99.7% and low false negative (FN) and false positive (FP) rates.

Machine Learning Approaches for Auto Insurance Big Data

Risks ◽

10.3390/risks9020042 ◽

2021 ◽

Vol 9 (2) ◽

pp. 42 ◽

Cited By ~ 1

Author(s):

Mohamed Hanafy ◽

Ruixing Ming

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Big Data ◽

Random Forest ◽

Decision Trees ◽

Customer Service ◽

Learning Approaches ◽

Auto Insurance ◽

New Methods ◽

Better Than

The growing trend in the number and severity of auto insurance claims creates a need for new methods to efficiently handle these claims. Machine learning (ML) is one of the methods that can address this problem. As car insurers aim to improve their customer service, these companies have started adopting and applying ML to enhance the interpretation and comprehension of their data for efficiency, thus improving their customer service through a better understanding of their customers' needs. This study considers how automotive insurance providers incorporate machine learning in their companies, and explores how ML models can apply to insurance big data. We utilize various ML methods, such as logistic regression, XGBoost, random forest, decision trees, naïve Bayes, and K-NN, to predict claim occurrence. Furthermore, we evaluate and compare these models’ performances. The results showed that RF is better than the other methods, with accuracy, kappa, and AUC values of 0.8677, 0.7117, and 0.840, respectively.
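Kappa, one of the metrics reported above, corrects raw accuracy for the agreement expected by chance. A minimal sketch of Cohen's kappa for two label lists is given below; it assumes the reported kappa is Cohen's kappa, and the function name and labels are hypothetical rather than drawn from the insurance dataset.

```python
from collections import Counter


def cohens_kappa(actual, predicted):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts, predicted_counts = Counter(actual), Counter(predicted)
    # Chance agreement: probability that both label sources pick the same
    # class independently, given their marginal class frequencies.
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Hypothetical labels (1 = claim occurred).
actual = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(cohens_kappa(actual, predicted))  # prints 0.5
```

Here the accuracy is 0.75 and the chance agreement is 0.5, so kappa is 0.25 / 0.5 = 0.5.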

Phishing detection system using machine learning classifiers

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v17.i3.pp1165-1171 ◽

2020 ◽

Vol 17 (3) ◽

pp. 1165

Author(s):

Nur Sholihah Zaini ◽

Deris Stiawan ◽

Mohd Faizal Ab Razak ◽

Ahmad Firdaus ◽

Wan Isni Sofiah Wan Din ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Detection System ◽

True Positive Rate ◽

Detection Accuracy ◽

Learning Classifier ◽

Learning Classifiers ◽

Phishing Attacks ◽

Positive Rate ◽

Website Features

With the increasing development of the Internet, more and more applications are deployed on websites that can be accessed directly through the network. This development has attracted attackers who use phishing websites to compromise computer systems. Several solutions have been proposed to detect phishing attacks. However, there is still room for improvement in tackling this threat. This paper aims to investigate and evaluate the effectiveness of the machine learning approach in the classification of phishing attacks. It applies a heuristic approach with machine learning classifiers to identify phishing attacks in website applications. The study compares five classifiers to find the best machine learning classifier for detecting phishing attacks. It demonstrates that random forest is able to achieve high detection accuracy, with a true positive rate of 94.79% using website features. The results indicate that random forest is an effective classifier for detecting phishing attacks.

Classification of Neurodegenerative Disease Stages using Ensemble Machine Learning Classifiers

Procedia Computer Science ◽

10.1016/j.procs.2020.01.071 ◽

2019 ◽

Vol 165 ◽

pp. 66-73 ◽

Cited By ~ 1

Author(s):

M. Rohini ◽

D. Surendran

Keyword(s):

Machine Learning ◽

Neurodegenerative Disease ◽

Machine Learning Classifiers ◽

Learning Classifiers ◽

Ensemble Machine Learning ◽

Disease Stages

Price Prediction for Pre-Owned Cars Using Ensemble Machine Learning Techniques

10.3233/apc210194 ◽

2021 ◽

Author(s):

Chetna Longani ◽

Sai Prasad Potharaju ◽

Sandhya Deore

Keyword(s):

Machine Learning ◽

Random Forest ◽

Mean Squared Error ◽

Machine Learning Techniques ◽

Random Forest Algorithm ◽

Fair Price ◽

Ensemble Machine Learning ◽

Comparable Performance ◽

Used Car ◽

Used Cars

Pre-owned cars, or so-called used cars, have large markets across the globe. Before acquiring a used car, the buyer should be able to decide whether the price affixed to the car is genuine. Several facets, including mileage, year, model, make, run and many more, need to be considered before purchasing any pre-owned car. Both the seller and the buyer should have a fair deal. This paper presents a system that has been implemented to predict a fair price for any pre-owned car. The system works well to anticipate the price of used cars for the Mumbai region. Ensemble techniques in machine learning, namely the Random Forest algorithm and eXtreme Gradient Boost, are deployed to develop models that can predict an appropriate price for used cars. The techniques are compared so as to determine the optimal one. Both methods provided comparable performance, with eXtreme Gradient Boost outperforming the random forest algorithm. The Root Mean Squared Error was 3.44 for random forest and 0.53 for eXtreme Gradient Boost.
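The two models above are compared by Root Mean Squared Error, which is the square root of the mean squared difference between actual and predicted prices. A minimal sketch follows; the function name and values are hypothetical, and the abstract does not state the price units.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted prices (arbitrary units).
actual_prices = [3.0, 5.0, 2.5]
predicted_prices = [2.0, 5.0, 4.5]
print(round(rmse(actual_prices, predicted_prices), 3))  # prints 1.291
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones, and lower values indicate a better fit.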
