IoT information theft prediction using ensemble feature selection

Abstract: Recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models with Bot-IoT to detect attacks represented by dataset instances from the Information Theft category, as well as dataset instances from the data exfiltration and keylogging subcategories. Our contribution is centered on evaluating the effect of ensemble feature selection techniques (FSTs) on classification performance for these specific attack instances. A group or ensemble of FSTs will often perform better than the best individual technique. The classifiers that we use are a diverse set of four ensemble learners (LightGBM, CatBoost, XGBoost, and random forest (RF)) and four non-ensemble learners (logistic regression (LR), decision tree (DT), Naive Bayes (NB), and a multi-layer perceptron (MLP)). The metrics used for evaluating classification performance are area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPRC). For the most part, we determined that our ensemble FSTs do not affect classification performance but are beneficial because feature reduction eases computational burden and provides insight through improved data visualization.
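The abstract does not say how the outputs of the individual FSTs are combined. A minimal sketch of one common approach, assuming each technique returns a ranked feature list and a feature is kept when it appears in the top-k of at least a chosen number of rankings. The function name, the agreement rule, the feature names, and the thresholds are all illustrative assumptions, not the authors' procedure.

```python
from collections import Counter

def ensemble_select(rankings, top_k, min_votes):
    """Keep features that appear in the top_k of at least min_votes rankings."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_k])  # each ranking votes for its top_k features
    # Sort the survivors so the result is deterministic
    return sorted(f for f, n in votes.items() if n >= min_votes)

# Hypothetical rankings produced by three feature selection techniques
rankings = [
    ["bytes", "rate", "dur", "pkts", "state"],
    ["rate", "bytes", "state", "dur", "pkts"],
    ["dur", "bytes", "pkts", "rate", "state"],
]
selected = ensemble_select(rankings, top_k=3, min_votes=2)
print(selected)  # ['bytes', 'dur', 'rate']
```

Raising `min_votes` makes the ensemble stricter and yields a smaller feature set; lowering it approaches the union of the individual selections.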

Detecting web attacks using random undersampling and ensemble learners

Journal of Big Data ◽ 10.1186/s40537-021-00460-8 ◽ 2021 ◽ Vol 8 (1)

Author(s): Richard Zuech ◽ John Hancock ◽ Taghi M. Khoshgoftaar

Keyword(s): Operating Characteristic ◽ Performance Metrics ◽ Characteristic Curve ◽ Class Imbalance ◽ Classification Performance ◽ Web Attacks ◽ Random Undersampling ◽ Precision Recall Curve ◽ Research Questions ◽ Recall Curve

Abstract: Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different classifiers are employed: Decision Tree (DT), Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Naive Bayes (NB), and Logistic Regression (LR). For classification performance metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and Area Under the Precision-Recall Curve (AUPRC) are both utilized to answer the following three research questions. The first question asks: “Are various random undersampling ratios statistically different from each other in detecting web attacks?” The second question asks: “Are different classifiers statistically different from each other in detecting web attacks?” The third question asks: “Is the interaction between different classifiers and random undersampling ratios significant for detecting web attacks?” Based on our experiments, the answer to all three research questions is “Yes”. To the best of our knowledge, we are the first to apply random undersampling techniques to web attacks from the CSE-CIC-IDS2018 dataset while exploring various sampling ratios.
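The RUS ratios listed in the abstract can be illustrated with a short sketch. It assumes binary labels where 0 marks normal traffic (the majority class) and 1 marks attack traffic, and it keeps all minority instances while randomly discarding majority instances until the requested majority:minority ratio is reached. The function name and the toy data are hypothetical.

```python
import random

def random_undersample(labels, majority_parts, minority_parts, seed=0):
    """Return sorted indices after undersampling the majority class (label 0)
    so that majority:minority is about majority_parts:minority_parts."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority instances needed for the requested ratio
    target = len(minority) * majority_parts // minority_parts
    target = min(target, len(majority))  # cannot keep more than exist
    kept = rng.sample(majority, target)  # sample without replacement
    return sorted(kept + minority)

# Toy data (assumed): 1000 normal instances and 10 attack instances
labels = [0] * 1000 + [1] * 10
idx = random_undersample(labels, 9, 1)  # the 9:1 ratio from the abstract
n_attack = sum(labels[i] for i in idx)
n_normal = len(idx) - n_attack
print(n_normal, n_attack)  # 90 10
```

In practice, undersampling would be applied to the training split only, so that the test data keep the original class distribution.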

Detecting cybersecurity attacks across different network features and learners

Journal of Big Data ◽ 10.1186/s40537-021-00426-w ◽ 2021 ◽ Vol 8 (1)

Author(s): Joffrey L. Leevy ◽ John Hancock ◽ Richard Zuech ◽ Taghi M. Khoshgoftaar

Keyword(s): Feature Selection ◽ Intrusion Detection ◽ Operating Characteristic ◽ Characteristic Curve ◽ Machine Learning Algorithms ◽ Feature Selection Technique ◽ Impact Performance ◽ Detection Model ◽ Wide Range ◽ Research Questions

Abstract: Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate the effect of ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances), publicly available, modern, and covers a wide range of realistic attack types. Our contribution is centered around answers to three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of Area Under the Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM and CatBoost in terms of AUC and F1-score?” The third question is, “Does the choice of classifier (Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), CatBoost, LightGBM, or XGBoost) significantly impact performance in terms of AUC and F1-score?” These research questions are all answered in the affirmative and provide valuable, practical information for the development of an efficient intrusion detection model. To the best of our knowledge, we are the first to use an ensemble feature selection technique with the CSE-CIC-IDS2018 dataset.

Feature Selection using Genetic Programming

Zambia ICT Journal ◽ 10.33260/zictjournal.v3i2.62 ◽ 2019 ◽ Vol 3 (2) ◽ pp. 11-18

Author(s): George Mweshi

Keyword(s): Feature Selection ◽ Genetic Programming ◽ Information Gain ◽ Principal Component ◽ Search Space ◽ Classification Performance ◽ The Other ◽ Searching Strategy ◽ Mining Algorithms ◽ Feature Selection Techniques

Extracting useful and novel information from the large amount of collected data has become a necessity for corporations wishing to maintain a competitive advantage. One of the biggest issues in handling these large datasets is the curse of dimensionality. As the dimension of the data increases, the performance of the data mining algorithms employed to mine the data deteriorates. This deterioration is mainly caused by the large search space created by irrelevant, noisy, and redundant features in the data. Feature selection is one of the techniques that can be used to remove these unnecessary features. Feature selection consequently reduces the dimension of the data as well as the search space, which in turn increases the efficiency and the accuracy of the mining algorithms. In this paper, we investigate the ability of Genetic Programming (GP), an evolutionary search strategy capable of automatically finding solutions in complex and large search spaces, to perform feature selection. We implement a basic GP algorithm and perform feature selection on 5 benchmark classification datasets from the UCI repository. To test the competitiveness and feasibility of the GP approach, we examine the classification performance of four classifiers, namely J48, Naive Bayes, PART, and Random Forests, using the GP-selected features, all the original features, and the features selected by other commonly used feature selection techniques, i.e., principal component analysis, information gain, ReliefF, and CFS. The experimental results show that not only does GP select a smaller subset of the original features, but classifiers using the GP-selected features also achieve better classification performance than those using all the original features. Furthermore, compared to the other well-known feature selection techniques, GP achieves very competitive results.

Comparisons of QPFs Derived from Single- and Multicore Convection-Allowing Ensembles

Weather and Forecasting ◽ 10.1175/waf-d-19-0128.1 ◽ 2019 ◽ Vol 34 (6) ◽ pp. 1955-1964

Author(s): Adam J. Clark

Keyword(s): Operating Characteristic ◽ Characteristic Curve ◽ Skill Score ◽ Ensemble Forecast ◽ Grid Spacing ◽ Forecast System ◽ Relative Operating Characteristic ◽ Core System ◽ Testing And Evaluation ◽ Better Than

Abstract: This study compares ensemble precipitation forecasts from 10-member, 3-km grid-spacing, CONUS domain single- and multicore ensembles that were a part of the 2016 Community Leveraged Unified Ensemble (CLUE) that was run for the 2016 NOAA Hazardous Weather Testbed Spring Forecasting Experiment. The main results are that a 10-member ARW ensemble was significantly more skillful than a 10-member NMMB ensemble, and a 10-member MIX ensemble (5 ARW and 5 NMMB members) performed about the same as the 10-member ARW ensemble. Skill was measured by area under the relative operating characteristic curve (AUC) and fractions skill score (FSS). Rank histograms in the ARW ensemble were flatter than those in the NMMB ensemble, indicating that the envelope of ensemble members better encompassed observations (i.e., better reliability) in the ARW. Rank histograms in the MIX ensemble were similar to those in the ARW ensemble. In the context of NOAA’s plans for a Unified Forecast System featuring a CAM ensemble with a single core, the results are positive and indicate that it should be possible to develop a single-core system that performs as well as or better than the current operational CAM ensemble, which is known as the High-Resolution Ensemble Forecast System (HREF). However, as new modeling applications are developed and incremental changes that move HREF toward a single-core system are made possible, more thorough testing and evaluation should be conducted.

Are Body Composition Parameters Better than Conventional Anthropometric Measures in Predicting Pediatric Hypertension?

International Journal of Environmental Research and Public Health ◽ 10.3390/ijerph17165771 ◽ 2020 ◽ Vol 17 (16) ◽ pp. 5771

Author(s): Chih-Yu Hsu ◽ Rong-Ho Lin ◽ Yu-Ching Lin ◽ Jau-Yuan Chen ◽ Wen-Cheng Li ◽ ...

Keyword(s): Body Composition ◽ Receiver Operating Characteristic Curve ◽ Operating Characteristic ◽ Characteristic Curve ◽ Fat Free Mass ◽ Z Score ◽ Pediatric Hypertension ◽ Anthropometric Measures ◽ Operating Characteristic Curve ◽ Better Than

Body composition (BC) parameters are associated with cardiometabolic diseases in children; however, the importance of BC parameters for predicting pediatric hypertension is inconclusive. This cross-sectional study aimed to compare the difference in predictive values of BC parameters and conventional anthropometric measures for pediatric hypertension in school-aged children. A total of 340 children (177 girls and 163 boys) with a mean age of 8.8 ± 1.7 years and mean body mass index (BMI) z-score of 0.50 ± 1.24 were enrolled (102 hypertensive children and 238 normotensive children). Significantly higher values of anthropometric measures (BMI, BMI z-score, BMI percentile, waist-to-height ratio) and BC parameters (body-fat percentage, muscle weight, fat mass, fat-free mass) were observed among the hypertensive subgroup compared to their normotensive counterparts. A prediction model combining fat mass ≥ 3.65 kg and fat-free mass ≥ 34.65 kg (area under the receiver operating characteristic curve = 0.688; sensitivity = 66.7%; specificity = 89.9%) performed better than BMI alone (area under the receiver operating characteristic curve = 0.649; sensitivity = 55.9%; specificity = 73.9%) in predicting hypertension. In conclusion, BC parameters are better than anthropometric measures in predicting pediatric hypertension. BC measuring is a reasonable approach for risk stratification in pediatric hypertension.
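Sensitivity and specificity, as reported above, are simple ratios over the confusion counts. The sketch below shows the arithmetic. The abstract gives only the group sizes (102 hypertensive, 238 normotensive) and the percentages, so the individual counts used here are illustrative assumptions chosen to be consistent with those figures, not values reported by the study.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that the model flags as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that the model flags as negative."""
    return tn / (tn + fp)

# Assumed counts, consistent with the reported group sizes and percentages
tp, fn = 68, 34    # 102 hypertensive children
tn, fp = 214, 24   # 238 normotensive children

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```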

Feature Selection Based on Binary Tree Growth Algorithm for the Classification of Myoelectric Signals

Machines ◽ 10.3390/machines6040065 ◽ 2018 ◽ Vol 6 (4) ◽ pp. 65 ◽ Cited By ~ 4

Author(s): Jingwei Too ◽ Abdul Abdullah ◽ Norhashimah Mohd Saad ◽ Nursabillilah Mohd Ali

Keyword(s): Feature Selection ◽ Tree Growth ◽ Binary Tree ◽ Feature Vector ◽ Classification Performance ◽ Feature Reduction ◽ Feature Subset ◽ Selection Methods ◽ Time Frequency ◽ Mutation Operators

Electromyography (EMG) has been widely used in rehabilitation and myoelectric prosthetic applications. However, a recent increase in the number of EMG features has led to a high-dimensional feature vector. This in turn degrades the classification performance and increases the complexity of the recognition system. In this paper, we propose two new feature selection methods based on a tree growth algorithm (TGA) for EMG signal classification. In the first approach, two transfer functions are implemented to convert the continuous TGA into a binary version. In the second approach, swap, crossover, and mutation operators are introduced in a modified binary tree growth algorithm (MBTGA) to enhance the exploitation and exploration behaviors. In this study, the short-time Fourier transform (STFT) is employed to transform the EMG signals into a time-frequency representation. Features are then extracted from the STFT coefficients to form a feature vector. Afterward, the proposed feature selection methods are applied to find the best feature subset from the large available feature set. The experimental results show the superiority of MBTGA not only in feature reduction but also in classification performance.
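The abstract does not name the two transfer functions. As a hedged illustration of the general idea, the sketch below uses an S-shaped (sigmoid) transfer function, a common choice in binary metaheuristics, to turn a continuous position vector into a binary feature mask. The function names and the example position are assumptions for illustration only.

```python
import math
import random

def sigmoid_transfer(x):
    """S-shaped transfer function mapping a real value into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # numerically stable branch for negative inputs
    return e / (1.0 + e)

def binarize(position, rng):
    """Select feature i (bit = 1) with probability sigmoid_transfer(position[i])."""
    return [1 if rng.random() < sigmoid_transfer(x) else 0 for x in position]

rng = random.Random(42)
position = [-50.0, 0.3, 50.0]  # hypothetical continuous solution, one value per feature
mask = binarize(position, rng)
print(mask)
```

Strongly negative values almost never select a feature, strongly positive values almost always do, and values near zero behave like a coin flip.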

Sequence-Based Discovery of Antibacterial Peptides Using Ensemble Gradient Boosting

Proceedings ◽ 10.3390/proceedings2020066006 ◽ 2020 ◽ Vol 66 (1) ◽ pp. 6

Author(s): Ehdieh Khaledian ◽ Shira L. Broschat

Keyword(s): Operating Characteristic ◽ Area Under The Curve ◽ Roc Curves ◽ Gradient Boosting ◽ Support Vector ◽ Antibacterial Peptides ◽ Therapeutic Approaches ◽ Laboratory Procedures ◽ Feature Selection Techniques ◽ Better Than

Antimicrobial resistance is driving pharmaceutical companies to investigate different therapeutic approaches. One approach that has garnered growing consideration in drug development is the use of antimicrobial peptides (AMPs). Antibacterial peptides (ABPs), which occur naturally as part of the immune response, can serve as powerful, broad-spectrum antibiotics. However, conventional laboratory procedures for screening and discovering ABPs are expensive and time-consuming. Identification of ABPs can be significantly improved using computational methods. In this paper, we introduce a machine learning method for the fast and accurate prediction of ABPs. We gathered more than 6000 peptides from publicly available datasets and extracted 1209 features (peptide characteristics) from these sequences. We selected the set of optimal features by applying correlation-based and random forest feature selection techniques. Finally, we designed an ensemble gradient boosting model (GBM) to predict putative ABPs. We evaluated our model using receiver operating characteristic (ROC) curves, calculating the area under the curve (AUC) for several different models for comparison, including a recurrent neural network, a support vector machine, and iAMPpred. The AUC for the GBM was ~0.98, more than 3% better than any of the other models.
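For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as half. A minimal sketch of that pairwise definition follows; the scores and labels are made-up illustrative values, not data from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 = positive, 0 = negative
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
value = auc(scores, labels)
print(round(value, 3))  # 0.889
```

The pairwise form is quadratic in the number of instances, so it suits small examples; rank-based formulations give the same value more efficiently on large datasets.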

EV-Associated miRNAs from Peritoneal Lavage are a Source of Biomarkers in Endometrial Cancer

Cancers ◽ 10.3390/cancers11060839 ◽ 2019 ◽ Vol 11 (6) ◽ pp. 839 ◽ Cited By ~ 8

Author(s): Berta Roman-Canal ◽ Cristian Pablo Moiola ◽ Sònia Gatius ◽ Sarah Bonnin ◽ Maria Ruiz-Miró ◽ ...

Keyword(s): Endometrial Cancer ◽ Operating Characteristic ◽ Peritoneal Lavage ◽ Characteristic Curve ◽ Unmet Need ◽ Classification Performance ◽ Novel Approach ◽ Highly Sensitive ◽ Aggressive Histology ◽ Mirna Content

Endometrial cancer (EC) is the sixth most common cancer in women worldwide and is responsible for more than 89,000 deaths every year. Mortality is associated with presence of poor prognostic factors at diagnosis, i.e., diagnosis at an advanced stage, with a high grade and/or an aggressive histology. Development of novel approaches that would permit us to improve the clinical management of EC patients is an unmet need. In this study, we investigate a novel approach to identify highly sensitive and specific biomarkers of EC using extracellular vesicles (EVs) isolated from the peritoneal lavage of EC patients. EVs of peritoneal lavages of 25 EC patients were isolated and their miRNA content was compared with miRNAs of EVs isolated from the ascitic fluid of 25 control patients. Expression of the EV-associated miRNAs was measured using the Taqman OpenArray technology that allowed us to detect 371 miRNAs. The analysis showed that 114 miRNAs were significantly dysregulated in EC patients, among which eight miRNAs, miRNA-383-5p, miRNA-10b-5p, miRNA-34c-3p, miRNA-449b-5p, miRNA-34c-5p, miRNA-200b-3p, miRNA-2110, and miRNA-34b-3p, demonstrated a classification performance at area under the receiver operating characteristic curve (AUC) values above 0.9. This finding opens an avenue for the use of EV-associated miRNAs of peritoneal lavages as an untapped source of biomarkers for EC.

DETECTION OF BLOOD VESSELS IN RETINAL IMAGES

International Journal of Image and Graphics ◽ 10.1142/s0219467810003664 ◽ 2010 ◽ Vol 10 (01) ◽ pp. 57-72 ◽ Cited By ~ 3

Author(s): HEJER JLASSI ◽ KAMEL HAMROUNI

Keyword(s): Blood Vessels ◽ Anisotropic Diffusion ◽ Operating Characteristic ◽ Characteristic Curve ◽ Retinal Images ◽ Linear Filter ◽ Vascular Tree ◽ Morphological Reconstruction ◽ Processing Steps ◽ Better Than

This paper presents a method to segment blood vessels in retinal images. It is based on mathematical morphology and anisotropic diffusion and is composed of four steps: image processing using linear and morphological filters, detail extraction using the top-hat transform, morphological reconstruction of the vascular tree, and post-processing using anisotropic diffusion. Our method is tested on red-free retinal images taken from two public databases. Our results on both databases were comparable in performance with those of other authors. The method achieves a good result as measured by the receiver operating characteristic (ROC) curve. The results show that our method is significantly better than other rule-based methods.

Drug-Target Interaction Prediction Based on Multisource Information Weighted Fusion

Contrast Media & Molecular Imaging ◽ 10.1155/2021/6044256 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-10

Author(s): Shuaiqi Liu ◽ Jingjie An ◽ Jie Zhao ◽ Shuhuan Zhao ◽ Hui Lv ◽ ...

Keyword(s): Drug Target ◽ Operating Characteristic ◽ State Of The Art ◽ Characteristic Curve ◽ Relationship Matrix ◽ Target Interaction ◽ Interaction Prediction ◽ Weighted Fusion ◽ Precision Recall Curve ◽ The Relationship

Recently, in most existing studies, it is assumed that there are no interaction relationships between drugs and targets with unknown interactions. However, unknown interactions mean the relationships between drugs and targets have just not been confirmed. In this paper, samples for which the relationship between drugs and targets has not been determined are considered unlabeled. A weighted fusion method of multisource information is proposed to screen drug-target interactions. Firstly, some drug-target pairs which may have interactions are selected. Secondly, the selected drug-target pairs are added to the positive samples, which are regarded as known to have interaction relationships, and the original interaction relationship matrix is revised. Finally, the revised datasets are used to predict the interaction derived from the bipartite local model with neighbor-based interaction profile inferring (BLM-NII). Experiments demonstrate that the proposed method has greatly improved specificity, sensitivity, precision, and accuracy compared with the BLM-NII method. In addition, compared with several state-of-the-art methods, the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) of the proposed method are excellent.
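The area under the precision-recall curve is often approximated by average precision, the mean of the precision values at the rank of each true positive. The sketch below shows that approximation. It assumes scores without ties, and the example scores and labels are invented for illustration rather than taken from the paper.

```python
def average_precision(scores, labels):
    """Mean of precision values measured at the rank of each positive instance."""
    # Rank instances from highest to lowest score (assumes no tied scores)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this cut-off
    return sum(precisions) / len(precisions)

# Hypothetical interaction scores; label 1 = known drug-target interaction
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
ap = average_precision(scores, labels)
print(round(ap, 4))  # 0.9167
```

Unlike AUC, this measure ignores true negatives, which is why it is often reported alongside AUC when positives are rare.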
