Comparing Swarm Intelligence Algorithms for Dimension Reduction in Machine Learning

High-dimensional data causes a variety of problems in machine learning. It is necessary to reduce the number of features by selecting only the most relevant ones. Approaches to this task are collectively called Feature Selection. In this paper, we propose a Feature Selection method that uses Swarm Intelligence techniques. Swarm Intelligence algorithms perform optimization by searching for optimal points in the search space. We show the usability of these techniques for solving Feature Selection and compare the performance of five major swarm algorithms: Particle Swarm Optimization, Artificial Bee Colony (ABC), Invasive Weed Optimization, Bat Algorithm, and Grey Wolf Optimizer (GWO). The accuracy of a decision tree classifier was used to evaluate the algorithms. The dimension of the data could be reduced by about half without a loss in accuracy; moreover, accuracy increased when redundant features were dropped. In our experiments GWO performed best: it has the highest ranking across datasets, and it needs 30.8 iterations on average to find its best solution. ABC obtained the lowest ranking on high-dimensional datasets.
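As a rough illustration of swarm-based feature selection, the sketch below implements a small binary Particle Swarm Optimization over 0/1 feature masks. It is not the paper's implementation: the inertia and attraction weights, the velocity clamp, and `toy_fitness` (a hypothetical stand-in for decision tree accuracy) are all assumptions made for the example.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=10, n_iters=30, seed=0):
    """Search for a 0/1 feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and attraction weights (assumed values)

    # The first particle keeps every feature, so the result is never worse
    # than using the full feature set.
    positions = [[1] * n_features]
    positions += [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(n_particles - 1)]
    velocities = [[0.0] * n_features for _ in range(n_particles)]

    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * velocities[i][d]
                     + c1 * r1 * (pbest[i][d] - positions[i][d])
                     + c2 * r2 * (gbest[d] - positions[i][d]))
                v = max(-6.0, min(6.0, v))  # clamp to keep the sigmoid well-behaved
                velocities[i][d] = v
                # The sigmoid of the velocity is the probability of selecting the feature.
                prob = 1.0 / (1.0 + math.exp(-v))
                positions[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(positions[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit
    return gbest, gbest_fit


def toy_fitness(mask):
    """Hypothetical score: features 0 and 2 help, every selected feature costs 0.1."""
    relevant = {0, 2}
    gain = sum(1.0 for i, bit in enumerate(mask) if bit and i in relevant)
    return gain - 0.1 * sum(mask)


best_mask, best_fit = binary_pso(toy_fitness, n_features=6)
```

In a real experiment the fitness function would train and evaluate a classifier on the masked feature set; the toy score only keeps the example self-contained.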

A Novel Feature Selection and Short-Term Price Forecasting Based on a Decision Tree (J48) Model

Energies ◽

10.3390/en12193665 ◽

2019 ◽

Vol 12 (19) ◽

pp. 3665

Author(s):

Ankit Kumar Srivastava ◽

Devender Singh ◽

Ajay Shekhar Pandey ◽

Tarun Maini

Keyword(s):

Feature Selection ◽

Decision Tree ◽

Performance Test ◽

Feature Selection Method ◽

Forecast Accuracy ◽

Price Forecasting ◽

Decision Tree Classifier ◽

Short Term ◽

Tree Classifier ◽

Minimum Number

A novel feature selection method based on a decision tree (J48) for price forecasting is proposed in this work. The method uses a genetic algorithm along with a decision tree classifier to obtain the minimum number of features that gives an optimum forecast accuracy. The usefulness of the proposed approach is established through a performance test of the forecaster using the features selected by this approach. The forecast with the selected features consistently outperformed the forecast using a larger feature set.
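A genetic algorithm searching for a small feature subset might be sketched as follows. This is only an illustration under stated assumptions: the population size, one-point crossover, mutation rate and the `toy_forecast_score` function (a hypothetical replacement for the decision tree's forecast accuracy, with a small penalty per selected feature) are not taken from the paper.

```python
import random


def ga_select(fitness, n_features, pop_size=12, n_gens=25, p_mut=0.1, seed=0):
    """Evolve 0/1 feature masks; the better half survives each generation."""
    rng = random.Random(seed)
    # Seed the population with the full feature set plus random masks.
    pop = [[1] * n_features]
    pop += [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop_size - 1)]
    for _ in range(n_gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with probability p_mut.
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = elite + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def toy_forecast_score(mask):
    """Hypothetical accuracy that depends only on features 1 and 3."""
    accuracy = 0.6 + 0.15 * mask[1] + 0.15 * mask[3]
    return accuracy - 0.01 * sum(mask)  # small penalty per selected feature


best_mask, best_score = ga_select(toy_forecast_score, n_features=5)
```

The per-feature penalty is one simple way to push the search toward the minimum number of features; other formulations are possible.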

Exploring permissions in android applications using ensemble-based extra tree feature selection

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v19.i1.pp543-552 ◽

2020 ◽

Vol 19 (1) ◽

pp. 543

Author(s):

Howida Abuabker Alkaaf ◽

Aida Ali ◽

Siti Mariyam Shamsuddin ◽

Shafaatunnur Hassan

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Feature Selection ◽

False Positive Rate ◽

Feature Selection Method ◽

Support Vector ◽

Multilayer Perceptrons ◽

User Privacy ◽

True Negative ◽

Tree Classifier

The rapid development and growing use of mobile apps has increased the risk of user privacy being exploited. One Android security mechanism is permission control, which restricts the access of apps to core facilities of the device. However, permissions can be exploited by attackers when certain combinations of permissions are granted. The aim of this paper is therefore to explore the patterns of malware apps by analyzing permissions, proposing a framework that combines feature selection based on an ensemble extra tree classifier with machine learning classifiers. The dataset had 25458 samples (8643 malware apps and 16815 benign apps) with 173 features. Three datasets with 25458 samples and 5, 10 and 20 features respectively were generated using the proposed feature selection method. All datasets were fed to Support Vector Machine (SVM), K Neighbors Classifier, Decision Tree, Naïve Bayes and Multilayer Perceptron (MLP) classifiers. The classifier models were evaluated using true negative rate (TNR), false positive rate (FPR) and accuracy. The experimental results showed that the SVM and K Neighbors classifiers with 20 features achieved the highest accuracy, 94 %, and a TNR of 89 % was obtained with the K Neighbors Classifier. The FPR dropped to 0.001 using 5 features with the SVM and MLP classifiers. The results indicate that reducing permission features improved classification performance and reduced computational overhead.
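The evaluation metrics named above (true negative rate, false positive rate and accuracy) follow directly from confusion-matrix counts. The sketch below assumes malware is the positive class; the counts in the example are made up for illustration and are not results from the paper.

```python
def confusion_rates(tp, tn, fp, fn):
    """Return TNR, FPR and accuracy from confusion-matrix counts."""
    tnr = tn / (tn + fp)  # share of benign apps correctly passed
    fpr = fp / (fp + tn)  # share of benign apps wrongly flagged; equals 1 - TNR
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"tnr": tnr, "fpr": fpr, "accuracy": accuracy}


# Hypothetical counts for 100 apps.
rates = confusion_rates(tp=40, tn=45, fp=5, fn=10)
```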

Machine Learning for Analyzing Non-Countermeasure Factors Affecting Early Spread of COVID-19

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18136750 ◽

2021 ◽

Vol 18 (13) ◽

pp. 6750

Author(s):

Vito Janko ◽

Gašper Slapničar ◽

Erik Dovgan ◽

Nina Reščič ◽

Tine Kolenik ◽

...

Keyword(s):

Machine Learning ◽

Factor Analysis ◽

Feature Selection ◽

Rule Discovery ◽

Decision Tree Classifier ◽

Factors Affecting ◽

Similar Work ◽

Tree Classifier ◽

Early Phases ◽

Relevant Factors

The COVID-19 pandemic affected the whole world, but not all countries were impacted equally. This opens the question of what factors can explain the initial faster spread in some countries compared to others. Many such factors are overshadowed by the effect of the countermeasures, so we studied the early phases of the infection when countermeasures had not yet taken place. We collected the most diverse dataset of potentially relevant factors and infection metrics to date for this task. Using it, we show the importance of different factors and factor categories as determined by both statistical methods and machine learning (ML) feature selection (FS) approaches. Factors related to culture (e.g., individualism, openness), development, and travel proved the most important. A more thorough factor analysis was then made using a novel rule discovery algorithm. We also show how interconnected these factors are and caution against relying on ML analysis in isolation. Importantly, we explore potential pitfalls found in the methodology of similar work and demonstrate their impact on COVID-19 data analysis. Our best models using the decision tree classifier can predict the infection class with roughly 80% accuracy.

Interpretable machine learning approach for predicting COVID-19 risk status of an individual

Transactions on Networks and Communications ◽

10.14738/tnc.92.9760 ◽

2021 ◽

Vol 9 (2) ◽

pp. 1-14

Author(s):

Anthony Onoja ◽

Mary Oyinlade Ejiwale ◽

Ayesan Rewane

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Decision Tree ◽

Correlation Coefficient ◽

Pearson Correlation ◽

Accuracy Score ◽

Decision Tree Classifier ◽

Risk Status ◽

Interpretable Machine Learning ◽

Tree Classifier

This study aimed to ascertain, using statistical feature selection methods and interpretable machine learning models, the best features that predict risk status (“Low”, “Medium”, “High”) for COVID-19 infection. The study uses a publicly available dataset obtained via an online web-based risk assessment calculator for the risk status of COVID-19 infection. The 59 features were first filtered for multicollinearity using the Pearson correlation coefficient, leaving 57, and were further reduced to 55 features by the LASSO GLM approach. The SMOTE resampling technique was used to address the problem of imbalanced class distribution. Interpretable ML algorithms were employed during the classification phase. The best classifier predictions were saved as a new instance and perturbed using a single Decision tree classifier. To further build trust in and explainability of the best model, the XGBoost classifier was used as a global surrogate model trained on the predictions of the best model. Individual XGBoost explanations were produced using the SHAP explainable-AI framework. The Random Forest classifier, with a validation accuracy score of 96.35 % on the 55 features retained by feature selection, emerged as the best classifier model. The decision tree classifier approximated the best classifier with a prediction accuracy score of 92.23 % and a Matthews correlation coefficient of 0.8960. The XGBoost classifier approximated the best classifier model with a prediction score of 99.7 %. This study identified COVID-19 positive, COVID-19 contacts, COVID-19 symptoms, Health workers, and Public transport count as the five most consistent features that predict an individual’s risk exposure to COVID-19.
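Filtering features for multicollinearity with the Pearson correlation coefficient can be sketched as below. The greedy keep-first strategy, the 0.9 threshold and the column names are assumptions made for illustration; the study does not state its exact procedure here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: "b" is an exact multiple of "a" and is therefore dropped.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]}
kept = drop_correlated(columns)
```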

Feature Selection Algorithm for High-dimensional Biomedical Data Using Information Gain and Improved Chemical Reaction Optimization

Current Bioinformatics ◽

10.2174/1574893615666200204154358 ◽

2021 ◽

Vol 15 (8) ◽

pp. 912-926

Author(s):

Ge Zhang ◽

Pan Yu ◽

Jianlin Wang ◽

Chaokun Yan

Keyword(s):

Feature Selection ◽

Chemical Reaction ◽

Information Gain ◽

Feature Selection Method ◽

Search Space ◽

Neighborhood Search ◽

Biomedical Data ◽

Chemical Reaction Optimization ◽

Search Mechanism ◽

Reaction Optimization

Background: There have been rapid developments in various bioinformatics technologies, which have led to the accumulation of a large amount of biomedical data. However, these datasets usually involve thousands of features and include much irrelevant or redundant information, which leads to confusion during diagnosis. Feature selection is a solution that consists of finding the optimal subset, which is known to be an NP problem because of the large search space. Objective: To address this issue, this paper proposes a hybrid feature selection method, called IGICRO, based on an improved chemical reaction optimization algorithm (ICRO) and an information gain (IG) approach. Methods: IG is adopted to obtain some important features. The neighborhood search mechanism is combined with ICRO to increase the diversity of the population and improve the capacity of local search. Results: Experimental results on eight publicly available data sets demonstrate that our proposed approach outperforms the original CRO and other state-of-the-art approaches.
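Information gain for a single discrete feature is the entropy of the class labels minus the entropy remaining after splitting on that feature. The sketch below shows this standard definition on a tiny made-up example; how IGICRO thresholds or ranks the resulting scores is not described in the abstract and is not modelled here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of labels minus the weighted entropy within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = [0, 0, 1, 1]
ig_informative = information_gain(["a", "a", "b", "b"], labels)  # splits classes perfectly
ig_irrelevant = information_gain(["a", "b", "a", "b"], labels)   # tells nothing about the class
```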

An Improved Machine Learning-Based Employees Attrition Prediction Framework with Emphasis on Feature Selection

Mathematics ◽

10.3390/math9111226 ◽

2021 ◽

Vol 9 (11) ◽

pp. 1226

Author(s):

Saeed Najafi-Zangeneh ◽

Naser Shams-Gharneh ◽

Ali Arjomandi-Nezhad ◽

Sarfaraz Hashemkhani Zolfani

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Standard Deviation ◽

Analytical Formula ◽

Feature Selection Method ◽

Selection Method ◽

Performance Measure ◽

Learning Approaches ◽

Training Costs ◽

Professional Employees

Companies always seek ways to make their professional employees stay with them to reduce extra recruiting and training costs. Predicting whether a particular employee may leave or not will help the company to make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula. Therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the “max-out” feature selection method is proposed for dimension reduction in the pre-processing stage. This method is implemented for the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of the feature in attrition prediction. The results show improvement in the F1-score performance measure due to the “max-out” feature selection method. Finally, the validity of parameters is checked by training the model for multiple bootstrap datasets. Then, the average and standard deviation of parameters are analyzed to check the confidence value of the model’s parameters and their stability. The small standard deviation of parameters indicates that the model is stable and is more likely to generalize well.
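The bootstrap stability check described above can be illustrated with a much simpler model. In the sketch below the fitted parameter is the slope of a one-dimensional least-squares line rather than a logistic regression coefficient; that substitution, the number of resamples and the data are assumptions made to keep the example short and dependency-free.

```python
import random
import statistics


def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def bootstrap_slopes(xs, ys, n_boot=200, seed=0):
    """Refit the slope on resampled data; return its mean and standard deviation."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:  # skip degenerate resamples with a single x value
            continue
        estimates.append(slope(bx, [ys[i] for i in idx]))
    return statistics.fmean(estimates), statistics.pstdev(estimates)


# Hypothetical noise-free data on the line y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
mean_slope, std_slope = bootstrap_slopes(xs, ys)
```

With noisy data the standard deviation would be larger; a small value relative to the mean is what the abstract reads as a sign of a stable parameter.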

NIMG-46. RADIOGENOMIC FEATURES PREDICT CLINICALLY RELEVANT GENOME-WIDE ALTERATION SIGNATURES IN GLIOBLASTOMA

Neuro-Oncology ◽

10.1093/neuonc/noaa215.659 ◽

2020 ◽

Vol 22 (Supplement_2) ◽

pp. ii158-ii158

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

James Fink ◽

David Haynor ◽

Eric Holland ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Selection Method ◽

Versus Group ◽

Mri Features ◽

Genome Wide ◽

Group 2 ◽

Group 1

Background: Previously, we have shown that combined whole-exome sequencing (WES) and genome-wide somatic copy number alteration (SCNA) information can separate IDH1/2-wildtype glioblastoma into two prognostic molecular subtypes (Group 1 and Group 2) and that these subtypes cannot be distinguished by epigenetic or clinical features. However, the potential for radiographic features to discriminate between these molecular subtypes has not been established. Methods: Radiogenomic features (n=35,400) were extracted from 46 multiparametric, pre-operative magnetic resonance imaging (MRI) of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive, all of whom have corresponding WES and SCNA data in The Cancer Genome Atlas. We developed a novel feature selection method that leverages the structure of extracted radiogenomic MRI features to mitigate the dimensionality challenge posed by the disparity between the number of features and patients in our cohort. Seven traditional machine learning classifiers were trained to distinguish Group 1 versus Group 2 using our feature selection method. Our feature selection was compared to lasso feature selection, recursive feature elimination, and variance thresholding. Results: We are able to classify Group 1 versus Group 2 glioblastomas with a cross-validated area under the curve (AUC) score of 0.82 using ridge logistic regression and our proposed feature selection method, which reduces the size of our feature set from 35,400 to 288. An interrogation of the selected features suggests that features describing contours in the T2 abnormality region on the FLAIR MRI modality may best distinguish these two groups from one another. Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups. This algorithm may be applied to future prospective studies to assess the utility of MRI as a surrogate for costly prognostic genomic studies.
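The area under the ROC curve reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a single set of predictions; the scores and labels are made up, and the cross-validation used in the study (averaging over folds) is not reproduced.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for two negatives and two positives.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```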

Radiogenomic modeling predicts survival-associated prognostic groups in glioblastoma

Neuro-Oncology Advances ◽

10.1093/noajnl/vdab004 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Nicholas Nuechterlein ◽

Beibin Li ◽

Abdullah Feroze ◽

Eric C Holland ◽

Linda Shapiro ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Molecular Subtypes ◽

Feature Selection Method ◽

Area Under The Curve ◽

Selection Method ◽

Recursive Feature Elimination ◽

Signal Abnormality ◽

Mri Features ◽

Mri Scans

Background: Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods: Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between a large number of features and the limited patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results: We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly-defined risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
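Of the baseline methods named above, variance thresholding is the simplest: features whose values barely change across samples are discarded. The sketch below is a minimal version of that baseline only, not the authors' proposed method; the column names, values and threshold are hypothetical.

```python
import statistics


def variance_threshold(columns, threshold=0.0):
    """Return the names of columns whose population variance exceeds threshold."""
    return [name for name, values in columns.items()
            if statistics.pvariance(values) > threshold]


# Hypothetical feature columns: one constant, one varying, one nearly constant.
columns = {
    "const": [1, 1, 1, 1],
    "f1": [0, 1, 0, 1],
    "f2": [5.0, 5.0, 5.0, 5.1],
}
kept_strict = variance_threshold(columns, threshold=0.01)
kept_loose = variance_threshold(columns)
```

Because the filter ignores the class labels entirely, it can only remove uninformative columns, which is one reason it serves as a baseline rather than a complete selection method.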

Optimization of Physical Activity Recognition for Real-Time Wearable Systems: Effect of Window Length, Sampling Frequency and Number of Features

Applied Sciences ◽

10.3390/app9224833 ◽

2019 ◽

Vol 9 (22) ◽

pp. 4833 ◽

Cited By ~ 3

Author(s):

Ardo Allik ◽

Kristjan Pilt ◽

Deniss Karai ◽

Ivo Fridolin ◽

Mairo Leier ◽

...

Keyword(s):

Physical Activity ◽

Feature Selection ◽

Real Time ◽

Sampling Frequency ◽

Classification Performance ◽

Window Length ◽

Decision Tree Classifier ◽

Acceleration Signal ◽

Wearable Systems ◽

Tree Classifier

The aim of this study was to develop an optimized physical activity classifier for real-time wearable systems, with a focus on reducing the requirements on device power consumption and memory buffer. The classification parameters evaluated in this study were the sampling frequency of the acceleration signal, the window length of the classification fragment, and the number of classification features, found with different feature selection methods. For parameter evaluation, a decision tree classifier was created based on the acceleration signals recorded during tests in which 25 healthy test subjects performed various physical activities. The overall average F1-score achieved in this study was about 0.90. Similar F1-scores were achieved with the evaluated window lengths of 5 s (0.92 ± 0.02) and 3 s (0.91 ± 0.02), while classification performance with 1 s was lower (0.87 ± 0.02). The tested sampling frequencies of 50 Hz, 25 Hz, and 13 Hz gave similar results for most classified activity types, with the exception of outdoor cycling, where differences were significant. Forward sequential feature selection reduced the number of features from the initial 110 to about 12 without lowering the classification performance. The results of this study have been used for developing more efficient real-time physical activity classifiers.
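Forward sequential feature selection starts from an empty set and repeatedly adds the single feature that improves the score most, stopping when no candidate helps. The sketch below shows that greedy loop with a caller-supplied scoring function; `toy_score` and its weights are hypothetical stand-ins for the classifier's F1-score and do not come from the study.

```python
def forward_select(candidates, score):
    """Greedily add the feature that raises score(subset) the most."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current subset.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        f_best = max(trial_scores, key=trial_scores.get)
        if trial_scores[f_best] <= best:  # no strict improvement: stop
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best = trial_scores[f_best]
    return selected, best


# Hypothetical per-feature contributions to the score.
weights = {"a": 0.5, "b": 0.3, "c": 0.0, "d": -0.1}


def toy_score(subset):
    return sum(weights[f] for f in subset)


selected, best = forward_select(weights, toy_score)
```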

Heart Disease Prediction using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f9780.059120 ◽

2020 ◽

Vol 9 (1) ◽

pp. 700-704

Keyword(s):

Machine Learning ◽

Heart Disease ◽

Machine Learning Techniques ◽

Support Vector ◽

Disease Prediction ◽

Nearest Neighbour ◽

Decision Tree Classifier ◽

Support Vector Classifier ◽

Learning Techniques ◽

Tree Classifier

This work derives methodologies to detect heart issues at an earlier stage and to inform patients so that they can improve their health. To address this problem, we use Machine Learning techniques to predict the incidence at an earlier stage. We use certain parameters such as age, sex, height, weight, case history, smoking and alcohol consumption, together with tests such as blood pressure, cholesterol, diabetes, ECG and ECHO, for prediction. Many machine learning algorithms can be applied to this problem, including K-Nearest Neighbour, support vector classifier, decision tree classifier, logistic regression and Random Forest classifier. Using these parameters and algorithms, we predict whether or not the patient has heart disease and recommend how the patient can improve his/her health.
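As an illustration of the simplest algorithm in that list, a K-Nearest Neighbour classifier assigns a new case the majority label among its k closest training cases. The sketch below uses squared Euclidean distance and made-up two-dimensional points; the feature values, labels and choice of k are hypothetical and unrelated to the paper's data.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_points[i], query)),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: label 0 = no disease, label 1 = disease.
points = [(1, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 1, 1, 1]
pred_low = knn_predict(points, labels, (1.5, 1.5))
pred_high = knn_predict(points, labels, (5.5, 5.5))
```

In practice the input features would need to be scaled to comparable ranges first, since distance-based methods are sensitive to units.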
