Automated Feature Selection and Classification for High-Dimensional Biomedical Data

Abstract Background: Automated machine learning (AutoML) aims to automate the building of accurate predictive models, including the creation of complex data preprocessing pipelines. Although successful in many fields, AutoML systems struggle to produce good results on biomedical datasets, especially given the high dimensionality of the data. Result: In this paper, we explore the automation of feature selection in these scenarios. We analyze which feature selection techniques should be included in an automated system, determine how to efficiently find the ones that best fit a given dataset, integrate this into an existing AutoML tool (TPOT), and evaluate it on four very different yet representative types of biomedical data: microarray, mass spectrometry, clinical, and survey datasets. We focus on feature selection rather than latent feature generation since we often want to explain the model predictions in terms of the intrinsic features of the data. Conclusion: Our experiments show that none of these datasets requires more than 200 features to accurately explain the output; additional features did not significantly improve quality. We also find that the automated machine learning results improve significantly after adding further feature selection methods and prior knowledge on how to select and tune them.
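As a rough illustration of capping a high-dimensional dataset at 200 features, the sketch below ranks features with a simple univariate filter (an absolute Welch t-statistic) and keeps the top k. This is an assumption made for illustration only: it is not the selection procedure evaluated in the paper and does not use the TPOT API. The function names and the synthetic data are hypothetical.

```python
import random
import statistics


def welch_t_score(values, labels):
    """Absolute Welch t-statistic of one feature between classes 0 and 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    denom = (var_a / len(a) + var_b / len(b)) ** 0.5
    if denom == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / denom


def select_top_k(X, y, k):
    """Return the (sorted) indices of the k highest-scoring feature columns."""
    n_features = len(X[0])
    scores = [welch_t_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical synthetic data: 60 samples, 1000 features,
# of which only the first 5 differ between the two classes.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(3.0 * label if j < 5 else 0.0, 1.0) for j in range(1000)] for label in y]

selected = select_top_k(X, y, k=200)
print(len(selected), all(j in selected for j in range(5)))
```

A filter like this scores each feature independently, so it is cheap on wide data but cannot detect features that are only informative in combination. That limitation is one reason an automated system might search over several selection methods rather than fix one.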

Sentiment Analysis of Movie Reviews: A Study of Machine Learning Algorithms with Various Feature Selection Methods

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v5i9.113121 ◽

2017 ◽

Vol 5 (9) ◽

Cited By ~ 1

Author(s):

Rajwinder Kaur

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Selection Methods

Effective combining of feature selection techniques for machine learning-enabled IoT intrusion detection

Multimedia Tools and Applications ◽

10.1007/s11042-021-10567-y ◽

2021 ◽

Author(s):

Md Arafatur Rahman ◽

A. Taufiq Asyhari ◽

Ong Wei Wen ◽

Husnul Ajra ◽

Yussuf Ahmed ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Intrusion Detection ◽

Feature Selection Techniques

Feature-Selection and Mutual-Clustering Approaches to Improve DoS Detection and Maintain WSNs’ Lifetime

Sensors ◽

10.3390/s21144821 ◽

2021 ◽

Vol 21 (14) ◽

pp. 4821

Author(s):

Rami Ahmad ◽

Raniyah Wazirali ◽

Qusay Bsoul ◽

Tarik Abu-Ain ◽

Waleed Abu-Ain

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Open Field ◽

Network Lifetime ◽

Detection Efficiency ◽

Denial Of Service ◽

Harmony Search ◽

Machine Learning Algorithms ◽

Transport Layer ◽

Feature Selection Techniques

Wireless Sensor Networks (WSNs) continue to face two major challenges: energy and security. As a consequence, one of the WSN-related security tasks is to protect them from Denial of Service (DoS) and Distributed DoS (DDoS) attacks. Machine learning-based systems are the only viable option for these types of attacks, as traditional deep packet scan systems depend on inspecting open fields in transport layer security packets, and the trend is toward encrypting those open fields. Moreover, network data traffic will become more complex as growing usage increases the amount of data transmitted between WSN nodes. Therefore, feature selection techniques need to be used with machine learning to determine which data are most important in the DoS detection process. This paper examined techniques for improving DoS anomaly detection while balancing it against power reservation in WSNs. A new clustering technique, called the CH_Rotations algorithm, was introduced to improve anomaly detection efficiency over a WSN's lifetime. Furthermore, the use of feature selection techniques with machine learning algorithms in examining WSN node traffic, and the effect of these techniques on the lifetime of WSNs, was evaluated. The evaluation results showed that Water Cycle (WC) feature selection achieved the best average accuracy, 2%, 5%, 3%, and 3% higher than Particle Swarm Optimization (PSO), Simulated Annealing (SA), Harmony Search (HS), and Genetic Algorithm (GA), respectively. Moreover, WC with a Decision Tree (DT) classifier showed 100% accuracy with only one feature. In addition, the CH_Rotations algorithm improved network lifetime by 30% compared to the standard LEACH protocol. Network lifetime using the WC + DT technique was reduced by 5% compared to scenarios without WC + DT.

Comparative study on total nitrogen prediction in wastewater treatment plant and effect of various feature selection methods on machine learning algorithms performance

Journal of Water Process Engineering ◽

10.1016/j.jwpe.2021.102033 ◽

2021 ◽

Vol 41 ◽

pp. 102033

Author(s):

Faramarz Bagherzadeh ◽

Mohamad-Javad Mehrani ◽

Milad Basirifard ◽

Javad Roostaei

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Wastewater Treatment ◽

Comparative Study ◽

Total Nitrogen ◽

Wastewater Treatment Plant ◽

Learning Algorithms ◽

Treatment Plant ◽

Machine Learning Algorithms ◽

Selection Methods

Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers

Computers in Biology and Medicine ◽

10.1016/j.compbiomed.2021.104244 ◽

2021 ◽

Vol 131 ◽

pp. 104244

Author(s):

Jnanendra Prasad Sarkar ◽

Indrajit Saha ◽

Anasua Sarkar ◽

Ujjwal Maulik

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Survival Analysis ◽

Breast Cancer Subtype ◽

Selection Methods ◽

Cancer Subtype

Incorporate Syntactic Information for Short Text Classification

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.268-270.697 ◽

2011 ◽

Vol 268-270 ◽

pp. 697-700

Author(s):

Rui Xue Duan ◽

Xiao Jie Wang ◽

Wen Feng Li

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Environment ◽

Text Classification ◽

The Internet ◽

Selection Methods ◽

Text Documents ◽

Short Text ◽

Syntactic Information ◽

Dependency Relations

As the volume of online short text documents grows tremendously on the Internet, organizing short texts well becomes increasingly urgent. However, traditional feature selection methods are not suitable for short text. In this paper, we propose a method that incorporates syntactic information for short text. It emphasizes features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The experimental results show that incorporating syntactic information yields more powerful features than traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
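For readers unfamiliar with the two baseline methods named above, here is a minimal sketch of document frequency (DF) and the chi-square statistic (CHI) as term-scoring functions. It covers only these baselines, not the syntactic method proposed in the paper. The toy corpus, labels, and function names are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def chi_square(docs, labels, term, target):
    """Chi-square statistic of `term` versus class `target` from a 2x2 table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label == target:
            a += 1  # term present, document in target class
        elif present:
            b += 1  # term present, document in another class
        elif label == target:
            c += 1  # term absent, document in target class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "news"],
    ["cheap", "hotel", "deal"],
    ["match", "score", "news"],
    ["team", "score", "win"],
    ["team", "match", "news"],
]
labels = ["travel", "travel", "travel", "sport", "sport", "sport"]

df = document_frequency(docs)
chi_flight = chi_square(docs, labels, "flight", "travel")
chi_news = chi_square(docs, labels, "news", "travel")
print(df["flight"], df["news"], chi_flight, chi_news)
```

In this toy corpus, DF ranks "news" above "flight" because it appears in more documents, while CHI ranks "flight" higher because it occurs only in the travel class. Both scores are computed from a handful of tokens per document, so they rest on very sparse counts when texts are short.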

Random Forests Followed by Computed ABC Analysis as a Feature Selection Method for Machine Learning in Biomedical Data

Studies in Classification, Data Analysis, and Knowledge Organization - Advanced Studies in Classification and Data Science ◽

10.1007/978-981-15-3311-2_5 ◽

2020 ◽

pp. 57-69

Author(s):

Jörn Lötsch ◽

Alfred Ultsch

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forests ◽

Feature Selection Method ◽

Selection Method ◽

Biomedical Data ◽

ABC Analysis

Improve the Accuracy of Heart Disease Predictions Using Machine Learning and Feature Selection Techniques

Communications in Computer and Information Science - Machine Learning, Image Processing, Network Security and Data Sciences ◽

10.1007/978-981-15-6318-8_19 ◽

2020 ◽

pp. 214-228

Author(s):

Abdelmegeid Amin Ali ◽

Hassan Shaban Hassan ◽

Eman M. Anwar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Heart Disease ◽

Feature Selection Techniques

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

Sandra Kübler ◽

Can Liu ◽

Zeeshan Ali Sayyed

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is common to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
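The idea of scoring a feature against all classes at once can be sketched with the textbook entropy-based definition of information gain, where the label entropy is taken over every rating class jointly. This is an illustrative assumption: the abstract does not give the paper's exact formulation. The toy reviews, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs."""
    with_term = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_term = [y for tokens, y in zip(docs, labels) if term not in tokens]
    conditional = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part  # skip an empty split to avoid dividing by zero
    )
    return entropy(labels) - conditional


# Hypothetical toy reviews (as token sets) with three rating classes.
docs = [
    {"delicious", "easy"},
    {"delicious", "perfect"},
    {"okay", "easy"},
    {"okay", "bland"},
    {"bland", "awful"},
    {"awful", "easy"},
]
labels = ["high", "high", "mid", "mid", "low", "low"]

ig_delicious = information_gain(docs, labels, "delicious")
ig_easy = information_gain(docs, labels, "easy")
print(ig_delicious, ig_easy)
```

Here "delicious" has a positive gain because it isolates the "high" class, while "easy" has a gain of zero because it occurs equally often in every class. Because the entropy is computed over all classes jointly, each feature needs a single score rather than one score per binary sub-problem.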

Identification and analysis of the cleavage site in a signal peptide using SMOTE, dagging, and feature selection methods

Molecular Omics ◽

10.1039/c7mo00030h ◽

2018 ◽

Vol 14 (1) ◽

pp. 64-73 ◽

Cited By ~ 16

Author(s):

ShaoPeng Wang ◽

Deling Wang ◽

JiaRui Li ◽

Tao Huang ◽

Yu-Dong Cai

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Signal Peptide ◽

Cleavage Site ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Cleavage Sites ◽

Selection Methods

Several machine learning algorithms were adopted to investigate cleavage sites in a signal peptide. An optimal dagging-based classifier was constructed, and 870 features were deemed important for this classifier.
