Regularization based discriminative feature pattern selection for the classification of Parkinson cases using machine learning

2021
Vol 0 (0)
Author(s):
Kamalakannan Kaliyan
Anandharaj Ganesan

Abstract: Objectives: This paper focuses on developing a regularization-based feature selection approach to select the most effective attributes from the Parkinson’s speech dataset. Parkinson’s disease is a medical condition that progresses as the dopamine-producing nerve cells are affected. Early diagnosis often reduces the effect on individuals and minimizes the disease’s advancement over time. In recent times, intelligent computational models have been used in many complex cases to diagnose clinical conditions with high precision. These models are intended to find meaningful representations in the data to diagnose the disease. Machine learning acts as a tool that gears up the model learning process through a mathematical baseline, but it does not perform optimally in all cases. It comes with a few constraints, chiefly the representation of the data: learning models expect clean, noise-free input, which in turn produces better discriminative patterns over the different categories of classes. Methods: The proposed model identified five candidate features as predictors. This feature subset was trained with a variety of supervised classifiers to trace out the best-performing model. Results: The results are validated through accuracy, precision, recall, and receiver operating characteristic curves. The proposed regularization-based feature selection model outperformed the benchmark algorithms by attaining 100% accuracy on most of the classifiers, other than linear discriminant analysis (99.90%) and naïve Bayes (99.51%). Conclusions: This paper exhibits the need for intelligent models to analyze complex data patterns to assist medical practitioners in better disease diagnosis. The results show that regularization methods find the best features based on their importance scores, which improved model performance over other feature selection methods.
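As a rough illustration of how regularization performs feature selection, the sketch below fits an L1-penalized linear model by proximal gradient descent (ISTA) on synthetic data and keeps the features whose weights survive the penalty. This is a toy sketch of the general technique, not the paper's implementation; the data and all parameter values are invented for illustration.

```python
import random

random.seed(0)

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

# Synthetic data: the target depends only on features 0 and 2.
n, d = 200, 5
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [3.0 * x[0] - 2.0 * x[2] + random.gauss(0, 0.1) for x in X]

w = [0.0] * d
lr, lam = 0.01, 0.5   # step size and L1 strength
for _ in range(500):
    # Gradient of the (halved) mean squared error.
    grad = [0.0] * d
    for xi, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
        for j in range(d):
            grad[j] += err * xi[j] / n
    # Gradient step followed by soft-thresholding (ISTA).
    w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]

# "Selected" features are the ones whose weights survived the penalty.
selected = [j for j in range(d) if abs(w[j]) > 1e-6]
print("weights:", [round(v, 2) for v in w], "selected:", selected)
```

The L1 penalty drives the weights of irrelevant features exactly to zero, which is what makes the surviving subset directly usable as a feature selection.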

2021
Author(s):
Tammo P.A. Beishuizen
Joaquin Vanschoren
Peter A.J. Hilbers
Dragan Bošnački

Abstract Background: Automated machine learning aims to automate the building of accurate predictive models, including the creation of complex data preprocessing pipelines. Although successful in many fields, these systems struggle to produce good results on biomedical datasets, especially given the high dimensionality of the data. Results: In this paper, we explore the automation of feature selection in these scenarios. We analyze which feature selection techniques are ideally included in an automated system, determine how to efficiently find the ones that best fit a given dataset, integrate this into an existing AutoML tool (TPOT), and evaluate it on four very different yet representative types of biomedical data: microarray, mass spectrometry, clinical, and survey datasets. We focus on feature selection rather than latent feature generation since we often want to explain the model predictions in terms of the intrinsic features of the data. Conclusion: Our experiments show that for none of these datasets do we need more than 200 features to accurately explain the output; additional features did not increase the quality significantly. We also find that the automated machine learning results are significantly improved after adding additional feature selection methods and prior knowledge on how to select and tune them.


Author(s):  
ShuRui Li
Jing Jin
Ian Daly
Chang Liu
Andrzej Cichocki

Abstract Brain–computer interface (BCI) systems decode electroencephalogram (EEG) signals to establish a channel for direct interaction between the human brain and the external world without the need for muscle or nerve control. The P300 speller, one of the most widely used BCI applications, presents a selection of characters to the user and performs character recognition by identifying P300 event-related potentials from the EEG. Such P300-based BCI systems can reach good levels of accuracy but are difficult to use in day-to-day life due to feature redundancy and noisy signals, which leaves room for improvement. We propose a novel hybrid feature selection method for the P300-based BCI system to address the problem of feature redundancy, combining the Menger curvature and linear discriminant analysis. First, the two strategies are applied separately to a given dataset to estimate the gain from each feature. Then, each generated value set is ranked in descending order and judged against a predefined criterion for suitability in classification models. The intersection of the two approaches is then evaluated to identify an optimal feature subset. The proposed method is evaluated using three public datasets, i.e., BCI Competition III dataset II, the BNCI Horizon dataset, and the EPFL dataset. Experimental results indicate that compared with other typical feature selection and classification methods, our proposed method has better or comparable performance. Additionally, our proposed method achieves the best classification accuracy over all epochs in the three datasets. In summary, our proposed method provides a new way to enhance the performance of the P300-based BCI speller.
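The rank-and-intersect scheme described above can be sketched in a few lines. Note that the two scoring criteria here are simple stand-ins (variance and absolute label correlation) rather than the paper's Menger-curvature and LDA measures, and the data are a toy example.

```python
def rank_top_k(scores, k):
    """Indices of the k highest-scoring features (stable on ties)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])

def variance(col):
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)

def abs_corr(col, y):
    n = len(col)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    sx = sum((a - mx) ** 2 for a in col) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

# Toy data: 4 samples x 5 features, binary labels.
X = [
    [1.0, 0.2, 5.0, 0.0, 1.0],
    [2.0, 0.1, 5.1, 0.0, 0.0],
    [3.0, 0.3, 4.9, 1.0, 1.0],
    [4.0, 0.2, 5.0, 1.0, 0.0],
]
y = [0, 0, 1, 1]

cols = list(zip(*X))
score_a = [variance(c) for c in cols]      # stand-in for criterion 1
score_b = [abs_corr(c, y) for c in cols]   # stand-in for criterion 2

# Rank by each criterion, keep the top-k of both, intersect.
k = 3
selected = sorted(rank_top_k(score_a, k) & rank_top_k(score_b, k))
print("selected feature indices:", selected)
```

Taking the intersection keeps only features that both criteria agree are informative, which is the redundancy-reducing step the abstract describes.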


2021
Vol 11
Author(s):
Qi Wan
Jiaxuan Zhou
Xiaoying Xia
Jianfeng Hu
Peng Wang
...  

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify SPLs based on magnetic resonance (MR) T2-weighted imaging (T2WI). Materials and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3-9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), the precision-recall plot, and the Matthews Correlation Coefficient (MCC) were used to evaluate the performance of the machine learning approaches. Results: The 3D features were significantly superior to 2D features, yielding many more machine learning combinations with AUC greater than 0.7 in both the validation and test groups (129 vs. 11). The feature selection methods Analysis of Variance (ANOVA) and Recursive Feature Elimination (RFE), and the classifiers Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Gaussian Process (GP) had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed similar results to 3D features alone.
Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively. Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because more machine learning algorithmic combinations with better performance are available. The feature selection methods ANOVA and RFE, and the classifiers LR, LDA, SVM, and GP are more likely to demonstrate better diagnostic performance for 3D features in the current study.
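The count of 1260 models is simply the Cartesian product of the pipeline choices, which can be verified in a few lines (the method names below are placeholders, since the abstract does not enumerate them all):

```python
from itertools import product

# Placeholder names standing in for the study's concrete methods.
normalizations = ["norm_1", "norm_2", "norm_3"]       # 3 methods
dim_reductions = ["dimred_1", "dimred_2"]             # 2 algorithms
selectors = ["anova", "rfe", "selector_3"]            # 3 methods
classifiers = [f"clf_{i}" for i in range(10)]         # 10 classifiers
feature_numbers = list(range(3, 10))                  # 3..9 -> 7 values

pipelines = list(product(normalizations, dim_reductions, selectors,
                         classifiers, feature_numbers))
print(len(pipelines))  # 3 * 2 * 3 * 10 * 7 = 1260
```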


Author(s):  
Maria Mohammad Yousef

Generally, medical dataset classification has become one of the biggest problems in data mining research. Every dataset has a given number of features, but some of these features can be redundant or even harmful and can disrupt the classification process; this is known as the high-dimensionality problem. Dimensionality reduction in data preprocessing is critical for increasing the performance of machine learning algorithms, and feature subset selection contributes to dimensionality reduction while giving a significant improvement in classification accuracy. In this paper, we propose a new hybrid feature selection approach, based on a genetic algorithm (GA) assisted by the k-nearest neighbor (KNN) method, to deal with the high dimensionality of biomedical data classification. The proposed method first applies the combination of GA and KNN for feature selection to find the optimal subset of features, where the classification accuracy of the KNN method is used as the fitness function for the GA. After selecting the best-suggested subset of features, a Support Vector Machine (SVM) is used as the classifier. The proposed method is evaluated on five medical datasets from the UCI Machine Learning Repository. It is noted that the suggested technique performs admirably on these datasets, achieving higher classification accuracy while using fewer features.
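A minimal sketch of the GA + KNN wrapper idea on synthetic data: each chromosome is a feature bitmask, and its fitness is the leave-one-out accuracy of a 1-NN classifier on the selected features. The GA operators are deliberately simplified (elitism plus single-bit mutation, no crossover), and the final SVM stage is omitted; this illustrates the approach, not the authors' code.

```python
import random

random.seed(1)

# Toy data: features 0 and 1 separate the two classes; 2 and 3 are noise.
X, y = [], []
for label, (cx, cy) in enumerate([(0.0, 0.0), (5.0, 5.0)]):
    for _ in range(10):
        X.append([cx + random.gauss(0, 1), cy + random.gauss(0, 1),
                  random.gauss(0, 1), random.gauss(0, 1)])
        y.append(label)

def fitness(mask):
    """Leave-one-out 1-NN accuracy on the features enabled in `mask`."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    correct = 0
    for i in range(len(X)):
        best_d, best_lab = float("inf"), None
        for k in range(len(X)):
            if k != i:
                d = sum((X[i][j] - X[k][j]) ** 2 for j in idx)
                if d < best_d:
                    best_d, best_lab = d, y[k]
        correct += best_lab == y[i]
    return correct / len(X)

def mutate(mask):
    """Flip one random bit of a feature bitmask."""
    child = list(mask)
    child[random.randrange(len(child))] ^= 1
    return tuple(child)

# GA loop with elitism: keep the best half, refill by mutating survivors.
pop = [tuple(random.randint(0, 1) for _ in range(4)) for _ in range(8)]
for _ in range(10):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:4] + [mutate(random.choice(pop[:4])) for _ in range(4)]

best = max(pop, key=fitness)
print("best mask:", best, "LOO 1-NN accuracy:", fitness(best))
```

Because the classifier's accuracy itself is the fitness function, this is a wrapper method: the search is guided by how well the candidate subset actually classifies.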


2020
Vol 17 (5)
pp. 721-730
Author(s):
Kamal Bashir
Tianrui Li
Mahama Yahaya

The most frequently used machine learning feature ranking approaches fail to present an optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), RelieF (RF), and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after balancing the class distribution in the training data. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets in their sampled and unsampled forms to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the FS subsets based on the sampled and unsampled datasets. The performance of the models, captured using the Area Under the Receiver Operating Characteristic Curve (AUC) metric, is compared for all FS methods considered. The Analysis of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the FS techniques, in both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.
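The gist of ranking features with a maximum-likelihood logistic regression can be sketched as follows: fit the model by plain gradient ascent on the log-likelihood, then order features by coefficient magnitude. The synthetic data and hyperparameters here are invented for illustration; the paper's MLLR procedure is more involved.

```python
import math
import random

random.seed(2)

# Synthetic data: only features 0 and 2 drive the binary label.
n, d = 300, 4
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [1 if 2.0 * x[0] - 1.5 * x[2] + random.gauss(0, 0.5) > 0 else 0
     for x in X]

# Maximum-likelihood fit of logistic regression by gradient ascent.
w = [0.0] * d
lr = 0.1
for _ in range(300):
    grad = [0.0] * d
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted P(y = 1)
        for j in range(d):
            grad[j] += (yi - p) * xi[j] / n
    w = [wj + lr * gj for wj, gj in zip(w, grad)]

# Rank features by the magnitude of their fitted coefficients.
ranking = sorted(range(d), key=lambda j: abs(w[j]), reverse=True)
top2 = sorted(ranking[:2])
print("ranking:", ranking, "top-2 subset:", top2)
```

On (roughly) standardized inputs, coefficient magnitude is a reasonable proxy for a feature's contribution to the fitted log-odds, which is what makes the ranking usable for selection.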


2020
Vol 23 (4)
pp. 304-312
Author(s):
ShaoPeng Wang
JiaRui Li
Xijun Sun
Yu-Hang Zhang
Tao Huang
...  

Background: As a newly uncovered post-translational modification on the ε-amino group of lysine residues, protein malonylation was found to be involved in metabolic pathways and certain diseases. Apart from experimental approaches, several computational methods based on machine learning algorithms were recently proposed to predict malonylation sites. However, previous methods failed to address the imbalance between positive and negative sample sizes. Objective: In this study, we identified the significant features of malonylation sites with a novel computational method that applies machine learning algorithms and balances data sizes by applying the synthetic minority over-sampling technique. Method: Four types of features, namely amino acid (AA) composition, position-specific scoring matrix (PSSM), AA factor, and disorder, were used to encode residues in protein segments. Then, a two-step feature selection procedure including maximum relevance minimum redundancy and incremental feature selection, together with the random forest algorithm, was performed on the constructed hybrid feature vector. Results: An optimal classifier was built from the optimal feature subset, which featured an F1-measure of 0.356. Feature analysis was performed on several selected important features. Conclusion: The results showed that certain types of PSSM and disorder features may be closely associated with malonylation of lysine residues. Our study contributes to the development of computational approaches for predicting malonyllysine and provides insights into the molecular mechanism of malonylation.
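The second step, incremental feature selection, is easy to sketch: given a relevance ranking (a crude mean-difference score stands in for mRMR here), evaluate each prefix of the ranking with a classifier (a nearest-centroid stand-in for the paper's random forest) and keep the best-scoring prefix. The data are synthetic.

```python
import random

random.seed(3)

# Toy data: features 0 and 1 are informative, 2 and 3 are noise.
X, y = [], []
for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
    for _ in range(20):
        X.append([cx + random.gauss(0, 1), cy + random.gauss(0, 1),
                  random.gauss(0, 1), random.gauss(0, 1)])
        y.append(label)

def relevance(j):
    """Absolute difference of per-class means: a crude relevance score."""
    v0 = [X[i][j] for i in range(len(X)) if y[i] == 0]
    v1 = [X[i][j] for i in range(len(X)) if y[i] == 1]
    return abs(sum(v1) / len(v1) - sum(v0) / len(v0))

def centroid_accuracy(feats):
    """Nearest-centroid accuracy using only the given feature indices."""
    cents = {}
    for lab in (0, 1):
        rows = [X[i] for i in range(len(X)) if y[i] == lab]
        cents[lab] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    correct = 0
    for xi, yi in zip(X, y):
        dists = {lab: sum((xi[j] - c[k]) ** 2 for k, j in enumerate(feats))
                 for lab, c in cents.items()}
        correct += min(dists, key=dists.get) == yi
    return correct / len(X)

# Step 1: rank all features by relevance.
ranked = sorted(range(4), key=relevance, reverse=True)
# Step 2: incremental feature selection over the ranking's prefixes
# (ties go to the smaller prefix).
scores = [(centroid_accuracy(ranked[:k]), k) for k in range(1, 5)]
best_score, best_k = max(scores, key=lambda t: (t[0], -t[1]))
print("ranking:", ranked, "best prefix size:", best_k,
      "accuracy:", best_score)
```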


2021
pp. 08-16
Author(s):
Mohamed Abdel-Basset
Mohamed Elhoseny

In the current epidemic situation, people face several mental disorders related to Depression, Anxiety, and Stress (DAS). Numerous scales have been developed to measure DAS levels, and the DAS-21 is one of them. At the same time, machine learning (ML) models are widely applied to solve classification problems efficiently, and feature selection (FS) approaches can be designed to improve classifier results. In this aspect, this paper develops an intelligent feature selection with ML-based risk management (IFSML-RM) technique for DAS prediction. The IFSML-RM technique follows a two-stage process: quantum elephant herd optimization-based FS (QEHO-FS) and decision tree (DT) based classification. The QEHO algorithm utilizes the input data to select a valuable subset of features at the primary level. Then, the chosen features are fed into the DT classifier to determine the existence or non-existence of DAS. A detailed experimentation process is carried out on the benchmark dataset, and the experimental results demonstrate the superiority of the IFSML-RM technique across different performance measures.


2013
Vol 23 (05)
pp. 1350020
Author(s):
Daniel Álvarez
Roberto Hornero
J. Víctor Marcos
Niels Wessel
Thomas Penzel
...  

This study is aimed at assessing the usefulness of different feature selection and classification methodologies in the context of sleep apnea hypopnea syndrome (SAHS) detection. Feature extraction, selection and classification stages were applied to analyze blood oxygen saturation (SaO2) recordings in order to simplify polysomnography (PSG), the gold standard diagnostic methodology for SAHS. Statistical, spectral and nonlinear measures were computed to compose the initial feature set. Principal component analysis (PCA), forward stepwise feature selection (FSFS) and genetic algorithms (GAs) were applied to select feature subsets. Fisher's linear discriminant (FLD), logistic regression (LR) and support vector machines (SVMs) were applied in the classification stage. Optimum classification algorithms from each combination of these feature selection and classification approaches were prospectively validated on datasets from two independent sleep units. FSFS + LR achieved the highest diagnostic performance using a small feature subset (4 features), reaching 83.2% accuracy in the validation set and 88.7% accuracy in the test set. Similarly, GAs + SVM also achieved high generalization capability using a small number of input features (7 features), with 84.2% accuracy on the validation set and 84.5% accuracy in the test set. Our results suggest that reduced subsets of complementary features (25% to 50% of total features) and classifiers with high generalization ability could provide high-performance screening tools in the context of SAHS.
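Forward stepwise feature selection, as used in the FSFS + LR combination above, can be sketched as a greedy wrapper loop: starting from the empty set, repeatedly add whichever feature most improves the score, and stop when nothing helps. In the sketch below, a tiny Gaussian naive Bayes on synthetic data stands in for the study's classifiers and SaO2 features.

```python
import math
import random

random.seed(4)

# Synthetic data: features 0 and 1 carry class information; 2 and 3 are noise.
X, y = [], []
for label, (m0, m1) in enumerate([(0.0, 0.0), (4.0, -4.0)]):
    for _ in range(25):
        X.append([m0 + random.gauss(0, 1), m1 + random.gauss(0, 1),
                  random.gauss(0, 1), random.gauss(0, 1)])
        y.append(label)

def gnb_accuracy(feats):
    """Training accuracy of Gaussian naive Bayes restricted to `feats`."""
    if not feats:
        return 0.0
    stats = {}  # label -> list of (mean, variance) per feature in `feats`
    for lab in (0, 1):
        rows = [X[i] for i in range(len(X)) if y[i] == lab]
        per_feat = []
        for j in feats:
            vals = [r[j] for r in rows]
            m = sum(vals) / len(vals)
            v = max(1e-9, sum((a - m) ** 2 for a in vals) / len(vals))
            per_feat.append((m, v))
        stats[lab] = per_feat
    correct = 0
    for xi, yi in zip(X, y):
        ll = {lab: sum(-0.5 * math.log(2 * math.pi * v)
                       - (xi[j] - m) ** 2 / (2 * v)
                       for (m, v), j in zip(ms, feats))
              for lab, ms in stats.items()}
        correct += max(ll, key=ll.get) == yi
    return correct / len(X)

# Forward stepwise search: add the best feature until none improves
# the score (ties prefer the lower feature index).
selected, remaining, score = [], list(range(4)), 0.0
while remaining:
    gains = [(gnb_accuracy(selected + [j]), j) for j in remaining]
    best_score, best_j = max(gains, key=lambda t: (t[0], -t[1]))
    if best_score <= score:
        break
    selected.append(best_j)
    remaining.remove(best_j)
    score = best_score
print("selected:", selected, "training accuracy:", score)
```

The stop-when-no-improvement rule is what yields the small subsets the study reports: the search halts as soon as the remaining features add nothing.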


Author(s):  
Hamza Turabieh
Ahmad S. Alghamdi

Wi-Fi technology is now everywhere, both inside and outside buildings. Wi-Fi technology enables indoor localization services (ILS). Determining indoor user location is a hard and complex problem. Several applications highlight the importance of indoor user localization, such as disaster management, health care zones, Internet of Things (IoT) applications, and public settlement planning. Measurements of Wi-Fi signal strength (i.e., the Received Signal Strength Indicator (RSSI)) can be used to determine indoor user location. In this paper, we propose a hybrid model combining a feature selection algorithm with machine learning classifiers to determine indoor user location. We employ the Minimum Redundancy Maximum Relevance (mRMR) algorithm as a feature selection method to select the most active access points (APs) based on RSSI values. Six different machine learning classifiers were used in this work: Decision Tree (DT), Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Linear Discriminant Analysis (LDA), Ensemble Bagged Tree (EBaT), and Ensemble Boosted Tree (EBoT). We examined all classifiers on a public dataset obtained from the UCI repository. The obtained results show that EBoT outperforms all other classifiers in terms of accuracy.
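A small sketch of mRMR's selection rule: relevance is the mutual information between a feature and the label, redundancy is the mean mutual information with already-selected features, and each round greedily picks the feature maximizing relevance minus redundancy. The toy "AP readings" below are already discretized (real RSSI values would be binned first), and the scenario is invented for illustration.

```python
from collections import Counter
from math import log2

def mutual_info(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[v] / n)))
               for (x, v), c in pab.items())

# Toy discretized "AP readings" over 8 scans, plus room labels.
features = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # AP0: tracks the label except at scan 3
    [0, 0, 0, 1, 1, 1, 1, 1],  # AP1: exact duplicate of AP0 (redundant)
    [1, 0, 0, 0, 1, 1, 1, 1],  # AP2: tracks the label except at scan 0
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

selected, candidates = [], list(range(len(features)))
for _ in range(2):
    def score(j):
        rel = mutual_info(features[j], labels)
        red = (sum(mutual_info(features[j], features[s]) for s in selected)
               / len(selected)) if selected else 0.0
        return rel - red
    best = max(candidates, key=score)
    selected.append(best)
    candidates.remove(best)
print("selected APs:", selected)  # the duplicate AP1 is skipped
```

Although AP1 is exactly as relevant as AP0, its redundancy penalty cancels its relevance, so mRMR prefers AP2, whose information about the label is largely complementary.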


2021
Vol 2021
pp. 1-11
Author(s):
Fei Guo
Zhixiang Yin
Kai Zhou
Jiasi Li

Long noncoding RNAs (lncRNAs) are a class of RNAs that are longer than 200 nt and do not encode proteins. Studies have shown that lncRNAs can regulate gene expression at the epigenetic, transcriptional, and post-transcriptional levels; they are not only closely related to the occurrence, development, and prevention of human diseases, but can also regulate plant flowering and participate in plant abiotic stress responses such as drought and salt stress. Therefore, accurately and efficiently identifying lncRNAs remains an essential task in related research. A large number of identification tools based on machine learning and deep learning algorithms exist, but most use human and mouse gene sequences as training sets, rarely plants, and apply only one feature selection method (or one class of methods) after feature extraction. We developed an identification model covering dicots, monocots, algae, mosses, and ferns. After comparing 20 feature selection methods (seven filter and thirteen wrapper methods) combined with seven classifiers, while considering both the correlation between features and model redundancy, we found that the WOA-XGBoost-based model performed best, with an accuracy of 91.55%, an AUC of 96.78%, and an F1 score of 91.68%. Meanwhile, the number of elements in the feature subset was reduced to 23, which effectively improved the prediction accuracy and modeling efficiency.

