Predictive analytics for blood glucose concentration: an empirical study using the tree-based ensemble approach

Purpose: The primary objective of this study was to identify critical indicators for predicting blood glucose (BG) through data-driven methods and to compare the prediction performance of four tree-based ensemble models: bagging with tree regressors (Bagging-DT), AdaBoost with tree regressors (AdaBoost-DT), random forest (RF) and gradient boosting decision tree (GBDT). Design/methodology/approach: This study proposed a majority voting feature selection method that combines lasso regression with the Akaike information criterion (AIC) (LR-AIC), lasso regression with the Bayesian information criterion (BIC) (LR-BIC) and RF to select indicators with strong predictive performance from an initial 38 indicators in 5,642 samples. The selected features were used to build the tree-based ensemble models. Ten-fold cross-validation (CV) was used to evaluate the performance of each ensemble model. Findings: The results of feature selection indicated that age, corpuscular hemoglobin concentration (CHC), red blood cell volume distribution width (RBCVDW), red blood cell volume and leucocyte count are the five most important clinical/physical indicators in BG prediction. Furthermore, this study found that the GBDT ensemble model combined with the proposed majority voting feature selection method outperforms the other three models in prediction performance and stability. Practical implications: This study proposed a novel BG prediction framework for better predictive analytics in health care. Social implications: This study incorporated medical background and machine learning technology to reduce diabetes morbidity and formulate precise medical schemes. Originality/value: The majority voting feature selection method combined with the GBDT ensemble model provides an effective decision-making tool for predicting BG and detecting diabetes risk in advance.
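
The majority-voting step described in this abstract can be illustrated with a minimal sketch. It assumes that each of the three selectors (LR-AIC, LR-BIC and RF) has already produced a set of retained indicators. The abstract does not state the vote threshold, so a strict majority (at least two of three) is assumed here. The function `majority_vote` and the example sets are hypothetical and are not the study's actual outputs.

```python
from collections import Counter


def majority_vote(selections, min_votes=None):
    """Return the features chosen by at least `min_votes` selectors.

    `selections` is a list of iterables of feature names, one per selector.
    If `min_votes` is None, a strict majority of the selectors is required
    (an assumption; the abstract does not give the threshold).
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    # Count each feature at most once per selector.
    votes = Counter(name for sel in selections for name in set(sel))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical selector outputs, for illustration only.
lr_aic = {"age", "CHC", "RBCVDW", "leucocyte_count"}
lr_bic = {"age", "CHC"}
rf = {"age", "RBCVDW", "leucocyte_count", "platelet_count"}

selected = majority_vote([lr_aic, lr_bic, rf])
print(selected)
```

In this toy input, `platelet_count` receives a single vote and is dropped, while every indicator backed by two or more selectors is kept.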

Comparison of radiomic pre-processing steps in the reproducible prediction of disease free survival across multi-scanners/centers

10.21203/rs.3.rs-875843/v1 ◽

2021 ◽

Author(s):

Marta Ferreira ◽

Pierre Lovinfosse ◽

Johanne Hermesse ◽

Marjolein Decuypere ◽

Caroline Rousseau ◽

...

Keyword(s):

Feature Selection ◽

Locally Advanced ◽

Feature Selection Method ◽

Disease Free Survival ◽

Selection Method ◽

Prediction Performance ◽

Data Set ◽

Free Survival ◽

Processing Steps ◽

Disease Free

Background: Feature reproducibility and model generalizability are currently among the most important limitations when integrating radiomics into clinical practice. Radiomic features are sensitive to imaging acquisition protocols, to reconstruction algorithms and parameters, and to the different steps of the usual radiomics workflow. We propose a framework for comparing the reproducibility of different pre-processing steps in PET/CT radiomic analysis for the prediction of disease-free survival (DFS) across multiple scanners/centers. Results: We evaluated and compared the prediction performance of several models that differ in (i) the type of intensity discretization, (ii) the feature selection method and (iii) the feature type, i.e., original or tumour-to-liver ratio radiomic features (OR or TLR). We trained our models using data from one scanner/center and tested them on two external scanners/centers. Our results show low reproducibility of predictions across scanners and discretization methods. Despite this, TLR-based models were generally more robust than OR-based models. Maximum relevance minimum redundancy (MRMR) forward feature selection with Pearson correlation had the best mean area under the precision-recall curve for two of the four classifiers when it was applied to the TLR features combined from all discretization bin numbers (D_All_FBN). Conclusion: We evaluated and compared the prediction performance of several models in a data set containing one hundred fifty-eight patients with locally advanced cervical cancer (LACC) from three distinct scanners. In our cohort of LACC patients, pre-processing of radiomic features in [18F]FDG PET affects DFS prediction performance across scanners, and combining the D_All_FBN TLR approach with the MRMR forward Pearson feature selection method might help increase the robustness of radiomic studies.
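
To make the MRMR forward selection with Pearson correlation more concrete, here is a minimal pure-Python sketch. It assumes the common "difference" form of the score (relevance minus mean redundancy), which the abstract does not specify. The names `pearson`, `mrmr_forward` and the toy feature columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_forward(features, target, k):
    """Greedy forward MRMR: pick k features maximising relevance - redundancy.

    relevance  = |r(feature, target)|
    redundancy = mean |r(feature, already selected feature)|
    The difference form of the score is an assumption made for this sketch.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            if selected:
                redundancy = sum(
                    abs(pearson(col, features[s])) for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: "feat_a_scaled" duplicates "feat_a", so it is fully redundant with it.
features = {
    "feat_a": [1, 2, 5, 6, 2, 5],
    "feat_a_scaled": [2, 4, 10, 12, 4, 10],
    "feat_b": [3, 1, 2, 3, 1, 2],
}
target = [0, 0, 1, 1, 0, 1]

selected = mrmr_forward(features, target, k=2)
print(selected)
```

The second pick is the weaker but non-redundant `feat_b` rather than the scaled duplicate, which is the behaviour MRMR is designed to produce.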

Penerapan Ensemble Feature Selection dan Klasterisasi Fitur pada Klasifikasi Dokumen Teks [Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification]

ComTech Computer Mathematics and Engineering Applications ◽

10.21512/comtech.v4i1.2745 ◽

2013 ◽

Vol 4 (1) ◽

pp. 333

Author(s):

Mediana Aryuni

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Selection Method ◽

Majority Voting ◽

Iterative Refinement ◽

Ensemble Method ◽

Computational Time ◽

Feature Clustering

An ensemble method is an approach in which several classifiers are created from the training data; the ensemble is often more accurate than any single classifier, especially if the base classifiers are accurate and differ from one another. Meanwhile, feature clustering can reduce the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of text document preprocessing, feature subspace generation using genetic algorithm-based iterative refinement, implementation of base classifiers by applying feature clustering, and integration of the classification results of each base classifier using both the static selection and majority voting methods. Experimental results show that classifying the dataset into 2 and 3 categories using the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than classification that does not employ the feature selection method. Also, using the static selection method, the ensemble feature selection method with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier in classifying the dataset into 2 and 3 categories, respectively. Using the majority voting method for the same experiment, the same ensemble method produces 10% and 12% better accuracy than the single classifier, respectively.
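
The majority voting integration of base classifier results can be sketched as follows. This is a minimal illustration, assuming each base classifier outputs one label per document. The function `vote_predictions` and the example labels are hypothetical. The tie-breaking rule (first label encountered wins) is a choice made for this sketch, not something stated in the abstract.

```python
from collections import Counter


def vote_predictions(predictions):
    """Combine per-classifier label lists into one label per document.

    `predictions` holds one list of labels per base classifier, all of the
    same length. For each document the most frequent label is returned;
    on a tie, the label seen first among the classifiers wins.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on four documents.
clf_a = ["sport", "tech", "tech", "sport"]
clf_b = ["sport", "sport", "tech", "tech"]
clf_c = ["tech", "tech", "tech", "sport"]

ensemble = vote_predictions([clf_a, clf_b, clf_c])
print(ensemble)
```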

An Effective Feature Selection Method Using Dynamic Information Criterion

Artificial Intelligence and Computational Intelligence - Lecture Notes in Computer Science ◽

10.1007/978-3-642-23881-9_59 ◽

2011 ◽

pp. 450-455

Author(s):

Huawen Liu ◽

Minshuo Li ◽

Jianmin Zhao ◽

Yuchang Mo

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Information Criterion ◽

Selection Method ◽

Dynamic Information

Effective feature reduction for link prediction in location-based social networks

Journal of Information Science ◽

10.1177/0165551518808200 ◽

2018 ◽

Vol 45 (5) ◽

pp. 676-690 ◽

Cited By ~ 1

Author(s):

Ahmet Engin Bayrak ◽

Faruk Polat

Keyword(s):

Genetic Algorithm ◽

Social Networks ◽

Feature Selection ◽

Link Prediction ◽

Feature Selection Method ◽

Selection Method ◽

Prediction Performance ◽

Feature Reduction ◽

Feature Subset ◽

Location Based Social Networks

In this study, we investigated feature-based approaches for improving link prediction performance in location-based social networks (LBSNs) and analysed their performance. We developed new features based on the time, common-friend detail and place-category information of check-in data in order to exploit information that the existing features from the literature cannot use. We proposed a feature selection method that determines a feature subset enhancing prediction performance by clustering features and removing redundant ones. After clustering, a genetic algorithm determines which features to select from each cluster. The proposed genetic algorithm ensures a non-monotonic and feasible feature selection. Results show that both the new features and the proposed feature selection method improved link prediction performance for LBSNs.

Diagnosis of brushless synchronous generator using numerical modeling

COMPEL The International Journal for Computation and Mathematics in Electrical and Electronic Engineering ◽

10.1108/compel-01-2020-0018 ◽

2020 ◽

Vol 39 (5) ◽

pp. 1241-1254

Author(s):

Mehdi Rahnama ◽

Abolfazl Vahedi ◽

Arta Mohammad-Alikhani ◽

Noureddine Takorabet

Keyword(s):

Feature Selection ◽

Fault Detection ◽

Feature Selection Method ◽

Synchronous Generator ◽

Selection Method ◽

Open Circuit ◽

Content Type ◽

Detection Approach ◽

Terminal Voltage ◽

Harmonic Components

Purpose: Timely fault diagnosis in electrical machines is a critical issue, as it can prevent a fault from developing and reduce repair time and cost. In brushless synchronous generators, fault diagnosis is even more significant because these machines are widely used to generate electrical power all around the world. Therefore, this study aims to propose a fault detection approach for the brushless synchronous generator. In this approach, a novel extension of the Relief feature selection method is developed. Design/methodology/approach: In this paper, the finite element method (FEM) is used to model a brushless synchronous machine and evaluate its performance under two conditions: the normal condition of the machine and an open-circuit fault of one diode of the rotating rectifier. The harmonic behavior of the terminal voltage of the machine is obtained under both conditions. The harmonic components are then ranked using the extension of Relief to extract the most appropriate components for fault detection. A fault detection approach is then proposed based on the ranked harmonic components and a support vector machine classifier. Findings: The proposed diagnosis approach is verified by an experimental test. Results show that this approach can effectively detect an open-circuit fault on the diode rectifier with an accuracy of 98.5% using five harmonic components of the terminal voltage [1]. Originality/value: In this paper, a novel feature selection method based on an extension of the Relief method is proposed to select the most effective FFT components, together with FEM modeling of a brushless synchronous generator under normal and one-diode open-circuit fault conditions.

Inflated prediction accuracy of neuropsychiatric biomarkers caused by data leakage in feature selection

Scientific Reports ◽

10.1038/s41598-021-87157-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Miseon Shim ◽

Seung-Hwan Lee ◽

Han-Jeong Hwang

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

Prediction Performance ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Eeg Data ◽

Neuropsychiatric Diseases ◽

The Impact

In recent years, machine learning techniques have been frequently applied to uncovering neuropsychiatric biomarkers with the aim of accurately diagnosing neuropsychiatric diseases and predicting treatment prognosis. However, many studies either did not perform cross-validation (CV) when using machine learning techniques or performed CV incorrectly, leading to significantly biased results due to overfitting. The aim of this study is to investigate the impact of CV on the prediction performance of neuropsychiatric biomarkers, in particular for feature selection performed with high-dimensional features. To this end, we evaluated prediction performance using both simulation data and actual electroencephalography (EEG) data. The overall prediction accuracies obtained when feature selection was performed outside of CV were considerably higher than those obtained when feature selection was performed within CV, for both the simulation and the actual EEG data. The difference between the prediction accuracies of the two feature selection approaches can be thought of as the amount of overfitting due to selection bias. Our results indicate the importance of using CV correctly to avoid biased estimates of the prediction performance of neuropsychiatric biomarkers.
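
The distinction between feature selection within and outside CV can be illustrated with a short pure-Python sketch of the leakage-free variant. Here the selector is refit inside every fold and sees training rows only. Everything in it is an assumption made for illustration: the correlation-based ranking, the nearest-centroid classifier, the interleaved fold assignment and the toy data are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation (0.0 if either sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def select_top_k(rows, labels, k):
    """Rank features by |r| with the label using ONLY the rows passed in."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]


def nearest_centroid_predict(train_rows, train_labels, test_rows):
    """Assign each test row to the class with the closest mean vector."""
    centroids = {}
    for label in set(train_labels):
        members = [r for r, lab in zip(train_rows, train_labels) if lab == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: sq_dist(row, centroids[c])) for row in test_rows]


def cv_accuracy_selection_within(rows, labels, k_features, n_folds):
    """k-fold CV in which feature selection is repeated inside each fold."""
    n = len(rows)
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        x_train = [rows[i] for i in train_idx]
        y_train = [labels[i] for i in train_idx]
        # The held-out fold never influences which features are kept.
        keep = select_top_k(x_train, y_train, k_features)
        x_train_sel = [[r[j] for j in keep] for r in x_train]
        x_test_sel = [[rows[i][j] for j in keep] for i in test_idx]
        preds = nearest_centroid_predict(x_train_sel, y_train, x_test_sel)
        correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
    return correct / n


# Toy data: feature 0 is informative, features 1 and 2 are structured noise.
labels = [0] * 6 + [1] * 6
rows = [[labels[i] * 10 + (i % 3) * 0.1, (i * 7) % 5, (i * 3) % 4] for i in range(12)]

accuracy = cv_accuracy_selection_within(rows, labels, k_features=1, n_folds=3)
print(accuracy)
```

The biased variant criticised in the abstract would call `select_top_k` once on all rows before the fold loop. In scikit-learn, the leakage-free behaviour is typically obtained by placing the selector inside a `Pipeline` that is passed to `cross_val_score`.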

Feature distillation and accumulated selection for automated fraudulent publisher classification from user click data of online advertising

Data Technologies and Applications ◽

10.1108/dta-09-2021-0233 ◽

2022 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Deepti Sisodia ◽

Dilip Singh Sisodia

Keyword(s):

Feature Selection ◽

Online Advertising ◽

Feature Selection Method ◽

Majority Voting ◽

Feature Subset ◽

Relevant Feature ◽

Selection Methods ◽

Content Type ◽

Optimal Feature Subset ◽

Optimal Feature

Purpose: The problem of choosing the most useful features from hundreds of features in time-series user click data arises in online advertising for fraudulent publisher classification. Selecting feature subsets is a key issue in such classification tasks. In practice, filter approaches are common; however, they neglect the correlations among features. Conversely, wrapper approaches could not be applied due to their complexity. Moreover, existing feature selection methods could not handle such data, which is one of the major causes of instability in feature selection. Design/methodology/approach: To overcome these issues, a majority voting-based hybrid feature selection method, namely feature distillation and accumulated selection (FDAS), is proposed to investigate the optimal subset of relevant features for analyzing the publisher's fraudulent conduct. FDAS works in two phases: (1) feature distillation, where significant features from standard filter and wrapper feature selection methods are obtained using majority voting; (2) accumulated selection, where an accumulated evaluation of the relevant feature subset is carried out to search for an optimal feature subset using effective machine learning (ML) models. Findings: Empirical results show enhanced classification performance with the proposed features in average precision, recall, f1-score and AUC in publisher identification and classification. Originality/value: The FDAS is evaluated on FDMA2012 user-click data and nine other benchmark datasets to gauge its generalizing characteristics: first, considering original features; second, with relevant feature subsets selected by feature selection (FS) methods; third, with the optimal feature subset obtained by the proposed approach. An ANOVA significance test is conducted to demonstrate significant differences between independent features.

Clustering as feature selection method in spam classification: uncovering sick-leave sellers

Applied Computing and Informatics ◽

10.1108/aci-09-2021-0248 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Mariam Elhussein ◽

Samiha Brahimi

Keyword(s):

Feature Selection ◽

Sick Leave ◽

Feature Selection Method ◽

Classification Performance ◽

Selection Method ◽

Content Type ◽

Similar Work ◽

Classifier Performance ◽

Feature Selection Approach ◽

Practical Implications

Purpose: This paper aims to propose a novel way of using textual clustering as a feature selection method. It is applied to identify the most important keywords in profile classification. The method is demonstrated through the problem of sick-leave promoters on Twitter. Design/methodology/approach: Four machine learning classifiers were used on a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared when the proposed clustering feature selection approach and the standard feature selection were applied. Findings: Random forest achieved the highest accuracy, 95.91%, which is higher than that of the similar work it was compared with. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoters' accounts that have spam-like behavior. Research limitations/implications: The method applied is novel; more testing is needed on other datasets before generalizing its results. Practical implications: The model applied can be used by Saudi authorities to report on the accounts that sell sick leaves online. Originality/value: The research proposes a new way in which textual clustering can be used in feature selection.
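
Sensitivity (recall), the measure this abstract emphasises, is the share of true promoters that the classifier catches: TP / (TP + FN). A minimal sketch of the computation follows. The function name `sensitivity` and the toy labels are hypothetical and are not data from the study.

```python
def sensitivity(y_true, y_pred, positive="promoter"):
    """Recall of the positive class: TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp + fn == 0:
        raise ValueError("y_true contains no positive samples")
    return tp / (tp + fn)


# Toy labels: four true promoters, one of which is missed by the classifier.
y_true = ["promoter"] * 4 + ["nonpromoter"] * 2
y_pred = ["promoter", "promoter", "promoter", "nonpromoter", "nonpromoter", "promoter"]

score = sensitivity(y_true, y_pred)
print(score)
```

Note that the false positive in the last position does not affect sensitivity; it would lower precision instead.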

Improvement of feature selection method in spam filtering

Journal of Computer Applications ◽

10.3724/sp.j.1087.2009.02812 ◽

2009 ◽

Vol 29 (10) ◽

pp. 2812-2815

Author(s):

Yang-zhu LU ◽

Xin-you ZHANG ◽

Yu QI

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Selection Method ◽

Spam Filtering

Feature Selection for Histopathological Image Classification using levy Flight Salp Swarm Optimizer

Recent Patents on Computer Science ◽

10.2174/2213275912666181210165129 ◽

2019 ◽

Vol 12 (4) ◽

pp. 329-337 ◽

Cited By ~ 2

Author(s):

Venubabu Rachapudi ◽

Golagani Lavanya Devi

Keyword(s):

Feature Selection ◽

Image Classification ◽

Feature Selection Method ◽

Selection Method ◽

Lévy Flight ◽

Levy Flight ◽

Local Optima ◽

Histopathological Image ◽

Surf Features ◽

Histopathological Image Classification

Background: An efficient feature selection method for histopathological image classification plays an important role in eliminating irrelevant and redundant features. Therefore, this paper proposes a new Levy flight salp swarm optimizer based feature selection method. Methods: The proposed Levy flight salp swarm optimizer based feature selection method uses Levy flight steps for each follower salp to move it away from local optima. The best solution returns the relevant and non-redundant features, which are fed to different classifiers for efficient and robust image classification. Results: The efficiency of the proposed Levy flight salp swarm optimizer has been verified on 20 benchmark functions. The proposed scheme outperforms the other meta-heuristic approaches considered. Furthermore, the proposed feature selection method has shown a better reduction in SURF features than the other methods considered and performed well for histopathological image classification. Conclusion: This paper proposes an efficient Levy flight salp swarm optimizer by modifying the step size of the follower salps. The proposed modification reduces the chances of getting stuck in local optima. Furthermore, the Levy flight salp swarm optimizer has been used to select optimum features from SURF features for histopathological image classification. The simulation results validate that the proposed method provides optimal values and high classification performance in comparison to other methods.
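
For readers unfamiliar with Levy flight steps, the sketch below draws heavy-tailed step lengths with Mantegna's algorithm, a common way to approximate a Levy-stable distribution. The abstract does not say which generator the authors use or how the step enters the follower-salp update, so the choice of Mantegna's algorithm, the exponent `beta = 1.5` and the function names are all assumptions of this sketch.

```python
import math
import random


def mantegna_sigma(beta=1.5):
    """Standard deviation of the numerator Gaussian in Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(beta=1.5, rng=random):
    """Draw one Levy-flight step length: u / |v|**(1/beta)."""
    u = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


# Seeded generator so the illustration is repeatable.
rng = random.Random(42)
steps = [levy_step(beta=1.5, rng=rng) for _ in range(5)]
print(steps)
```

Most draws are small, but occasional large steps occur; those rare long jumps are what allow a search agent to leave the neighbourhood of a local optimum.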
