Feature selection approaches for predictive modelling of cadmium sources and pollution levels in water springs

Author(s):  
Fatima K. Abu Salem ◽  
Mey Jurdi ◽  
Mohamad Alkadri ◽  
Firas Hachem ◽  
Hassan R. Dhaini
2016 ◽  
Vol 16 (2) ◽  
pp. 29-39
Author(s):  
Mariusz Kubus

Abstract: Regression methods can be used for the valuation of real estate under the comparative approach. However, one of the problems of predictive modelling is the presence of redundant or irrelevant variables in the data. Such variables can decrease the stability of models and may even reduce prediction accuracy. The choice of a property’s features is largely determined by the appraiser, who is guided by his or her experience. Still, the use of statistical methods of feature selection can lead to a more accurate valuation model. In the paper we apply regularized linear regression, which belongs to the embedded methods of feature selection. For the considered data set of real estate land designated for single-family housing, we obtained a model that led to a more accurate valuation than some other popular linear models applied with or without feature selection. To assess the model’s quality we used leave-one-out cross-validation.
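The embedded approach the abstract describes can be sketched with scikit-learn: the L1 penalty of the lasso performs feature selection during fitting, and leave-one-out cross-validation scores the resulting valuations. This is an illustrative sketch on synthetic data, not the paper's code; the data set, penalty strength, and feature count are assumptions.

```python
# Illustrative sketch (not the paper's code): lasso regression as an
# embedded feature-selection method, scored with leave-one-out CV.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the real-estate data: 40 "properties",
# 10 candidate features, only 4 of them truly informative.
X, y = make_regression(n_samples=40, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

model = Lasso(alpha=1.0)

# Leave-one-out: each property is valued once by a model fit on all others.
preds = cross_val_predict(model, X, y, cv=LeaveOneOut())
rmse = np.sqrt(np.mean((preds - y) ** 2))

# The L1 penalty drives coefficients of irrelevant features exactly to zero,
# so selection happens inside the fitting step ("embedded" selection).
model.fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"LOOCV RMSE: {rmse:.2f}, features kept: {len(selected)}/10")
```

The key design point is that no separate filter step is needed: which features survive is a by-product of the penalized fit.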


2019 ◽  
Vol 16 (8) ◽  
pp. 3379-3383
Author(s):  
Emad Afaq Khan ◽  
Sumaira Muhammad Hayat Khan

Attrition can be defined as the gradual reduction of personnel in an organization due to retirement, resignation, or death. It covers the number of employees leaving the organization through both voluntary and involuntary departures. This study identifies the factors that affect attrition and establishes a predictive model for employee attrition. The study first focuses on the problem statement and a breakdown of what attrition does to an organization, followed by a detailed conceptual treatment of attrition discussed in the light of predictive modeling and the supporting prior research. The research involves data preprocessing, chi-square versus logistic regression for feature selection, machine learning models, and their comparison using the confusion matrix, precision, recall, and F1-scores. The best result was obtained with the logistic regression model with feature selection: an accuracy of 86% and a recall of 98% for the attrition class (class 1). The researchers want to change the view of how the attrition problem is tackled: rather than knowing whom to retain, the organization should know whom to hire. This research sets a ground rule and tries to change the perspective on tackling the attrition problem.
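The pipeline the abstract outlines, chi-square feature selection feeding a logistic-regression classifier, evaluated with a confusion matrix and precision/recall/F1, can be sketched as below. This is a hypothetical reconstruction on synthetic data; the real HR features, the number of selected features, and the class balance are assumptions.

```python
# Illustrative sketch (not the study's code): chi-square feature selection
# followed by logistic regression for attrition prediction (1 = left, 0 = stayed).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for an HR dataset with an imbalanced attrition class.
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

# chi2 requires non-negative inputs, hence the min-max scaling step.
clf = make_pipeline(MinMaxScaler(),
                    SelectKBest(chi2, k=10),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred, digits=2))
```

Putting the selector inside the pipeline matters: it is refit on each training fold, so the test set never leaks into feature selection.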


Predictive analysis comprises a wide variety of statistical techniques, such as machine learning, predictive modelling, and data mining, and uses current and historical data to predict future outcomes. It is applied in both the business and the educational domain. This paper aims to give an overview of the leading work done so far in this field. We summarize classical as well as recent (machine-learning-based) approaches to predictive analysis. Main aspects such as feature selection and algorithm selection, along with corresponding applications, are explained. Some of the most-cited papers in this field, along with their objectives, are listed in a table. This paper can give a good head start to anyone who wants to understand and use predictive analysis for an academic or business application.


2020 ◽  
Author(s):  
David O’Connor ◽  
Evelyn M.R. Lake ◽  
Dustin Scheinost ◽  
R. Todd Constable

Abstract: It is a long-standing goal of neuroimaging to produce reliable, generalizable models of brain–behavior relationships. More recently, data-driven predictive models have become popular. Overfitting is a common problem with statistical models and impedes model generalization. Cross-validation (CV) is often used to give more balanced estimates of performance. However, CV does not provide guidance on how best to apply the generated models out-of-sample. As a solution, this study proposes an ensemble learning method, in this case bootstrap aggregating, or bagging, encompassing both model parameter estimation and feature selection. Here we investigate the use of bagging when generating predictive models of fluid intelligence (fIQ) from functional connectivity (FC). We take advantage of two large openly available datasets, the Human Connectome Project (HCP) and the Philadelphia Neurodevelopmental Cohort (PNC). We generate bagged and non-bagged models of fIQ in the HCP. Over various train-test splits, these models are evaluated in-sample, on left-out HCP data, and out-of-sample, on PNC data. We find that in-sample, a non-bagged model performs best; out-of-sample, however, the bagged models perform best. We also find that feature selection can vary substantially within-sample. A more considered approach to feature selection, alongside data-driven predictive modeling, is needed to improve the cross-sample performance of FC-based brain–behavior models.
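The core idea, bagging that wraps both feature selection and parameter estimation inside each bootstrap replicate, can be sketched as follows. This is a simplified illustration on synthetic regression data, not the study's FC pipeline; the selector, base model, and bag count are assumptions.

```python
# Illustrative sketch: bagging in which every bootstrap replicate runs its
# own feature selection before fitting, and predictions are averaged.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

n_bags, preds = 25, []
for _ in range(n_bags):
    idx = rng.integers(0, len(X_tr), len(X_tr))      # bootstrap resample
    bag = make_pipeline(SelectKBest(f_regression, k=10), Ridge())
    bag.fit(X_tr[idx], y_tr[idx])                    # selection + estimation per bag
    preds.append(bag.predict(X_te))

y_hat = np.mean(preds, axis=0)                       # aggregate (the "bagging" step)
r = np.corrcoef(y_hat, y_te)[0, 1]
print(f"bagged prediction-observation correlation: {r:.2f}")
```

Because each bag selects features from a different resample, the aggregate model is less tied to one particular feature subset, which is the property the study exploits for out-of-sample transfer.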


2019 ◽  
Author(s):  
Tamas Spisak ◽  
Balint Kincses ◽  
Ulrike Bingel

Abstract: In a recent study, Dadi and colleagues make recommendations on optimal parameters for functional-connectome-based predictive models. While the authors acknowledge that “optimal choices of parameters will differ on datasets with very different properties”, some questions regarding the universality of the recommended “default values” remain unanswered. Namely, as already briefly discussed by Dadi et al., the datasets used in the target study might not be representative with regard to the sparsity of the (hidden) ground truth (i.e. the number of non-informative connections), which might affect the performance of L1- and L2-regularization approaches and of feature selection. Here we exemplify that, at least in one of the investigated datasets, systematic motion artefacts might bias the discriminative signal towards “non-sparsity”, which might lead to underestimating the performance of L1-regularized models and feature selection. We conclude that the expected sparsity of the discriminative signal should be carefully considered when planning predictive modelling workflows, and that the neuroscientific validity of predictive models should be investigated to account for non-neural confounds.
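The sparsity argument can be demonstrated with a small simulation: when only a few predictors carry signal, L1 regularization recovers them and tends to outperform L2, whereas with a dense ground truth the advantage shrinks. This is a toy illustration of the general point, not the commentary's analysis; the dimensions and penalty strengths are arbitrary assumptions.

```python
# Toy demonstration: sparse vs dense ground truth and L1 vs L2 performance.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 200, 100
X = rng.standard_normal((n, p))

def fit_score(k_informative):
    """Return held-out R^2 for L1 and L2 when k features are truly informative."""
    w = np.zeros(p)
    w[:k_informative] = 1.0                      # ground-truth weights
    y = X @ w + rng.standard_normal(n)           # unit-variance noise
    X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]
    l1 = Lasso(alpha=0.1).fit(X_tr, y_tr).score(X_te, y_te)
    l2 = Ridge(alpha=10.0).fit(X_tr, y_tr).score(X_te, y_te)
    return l1, l2

sparse = fit_score(5)    # sparse truth: L1 should have the edge
dense = fit_score(80)    # dense truth: L1's edge shrinks or reverses
print(f"sparse truth  L1={sparse[0]:.2f}  L2={sparse[1]:.2f}")
print(f"dense truth   L1={dense[0]:.2f}  L2={dense[1]:.2f}")
```

If a confound such as motion spreads discriminative signal over many connections, the effective ground truth looks dense, which is exactly the regime in which L1 methods are undersold.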


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 1144
Author(s):  
Hu Ng ◽  
Azmin Alias bin Mohd Azha ◽  
Timothy Tzen Vun Yap ◽  
Vik Tor Goh

Background – Many factors affect student performance, such as the individual’s background, habits, absenteeism, and social activities. Using these factors, corrective actions can be determined to improve performance. This study presents a data mining approach to identifying significant factors and predicting student performance, based on two datasets collected from two secondary schools in Portugal. Methods – First, the data is augmented to increase the sample size by merging the two datasets. Following that, data pre-processing is performed and the features are normalized with linear scaling to avoid bias towards heavily weighted attributes. The features are then assigned to four groups: student background, lifestyle, history of grades, and all features. Next, Boruta feature selection is performed to remove irrelevant features. Finally, classification models based on Support Vector Machine (SVM), Naïve Bayes (NB), and Multilayer Perceptron (MLP) are designed and their performance evaluated. Results – The models were trained and evaluated on an integrated dataset comprising 1044 student records with 33 features after feature selection. Classification was performed with SVM, NB, and MLP using 60-40 and 50-50 train-test splits and 10-fold cross-validation. GridSearchCV was applied for hyperparameter tuning. The performance metrics were accuracy, precision, recall, and F1-score. SVM obtained the highest accuracy for binary classification (pass or fail), with scores of 77%, 80%, 91%, and 90% on background, lifestyle, history of grades, and all features respectively, in 50-50 train-test splits.
SVM also obtained the highest accuracy for five-class classification (grades A, B, C, D, and F), with 39%, 38%, 73%, and 71% for the four feature groups respectively.
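The evaluation setup described, scaled features, feature selection, an SVM classifier, and GridSearchCV tuning on a 50-50 split, can be sketched with scikit-learn. This is an assumed reconstruction on synthetic data: Boruta (a third-party package) is replaced here by univariate selection to keep the sketch dependency-free, and the parameter grid is illustrative.

```python
# Illustrative sketch (Boruta swapped for sklearn's univariate selection):
# scaling + feature selection + SVM, tuned with GridSearchCV, 50-50 split.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the merged 1044-record, 33-feature student dataset.
X, y = make_classification(n_samples=1044, n_features=33, n_informative=8,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=7)

pipe = Pipeline([("scale", MinMaxScaler()),        # linear scaling step
                 ("select", SelectKBest(f_classif)),
                 ("svm", SVC())])
grid = {"select__k": [10, 20, 33],
        "svm__C": [0.1, 1, 10],
        "svm__kernel": ["linear", "rbf"]}

# 10-fold CV on the training half, exactly one tuned model tested once.
search = GridSearchCV(pipe, grid, cv=10, scoring="accuracy")
search.fit(X_tr, y_tr)
acc = accuracy_score(y_te, search.predict(X_te))
print(f"best params: {search.best_params_}, test accuracy: {acc:.2f}")
```

Running the selector and classifier inside one pipeline lets GridSearchCV tune the number of kept features jointly with the SVM hyperparameters.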


Author(s):  
Lindsey M. Kitchell ◽  
Francisco J. Parada ◽  
Brandi L. Emerick ◽  
Tom A. Busey
