Evaluating Imputation Methods to Improve Data Availability in a Software Estimation Dataset

Missing or partial data is a problem prevalent in most datasets used for statistical analysis. In this study, we analyzed the missing values in the ISBSG R1 2018 dataset and addressed the problem through imputation, a machine learning technique that can increase the availability of data. We compare the performance of four imputation methods: Classification and Regression Trees (CART), Polynomial Regression (PR), Predictive Mean Matching (PMM), and Random Forest (RF), applied to the ISBSG R1 2018 dataset available from the International Software Benchmarking Standards Group. Through imputation, we were able to increase data availability fourfold. We also evaluated the performance of these methods against the original dataset without imputation using an ensemble of Linear Regression, Gradient Boosting, Random Forest, and ANN models. Imputation using CART increases the availability of the overall dataset, but only at the cost of some of the model's predictive capability. CART nevertheless remains the option of choice for extending the usability of the data, since it retains rows that traditional methods would simply remove from the dataset. In our experiments, this approach increased the usability of the original dataset to 63%, at the cost of a 2 to 3% decrease in overall predictive performance.
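
As a hedged illustration of this comparison (not the authors' pipeline; the data, features, and target below are synthetic stand-ins for ISBSG fields), scikit-learn's IterativeImputer can impute with a CART-like or an RF estimator, and the result can then be scored by a downstream regression:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))                      # stand-in project features
y = X @ rng.normal(size=6) + rng.normal(size=300)  # stand-in effort target
X[rng.random(X.shape) < 0.25] = np.nan             # 25% of cells go missing

for name, est in [("CART", DecisionTreeRegressor(max_depth=5, random_state=0)),
                  ("RF", RandomForestRegressor(n_estimators=100, random_state=0))]:
    X_imp = IterativeImputer(estimator=est, max_iter=10,
                             random_state=0).fit_transform(X)
    score = cross_val_score(LinearRegression(), X_imp, y, cv=5).mean()
    print(f"{name}: downstream CV R^2 = {score:.3f}")
```

The key design point mirrored here is that imputation quality is judged by the predictive performance it enables downstream, not by the imputed values in isolation.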

Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student's academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student's performance as early as the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbour (knn) data imputation technique to handle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using plain accuracy as the evaluation metric and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
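
The workflow described, kNN imputation followed by an ensemble classifier scored with imbalance-aware metrics, can be expressed compactly. The sketch below uses scikit-learn and synthetic data rather than the study's R-based tooling and private student records:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # demographic/academic features
X[rng.random(X.shape) < 0.10] = np.nan        # 10% missing values
y = rng.integers(0, 2, 200)                   # pass/fail-style label

# kNN imputation sits inside the pipeline so each CV fold imputes
# using only its own training data, avoiding leakage
pipe = make_pipeline(KNNImputer(n_neighbors=5),
                     RandomForestClassifier(n_estimators=300, random_state=0))
scores = cross_validate(pipe, X, y, cv=10,
                        scoring=["balanced_accuracy", "f1"])
print("balanced accuracy:", scores["test_balanced_accuracy"].mean())
print("F-score:", scores["test_f1"].mean())
```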


Author(s):  
V. Jinubala ◽  
P. Jeyakumar

Data mining is an emerging research field in the analysis of agricultural data. One of the most important problems in extracting knowledge from agricultural data is the presence of missing values for attributes in the selected data set. Such deficiencies need to be cleaned during preprocessing in order to obtain a functional data set. The main objective of this paper is to analyse the effectiveness of various imputation methods in producing a complete data set that is more useful for applying data mining techniques, and to present a comparative analysis of imputation methods for handling missing values. The pest data set of the rice crop, collected throughout Maharashtra state under the Crop Pest Surveillance and Advisory Project (CROPSAP) during 2009-2013, was used for the analysis. Different methodologies, namely deletion of rows, mean and median imputation, linear regression, and Predictive Mean Matching, were analysed for the imputation of missing values. The comparative analysis shows that Predictive Mean Matching outperformed the other methods and is effective for the imputation of missing values in large data sets.
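
Predictive Mean Matching is easy to state in code: regress the incomplete variable on its predictors, then for each missing case sample an observed value from the donors whose predicted means are closest. The sketch below is a minimal single-column illustration on synthetic data, not the CROPSAP analysis pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(X, y, n_donors=5, seed=0):
    """Predictive mean matching for one numeric column y given predictors X:
    regress y on X over the observed cases, then for each missing case draw
    the observed value of one of the n_donors closest predicted means."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    pred = LinearRegression().fit(X[obs], y[obs]).predict(X)
    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        dist = np.abs(pred[obs] - pred[i])     # closeness of donor means
        donors = np.argsort(dist)[:n_donors]   # indices into observed cases
        y_imp[i] = rng.choice(y[obs][donors])  # sample one donor's real value
    return y_imp

# demo on synthetic data standing in for a pest-incidence column
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                  # e.g., weather covariates
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=200)
y[rng.random(200) < 0.2] = np.nan              # 20% missing
print("imputed values:", pmm_impute(X, y)[np.isnan(y)][:5])
```

Because PMM only ever fills in values that were actually observed, it preserves the variable's distribution better than plugging in model predictions or column means.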


Author(s):  
Gi-Wook Cha ◽  
Hyeun-Jun Moon ◽  
Young-Chan Kim

Construction and demolition waste (DW) generation information has been recognized as a tool for providing useful information for waste management. Recently, numerous researchers have actively utilized artificial intelligence technology to establish accurate waste generation information. This study investigated the development of machine learning predictive models that can achieve reliable predictive performance on small datasets composed of categorical variables. To this end, the random forest (RF) and gradient boosting machine (GBM) algorithms were adopted. To develop the models, 690 building datasets were established using data preprocessing and standardization. Hyperparameter tuning was performed to develop the RF and GBM models. The model performances were evaluated using the leave-one-out cross-validation technique. The study demonstrated that, for small datasets comprising mainly categorical variables, the bagging technique (RF) produced more stable and accurate predictions than the boosting technique (GBM). However, GBM models demonstrated excellent predictive performance in some DW predictive models. Furthermore, the RF and GBM predictive models demonstrated significantly differing performance across different types of DW. Certain RF and GBM models demonstrated relatively low predictive performance, but the remaining predictive models all demonstrated excellent predictive performance, with R² values > 0.6 and R values > 0.8. Such differences are mainly due to the characteristics of the features applied to model development; we expect the application of additional features to improve the performance of the predictive models. The 11 DW predictive models developed in this study will be useful for establishing detailed DW management strategies.
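
As a rough sketch of this evaluation setup (with invented categorical building attributes standing in for the study's features), leave-one-out predictions for RF and GBM can be produced with scikit-learn as follows:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 120                                        # a small dataset, as in the study
df = pd.DataFrame({
    "structure": rng.choice(["RC", "masonry", "wood"], n),
    "usage": rng.choice(["residential", "commercial"], n),
    "region": rng.choice(["urban", "rural"], n),
})
X = pd.get_dummies(df)                         # one-hot encode the categoricals
y = 10 * (df["structure"] == "RC") + 5 * (df["usage"] == "commercial") \
    + rng.normal(0, 2, n)                      # synthetic waste quantity

for name, est in [("RF", RandomForestRegressor(n_estimators=100, random_state=0)),
                  ("GBM", GradientBoostingRegressor(random_state=0))]:
    pred = cross_val_predict(est, X, y, cv=LeaveOneOut())
    print(f"{name}: LOOCV R^2 = {r2_score(y, pred):.3f}")
```

Leave-one-out is the natural choice at this scale: with so few rows, holding out larger folds would waste scarce training data and inflate the variance of the estimate.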


2021 ◽  
Vol 4 ◽  
Author(s):  
Sebastian Jäger ◽  
Arndt Allhorn ◽  
Felix Bießmann

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications as well, high data quality standards are crucial to ensure robust predictive performance and responsible use of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when not detected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods, when either only the test data or both the training and test data are affected by missing values. Each imputation method is evaluated with respect to its imputation quality and the impact imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help guide researchers and engineers in selecting data preprocessing methods for automated data quality improvement.
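
As a minimal illustration of this benchmark design (not the authors' code), the following snippet injects MCAR missingness into a complete dataset, imputes with two scikit-learn methods, and reports both imputation RMSE on the masked cells and downstream classification accuracy:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.2               # 20% MCAR missingness
X_miss = np.where(mask, np.nan, X)

for name, imp in [("mean", SimpleImputer()),
                  ("iterative", IterativeImputer(random_state=0))]:
    X_imp = imp.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))   # imputation quality
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X_imp, y, cv=5).mean()       # downstream impact
    print(f"{name}: imputation RMSE={rmse:.3f}, downstream acc={acc:.3f}")
```

The two scores can disagree: a method with worse cell-level RMSE can still leave the downstream classifier unharmed, which is exactly why the benchmark evaluates both.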


2019 ◽  
Vol 12 (1) ◽  
pp. 05-27
Author(s):  
Everton Jose Santana ◽  
João Augusto Provin Ribeiro da Silva ◽  
Saulo Martiello Mastelini ◽  
Sylvio Barbon Jr

Investing in the stock market is a complex process due to its high volatility, caused by factors such as exchange rates, political events, inflation, and market history. To support investors' decisions, the prediction of future stock prices and economic metrics is valuable. Under the hypothesis that there is a relation among investment performance indicators, the goal of this paper was to explore multi-target regression (MTR) methods to estimate six different indicators and to determine, in terms of predictive performance, which method would best suit an automated prediction tool for decision support. The experiments were based on 4 datasets, corresponding to 4 different time periods, each composed of 63 combinations of weights of stock-picking concepts, simulated in the US stock market. We compared traditional machine learning approaches with seven state-of-the-art MTR solutions: Stacked Single Target, Ensemble of Regressor Chains, Deep Structure for Tracking Asynchronous Regressor Stacking, Deep Regressor Stacking, Multi-output Tree Chaining, Multi-target Augmented Stacking, and Multi-output Random Forest (MORF). With the exception of MORF, the traditional approaches and the MTR methods were evaluated with Extreme Gradient Boosting, Random Forest, and Support Vector Machine regressors. By means of extensive experimental evaluation, our results showed that the most recent MTR solutions can achieve suitable predictive performance, improving on all the scenarios (by 14.70% in the best case, considering all target variables and periods). In this sense, MTR is a proper strategy for building stock market decision support systems based on prediction models.
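
The snippet below is a hedged sketch of two strategies in the MTR family named above, using scikit-learn's multioutput wrappers: independent single-target models versus a regressor chain that feeds earlier targets into later ones. The six synthetic targets merely stand in for the paper's investment indicators:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(252, 10))                    # e.g., daily market features
Y = X @ rng.normal(size=(10, 6)) + 0.1 * rng.normal(size=(252, 6))  # 6 indicators

base = RandomForestRegressor(n_estimators=100, random_state=0)
for name, model in [("single-target", MultiOutputRegressor(base)),
                    ("regressor chain", RegressorChain(base))]:
    r2 = cross_val_score(model, X, Y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {r2:.3f}")
```

The chain variant is the simplest embodiment of the paper's hypothesis: if the indicators are related, letting one target's prediction inform the next should help.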


2021 ◽  
Author(s):  
Sara Javadi ◽  
Abbas Bahrampour ◽  
Mohammad Mehdi Saber ◽  
Mohammad Reza Baneshi

Abstract Background: Among the newer multiple imputation methods, Multiple Imputation by Chained Equations (MICE) is a popular approach because of its flexibility. Our main focus in this study is to compare the performance of parametric imputation models based on predictive mean matching with that of recursive partitioning methods in multiple imputation by chained equations, in the presence of interaction in the data. Methods: We compared the performance of parametric and tree-based imputation methods via simulation using two data generation models. For each combination of data generation model and imputation method, the following steps were performed: data generation, removal of observations, imputation, logistic regression analysis, and calculation of bias, Coverage Probability (CP), and Confidence Interval (CI) width for each coefficient. Furthermore, model-based and empirical SE, and the estimated proportion of the variance attributable to the missing data (λ), were calculated. Results: We show by simulation that, to impute a binary response in observations involving an interaction, manually entering the interaction term into the predictive mean matching imputation model improves the performance of the PMM method compared with the recursive partitioning models in multiple imputation by chained equations. The parametric method in which we entered the interaction term into the imputation model (MICE-Interaction) led to smaller bias and slightly higher coverage probability for the interaction effect, but it had slightly wider confidence intervals than tree-based imputation (especially classification and regression trees). Conclusions: The application of MICE-Interaction led to better performance than recursive partitioning methods in MICE; however, if the user is interested in estimating the interaction but does not know enough about the structure of the observations, recursive partitioning methods can be suggested for imputing the missing values.
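
For readers who want to reproduce the idea in Python rather than R, the sketch below uses statsmodels' MICE implementation (whose default column imputer is, to our understanding, predictive mean matching) and manually enters the interaction term into both the imputation formula and the analysis model. The data are synthetic, and the API details should be checked against the statsmodels documentation:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation.mice import MICE, MICEData

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
logit = -0.5 + df.x1 + df.x2 + 0.8 * df.x1 * df.x2       # true interaction
df["y"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)
df.loc[rng.random(n) < 0.2, "y"] = np.nan                 # 20% missing responses

imp = MICEData(df)                               # PMM-based column imputation
imp.set_imputer("y", formula="x1 + x2 + x1:x2")  # interaction in imputation model
mice = MICE("y ~ x1 + x2 + x1:x2", sm.GLM, imp,
            init_kwds={"family": sm.families.Binomial()})
print(mice.fit(n_burnin=10, n_imputations=10).summary())
```

The essential step mirrored from the paper is the second line: without the explicit x1:x2 term, the default imputation model is purely additive and cannot reproduce the interaction structure.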


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Marietta Kokla ◽  
Jyrki Virtanen ◽  
Marjukka Kolehmainen ◽  
Jussi Paananen ◽  
Kati Hanhineva

Abstract Background LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. No matter the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce bias in the statistical analysis. Results Here we present our results from evaluating nine imputation methods at four different percentages of missing values of different origin. The performance of each imputation method was analyzed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of values Missing at Random (MAR) and Missing Completely at Random (MCAR). In the case of values absent because they are Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum-value imputation. We also tested the different imputation methods on datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times using metabolomics datasets in which missing values were introduced to represent absent data of different origin. Conclusion The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend random forest-based imputation for missing metabolomics data, especially in situations where the types of missingness are not known in advance.
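
A hedged sketch of the evaluation loop described above: mask entries of a complete (here synthetic) abundance matrix, impute with an RF-based, missForest-style imputer (approximated with scikit-learn's IterativeImputer) and with minimum-value imputation, then compare NRMSE on the masked cells:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=(100, 30)))           # synthetic abundance matrix
mask = rng.random(X.shape) < 0.10                # 10% MCAR missingness
X_miss = np.where(mask, np.nan, X)

def nrmse(truth, imputed, mask):
    """Normalized RMSE evaluated on the artificially masked cells only."""
    err = truth[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(truth[mask])

rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=100,
                                                      random_state=0),
                      max_iter=5, random_state=0)
X_rf = rf.fit_transform(X_miss)
X_min = np.where(mask, np.nanmin(X_miss, axis=0), X_miss)  # column minima
print("RF NRMSE: ", nrmse(X, X_rf, mask))
print("min NRMSE:", nrmse(X, X_min, mask))
```

Under MNAR left-truncation (values below the detection limit), the roles reverse and the column-minimum imputer wins, which is the paper's central practical point.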


2021 ◽  
Author(s):  
Justin Andrews ◽  
Sheldon Gorell

Abstract Missing values and incomplete observations can exist in just about every type of recorded data. With analytical modeling, and machine learning in particular, the quantity and quality of the available data are paramount to acquiring reliable results. Within the oil industry alone, priorities as to which data are important can vary from company to company, so the available knowledge of a single field varies from place to place. Because machine learning requires very complete sets of data, whole portions of data may have to be discarded in order to create an appropriate dataset. Value imputation has emerged as a valuable solution for cleaning up datasets, and as technology has advanced, new generative machine learning methods have been used to create images and data that are all but indistinguishable from reality. Using an adaptation of the standard Generative Adversarial Networks (GAN) approach known as a Generative Adversarial Imputation Network (GAIN), this paper evaluates this method and other imputation methods for filling in missing values. Starting from a fully observed dataset, smaller datasets with randomly masked missing values were generated to validate the effectiveness of the various imputation methods, allowing comparisons against the original dataset. The study found that, across various missing-data percentages, the filled-in data could be used with surprising accuracy for further analytics. This paper compares GAIN and several commonly used imputation methods against more standard practices such as data cropping or filling in missing data with average values. GAIN, as well as each of the other imputation methods described, is quantified in terms of its ability to fill in data. The study discusses how the GAIN model can quickly provide the data necessary for analytical studies and for predicting the results of future projects.
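
GAIN itself requires an adversarial generator/discriminator training loop (Yoon et al., 2018) that does not fit in a short sketch, but the validation protocol the paper describes is simple to express. The harness below masks a fully observed matrix, runs any imputer, and scores it against the known ground truth, with mean imputation standing in as the baseline:

```python
import numpy as np
from sklearn.impute import SimpleImputer

def evaluate_imputer(impute_fn, X_full, missing_rate, seed=0):
    """Mask a fraction of entries, impute, and return RMSE on the
    masked cells only (observed cells are left untouched)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X_full.shape) < missing_rate
    X_miss = np.where(mask, np.nan, X_full)
    X_imp = impute_fn(X_miss)
    return np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))

X_full = np.random.default_rng(1).normal(size=(500, 12))
for rate in (0.1, 0.3, 0.5):
    rmse = evaluate_imputer(
        lambda X: SimpleImputer(strategy="mean").fit_transform(X), X_full, rate)
    print(f"missing={rate:.0%}  mean-impute RMSE={rmse:.3f}")
# a GAIN implementation (or any other imputer) plugs in as another impute_fn
```

Because the masked cells' true values are known, this protocol gives a direct accuracy measure that real-world missing data never allows, which is why the study starts from a fully observed set.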

