Handling incomplete data classification using imputed feature selected bagging (IFBag) method

2021 ◽  
Vol 25 (4) ◽  
pp. 825-846
Author(s):  
Ahmad Jaffar Khan ◽  
Basit Raza ◽  
Ahmad Raza Shahid ◽  
Yogan Jaya Kumar ◽  
Muhammad Faheem ◽  
...  

Almost all real-world datasets contain missing values. Classification of data with missing values can adversely affect classifier performance if not handled correctly. A common approach to classification with incomplete data is imputation, which transforms incomplete data with missing values into complete data. Single imputation methods are generally less accurate than multiple imputation methods, which in turn are often far more computationally expensive. This study proposes an imputed feature selected bagging (IFBag) method that combines multiple imputation, feature selection and a bagging ensemble learning approach to construct a number of base classifiers, so that new incomplete instances can be classified without any need for imputation in the testing phase. In the bagging ensemble learning approach, data is resampled multiple times with replacement, which introduces diversity into the data and thus yields more accurate classifiers. The experimental results show the proposed IFBag method is considerably fast and achieves 97.26% accuracy on classification with incomplete data compared to commonly used methods.
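The abstract does not spell out the IFBag algorithm, but its general idea — bootstrap resampling with replacement, per-replicate imputation and feature selection, then voting only among base classifiers whose selected features are observed in a test instance — can be sketched as follows. The toy data, the choice of base learner, and the observed-features voting rule are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy complete data with injected MCAR missingness.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

ensemble = []
for b in range(25):                        # number of bagged base classifiers
    idx = rng.integers(0, len(X), len(X))  # bootstrap: resample WITH replacement
    Xb, yb = X[idx], y[idx]
    Xb = SimpleImputer(strategy="mean").fit_transform(Xb)   # impute this replicate
    sel = SelectKBest(f_classif, k=3).fit(Xb, yb)           # per-replicate feature selection
    clf = DecisionTreeClassifier(max_depth=3, random_state=b).fit(sel.transform(Xb), yb)
    ensemble.append((sel.get_support(indices=True), clf))

def predict(x):
    """Vote only among base classifiers whose selected features are observed in x."""
    votes = [clf.predict(x[feats].reshape(1, -1))[0]
             for feats, clf in ensemble
             if not np.isnan(x[feats]).any()]
    return int(round(np.mean(votes))) if votes else 0

preds = np.array([predict(x) for x in X])
acc = (preds == y).mean()
print(f"ensemble accuracy on incomplete data: {acc:.2f}")
```

Because each base classifier only ever needs its own small feature subset, an incomplete test instance can usually find enough voters without any test-time imputation, which is the property the abstract emphasizes.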

Transmisi ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 102-106
Author(s):  
Farrikh Alzami ◽  
Aries Jehan Tamamy ◽  
Ricardus Anggi Pramunendar ◽  
Zaenal Arifin

The ensemble learning approach, especially in classification, has been widely applied and is successful in many domains, but unfortunately few ensemble approaches are used for the detection and classification of epilepsy in biomedical settings. Rather than using a simple bagging ensemble framework, we propose a fusion bagging-based ensemble framework (FBEF) that uses three weak learners in each oracle; through fusion rules, the weak learners jointly produce the oracle's prediction. All oracle predictors are then combined under a trust factor to obtain a better prediction and classification. Compared to traditional ensemble bagging and single-learner ensemble bagging, our framework outperforms similar work on epileptic seizure classification, achieving 98.11±0.68 on that task and on several real-world datasets.
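One plausible reading of the FBEF structure — each oracle built on a bootstrap sample, three different weak learners fused by a majority-vote rule, and oracle outputs combined under a trust factor — can be sketched as below. The specific learners, the majority-vote fusion rule, and the use of out-of-bag accuracy as the trust factor are all assumptions for illustration, not details taken from the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

oracles = []
for b in range(10):
    idx = rng.integers(0, len(X), len(X))          # bootstrap sample for this oracle
    oob = np.setdiff1d(np.arange(len(X)), idx)     # out-of-bag rows for the trust factor
    learners = [DecisionTreeClassifier(max_depth=3, random_state=b).fit(X[idx], y[idx]),
                GaussianNB().fit(X[idx], y[idx]),
                KNeighborsClassifier(3).fit(X[idx], y[idx])]

    def fuse(Z, ls=learners):                      # fusion rule: majority vote of the 3 learners
        return (np.mean([l.predict(Z) for l in ls], axis=0) >= 0.5).astype(int)

    trust = (fuse(X[oob]) == y[oob]).mean() if len(oob) else 0.5
    oracles.append((fuse, trust))

# Trust-weighted combination of all oracle predictors.
scores = sum(t * f(X) for f, t in oracles) / sum(t for _, t in oracles)
preds = (scores >= 0.5).astype(int)
print(f"FBEF-style accuracy: {(preds == y).mean():.2f}")
```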


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not tied to any particular scientific domain; it arises in economics, sociology, education, the behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Missing values were then imputed with MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest imputation error rates for all types of missingness, whereas PCA‑based imputation does not perform well in terms of accuracy.
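The evaluation protocol described above — inject artificial missingness into complete data, impute, then score NRMSE on the removed entries — can be reproduced in miniature. Here scikit-learn's `IterativeImputer` stands in for MICE-style chained equations; the synthetic correlated data and the 20% MCAR rate are illustrative assumptions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(2)
X_true = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))  # correlated columns
mask = rng.random(X_true.shape) < 0.2          # 20% MCAR missingness
X_miss = np.where(mask, np.nan, X_true)

def nrmse(X_imp):
    """Normalised RMSE computed over the artificially removed entries only."""
    err = X_imp[mask] - X_true[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(X_true[mask])

mean_imp = SimpleImputer(strategy="mean").fit_transform(X_miss)
mice_imp = IterativeImputer(random_state=0, max_iter=10).fit_transform(X_miss)

print(f"NRMSE, mean imputation:     {nrmse(mean_imp):.3f}")
print(f"NRMSE, chained equations:   {nrmse(mice_imp):.3f}")
```

Because every column here is a linear combination of the others, the chained-equations imputer should clearly beat column-mean imputation, mirroring how NRMSE separates methods in the study.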


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Kamran Mehrabani-Zeinabad ◽  
Marziyeh Doostfatemeh ◽  
Seyyed Mohammad Taghi Ayatollahi

Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. Various imputation methods can be applied, and the choice of the best method depends on dataset conditions such as sample size, missing percentage, and missing mechanism. A better solution, therefore, is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to address its inefficiency in handling missingness. The implementation of MIA within BART is named “BART.m”. As the abilities of BART.m in classifying incomplete datasets had not been investigated, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90% missing values and, more importantly, it diagnoses irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m can be regarded as a high-accuracy model for the classification of incomplete datasets that avoids distributional assumptions and preprocessing steps.
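The core of the MIA idea is that each candidate tree split is evaluated twice — once routing missing values left, once right — and the better direction is kept, so missingness itself becomes informative. A minimal single-split sketch of that evaluation (not the BART.m implementation, and with toy data invented for the example) might look like:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = (x > 0).astype(int)
x[rng.random(400) < 0.3] = np.nan          # 30% missing; labels are kept

def mia_split(x, y, threshold):
    """Evaluate one split under MIA: try routing missing values left, then right,
    and keep whichever direction yields the lower misclassification error."""
    miss = np.isnan(x)
    best = None
    for miss_goes_left in (True, False):
        left = np.where(miss, miss_goes_left, x <= threshold)
        err = 0.0
        for side in (left, ~left):
            if side.any():
                p = y[side].mean()                  # class-1 proportion in this child
                err += min(p, 1 - p) * side.sum()   # misclassified count under majority vote
        err /= len(y)
        if best is None or err < best[0]:
            best = (err, miss_goes_left)
    return best

err, miss_left = mia_split(x, y, threshold=0.0)
print(f"split error = {err:.3f}, route missing left = {miss_left}")
```

A full tree learner would run this search over many thresholds and features; no imputation step is ever needed, which is why the abstract can claim classification of incomplete data without preprocessing.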


Author(s):  
Thelma Dede Baddoo ◽  
Zhijia Li ◽  
Samuel Nii Odai ◽  
Kenneth Rodolphe Chabi Boni ◽  
Isaac Kwesi Nooni ◽  
...  

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies of missing data imputation that investigate how to ascertain the accuracy of imputation algorithms for real-world datasets are lacking. This study investigated the complexity of missing data reconstruction schemes needed to obtain relevant results for a real-world single-station streamflow record, to facilitate its further use. The investigation applied different missing data mechanisms, spanning univariate algorithms to multiple imputation methods suited to multivariate data, taking time as an explicit variable. The performance accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones that provide the best results are usually time- and computationally intensive. Also, multiple imputation algorithms that consider the surrounding observed values and/or can capture the characteristics of the data provide results similar to the univariate algorithms and, in some cases, perform better without the added time and computational costs when time is taken as an explicit variable. Furthermore, the LEM is especially useful when the missing data fall in specific portions of the dataset or where very large gaps of missingness occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on an extensive study of the particular dataset to be imputed.
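The contrast the study draws — a univariate scheme that interpolates over the series itself versus a scheme that takes time as an explicit variable — can be illustrated on a synthetic daily streamflow series with a contiguous gap. The seasonal signal, gap position, and cubic time trend are assumptions made for this sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
t = np.arange(365)
flow = 50 + 20 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, 365)  # synthetic daily flow
s = pd.Series(flow, index=t)
truth = s.copy()
s.iloc[100:115] = np.nan                       # a contiguous 15-day gap

# Univariate scheme: interpolate over the series itself.
interp = s.interpolate(method="linear")

# Scheme with time as an explicit variable: regression on a cubic time trend.
obs = s.dropna()
coef = np.polyfit(obs.index, obs.values, deg=3)
regress = s.fillna(pd.Series(np.polyval(coef, t), index=t))

gap = slice(100, 115)
for name, est in [("interpolation", interp), ("time regression", regress)]:
    rmse = np.sqrt(np.mean((est.iloc[gap] - truth.iloc[gap]) ** 2))
    print(f"{name}: gap RMSE = {rmse:.2f}")
```

Scoring the error only inside the gap, as done here, is in the spirit of the study's localized error measurement (LEM), as opposed to a total error over the whole series.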


2015 ◽  
Vol 19 ◽  
pp. 56-67 ◽  
Author(s):  
Mohamad M. Al Rahhal ◽  
Yakoub Bazi ◽  
Naif Alajlan ◽  
Salim Malek ◽  
Haikel Al-Hichri ◽  
...  

2017 ◽  
Vol 10 (19) ◽  
pp. 1-7 ◽  
Author(s):  
Geeta Chhabra ◽  
Vasudha Vashisht ◽  
Jayanthi Ranjan ◽  
...  

2021 ◽  
Author(s):  
Shuo Feng ◽  
Celestin Hategeka ◽  
Karen Ann Grépin

Abstract Background: Poor data quality is limiting the greater use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important part of this issue comes from missing values, where health facilities, for a variety of reasons, fail to submit their reports to the central system. Methods: Using data from the Health Management Information System (HMIS) and the advent of the COVID-19 pandemic in the Democratic Republic of the Congo (DRC) as an illustrative case study, we implemented six commonly used imputation methods on the DRC's HMIS datasets and evaluated their performance through various statistical techniques: simple linear regression, segmented regression (widely used in interrupted time series studies), parametric comparisons through t-tests, and non-parametric comparisons through Wilcoxon rank-sum tests. We also examined the performance of these six imputation methods under different missing mechanisms and tested their stability to changes in the data. Results: For regression analyses, there was no substantial difference in the results generated by the methods, except for mean imputation and exclusion & interpolation, when the RHIS dataset contained less than 20% missing values. However, as the missing proportion grew, machine learning methods such as missForest and k-NN started to produce biased estimates, and they were also found to lack robustness to minimal changes in the data and to consecutive missingness. On the other hand, multiple imputation generated the most unbiased estimates overall and was the most robust to all changes in the data. For comparing group means through t-tests, the results from mean imputation and exclusion & interpolation disagreed with the true inference obtained from the complete data, suggesting that these two methods not only lead to biased regression estimates but also generate unreliable t-test results.
Conclusions: We recommend the use of multiple imputation for addressing missing values in RHIS datasets. Where the computing resources necessary for multiple imputation are unavailable, seasonal decomposition may be considered the next best method. Mean imputation and exclusion & interpolation, however, always produced biased and misleading results in the subsequent analyses, and their use in handling missing values should therefore be discouraged. Keywords: Missing Data; Routine Health Information Systems (RHIS); Health Management Information System (HMIS); Health Services Research; Low- and middle-income countries (LMICs); Multiple imputation
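The kind of distortion the study measures — an imputation method biasing a segmented (interrupted time series) regression — can be illustrated on synthetic monthly report counts. The break point, effect size, and missingness pattern below are invented for the sketch; this is not the DRC HMIS data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
months = np.arange(48)                     # four years of monthly facility reports
break_pt = 36                              # e.g. an interruption such as a pandemic onset
level_drop = 30
visits = 200 + 1.5 * months - level_drop * (months >= break_pt) + rng.normal(0, 5, 48)
s = pd.Series(visits)
s[rng.choice(48, 10, replace=False)] = np.nan   # 10 monthly reports missing

def segmented_fit(y):
    """Fit intercept + slope + post-break level change by least squares,
    silently dropping missing observations (complete-case exclusion)."""
    X = np.column_stack([np.ones(48), months, (months >= break_pt).astype(float)])
    ok = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(X[ok], y[ok], rcond=None)
    return beta   # [intercept, slope, level change at the break]

for name, series in [("mean imputation", s.fillna(s.mean())),
                     ("complete-case exclusion", s)]:
    b = segmented_fit(series.to_numpy())
    print(f"{name}: estimated level change = {b[2]:.1f} (true {-level_drop})")
```

Running this shows why mean imputation can pull the estimated level change away from the truth: filling a trending, interrupted series with its overall mean flattens exactly the structure the segmented regression is trying to estimate.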


2020 ◽  
Vol 28 (108) ◽  
pp. 599-621
Author(s):  
Maria Eugénia Ferrão ◽  
Paula Prata ◽  
Maria Teresa Gonzaga Alves

Abstract Almost all quantitative studies in educational assessment, evaluation and educational research are based on incomplete data sets, which have been a problem for years without a single solution. The use of big identifiable data poses new challenges in dealing with missing values. In the first part of this paper, we present the state of the art of the topic in the Brazilian education scientific literature, and how researchers have dealt with missing data since the turn of the century. Next, we use open-access software to analyze real-world data, the 2017 Prova Brasil, for several federation units to document how the naïve assumption of missing completely at random may substantially affect statistical conclusions, researcher interpretations, and subsequent implications for policy and practice. We conclude with straightforward suggestions for any education researcher on applying R routines to conduct the hypothesis test of missing completely at random and, if the null hypothesis is rejected, on how to implement multiple imputation, which appears to be one of the most appropriate methods for handling missing data.
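The workflow recommended above (the paper uses R routines) has a rough Python analogue: a crude MCAR check in the spirit of Little's test — comparing a covariate between respondents and non-respondents — followed by multiple imputation when MCAR is rejected. The data-generating mechanism and the covariate below are hypothetical, chosen so that missingness is in fact not MCAR:

```python
import numpy as np
from scipy import stats
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(6)
score = rng.normal(250, 50, 500)                    # e.g. an assessment score
ses = 0.01 * score + rng.normal(0, 1, 500)          # correlated covariate
miss = rng.random(500) < stats.norm.cdf(ses - 2.5)  # missingness depends on ses -> not MCAR
score_obs = np.where(miss, np.nan, score)

# Crude MCAR check: does the covariate distribution differ
# between units with observed and with missing scores?
t, p = stats.ttest_ind(ses[miss], ses[~miss])
print(f"t-test on covariate by missingness: p = {p:.4f}")   # small p -> reject MCAR

if p < 0.05:   # MCAR rejected: impute using the covariate, several times
    X = np.column_stack([score_obs, ses])
    draws = [IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)[:, 0]
             for m in range(5)]                      # 5 multiple-imputation draws
    pooled_mean = np.mean([d.mean() for d in draws])
    print(f"pooled mean score after multiple imputation: {pooled_mean:.1f}")
```

Pooling estimates across several posterior draws, as in the last two lines, is what distinguishes multiple imputation from filling each gap a single time.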

