Class Center-Based Firefly Algorithm for Handling Missing Data

Author(s):  
Heru Nugroho ◽  
Nugraha Priya Utama ◽  
Kridanto Surendro

Abstract A significant advancement that occurs during the data cleaning stage is estimating missing data. Studies have shown that improper data handling leads to inaccurate analysis. Furthermore, most studies estimate missing data without considering the correlation between attributes. However, an adaptive search procedure helps to determine the estimates of the missing data when correlations between attributes are considered in the process. The Firefly Algorithm (FA) implements an adaptive search procedure in the imputation of missing data by determining the estimated value closest to the values of other known data. Therefore, this study proposes a class center-based adaptive approach model for handling missing data that considers the attribute correlation in the imputation process (C3-FA). The results showed that the class center-based firefly algorithm is an efficient technique for recovering the actual value when handling missing data, with a Pearson correlation coefficient (r) close to 1 and a root mean squared error (RMSE) close to 0. In addition, the proposed method is able to maintain the true distribution of data values, as indicated by the Kolmogorov–Smirnov test, in which the DKS value for most attributes in the dataset is generally close to 0. Furthermore, the accuracy evaluation using three classifiers showed that the proposed method produces good accuracy.
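As a concrete illustration of the evaluation described here, the sketch below (not the authors' code; function and variable names are hypothetical) computes the three reported quality measures for one imputed attribute: Pearson's r, RMSE, and the Kolmogorov–Smirnov distance DKS between imputed and true values.

```python
# Illustrative sketch only: quality measures for one imputed attribute,
# assuming arrays of true and imputed values are available for comparison.
import numpy as np
from scipy import stats

def evaluate_imputation(true_values, imputed_values):
    """Return (r, RMSE, D_KS); r close to 1 and RMSE/D_KS close to 0 are good."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    r, _ = stats.pearsonr(true_values, imputed_values)
    rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))
    d_ks, _ = stats.ks_2samp(true_values, imputed_values)  # distributional check
    return r, rmse, d_ks
```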


2020 ◽  
Author(s):  
Heru Nugroho ◽  
Nugraha Priya Utama ◽  
Kridanto Surendro

Abstract Estimating missing data in a dataset is a significant step during the data cleaning stage. Improper data handling can produce inaccurate results when conducting data analysis. Most research on missing data estimation disregards the correlation between attributes. However, an adaptive search procedure helps find estimates of the missing data when correlations between attributes are considered in the process. The Firefly Algorithm (FA) implements an adaptive search procedure in the imputation of missing data by finding the estimated value that is closest to the values in other known data. Therefore, this study proposes a class center-based adaptive approach model for missing data that considers the attribute correlation in the imputation process (C3-FA). Based on the experiment, the general result is that the class center-based firefly algorithm is an efficient technique for approximating the actual value when handling missing data. This can be seen in the Pearson correlation coefficient (r), which is close to 1, and the root mean squared error (RMSE), which is generally close to 0. In addition, the proposed method can maintain the true distribution of data values, as indicated by the Kolmogorov–Smirnov test, in which the DKS value for most of the attributes in the dataset is generally close to 0. The accuracy evaluation using three classifiers also showed that the proposed method produces good accuracy.
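For orientation, the minimal baseline below shows the "class center" idea in its simplest form: replacing a missing value with the mean of that attribute over records of the same class. It is an assumption-laden sketch, not the C3-FA method itself; the correlation weighting and the firefly search are deliberately omitted, and all names are hypothetical.

```python
# Minimal class-center imputation baseline (illustrative only; C3-FA's
# correlation weighting and firefly search are not shown here).
import pandas as pd

def class_center_impute(df: pd.DataFrame, class_col: str) -> pd.DataFrame:
    """Fill NaNs in each numeric column with the per-class column mean."""
    out = df.copy()
    numeric_cols = out.select_dtypes("number").columns.drop(class_col, errors="ignore")
    for col in numeric_cols:
        out[col] = out[col].fillna(out.groupby(class_col)[col].transform("mean"))
    return out
```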


10.2196/27386 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e27386
Author(s):  
Qingyu Chen ◽  
Alex Rankine ◽  
Yifan Peng ◽  
Elaheh Aghaarabi ◽  
Zhiyong Lu

Background Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for participation in the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 on the official test set during the 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and ranked second. Objective Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported the 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (the official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that the BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when the pairs have different negation terms or word orders. In addition, time efficiency differs dramatically from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This poses challenges for real-time applications. Conclusions Despite the excitement of further improving Pearson correlations on this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In the future, we suggest more evaluations of the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence relatedness.
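The benchmarking protocol described here (repeated runs, Pearson correlation as the primary metric, wall-clock timing, and a Wilcoxon rank-sum comparison) can be sketched as follows. This is an illustration, not the authors' pipeline; run_model is a hypothetical callable that returns predicted similarity scores for a list of sentence pairs.

```python
# Illustrative benchmarking loop: repeat inference, record Pearson r and
# running time per run, then compare two models with a rank-sum test.
import time
import numpy as np
from scipy import stats

def benchmark(run_model, sentence_pairs, gold_scores, repeats=10):
    correlations, runtimes = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        predictions = run_model(sentence_pairs)      # hypothetical model call
        runtimes.append(time.perf_counter() - start)
        correlations.append(stats.pearsonr(gold_scores, predictions)[0])
    return correlations, runtimes

# e.g. compare two models' per-run correlations:
# statistic, p_value = stats.ranksums(corr_runs_model_a, corr_runs_model_b)
```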


2018 ◽  
Author(s):  
Cailey Elizabeth Fitzgerald ◽  
Ryne Estabrook ◽  
Daniel Patrick Martin ◽  
Andreas Markus Brandmaier ◽  
Timo von Oertzen

Missing data are ubiquitous in both small and large datasets. Missing data may come about as a result of coding or computer error or participant absences, or they may be intentional, as in planned missing designs. We discuss missing data as they relate to goodness-of-fit indices in Structural Equation Modeling (SEM), specifically the effects of missing data on the Root Mean Squared Error of Approximation (RMSEA). We use simulations to show that naive implementations of the RMSEA have a downward bias in the presence of missing data and, thus, overestimate model goodness-of-fit. Unfortunately, many state-of-the-art software packages report the biased form of the RMSEA. As a consequence, the community may have been accepting a much larger fraction of models with non-acceptable model fit. We propose a bias correction for the RMSEA based on information-theoretic considerations that take into account the expected misfit of a person with fully observed data. This results in an RMSEA that is asymptotically independent of the proportion of missing data for misspecified models. Importantly, results of the corrected RMSEA computation are identical to the naive RMSEA if there are no missing data.
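For reference, the conventional sample RMSEA that the abstract calls "naive" is computed from the model χ² statistic, its degrees of freedom df, and the sample size N (some software uses N rather than N − 1 in the denominator):

\[
\widehat{\mathrm{RMSEA}} \;=\; \sqrt{\max\!\left(\frac{\chi^{2}-df}{df\,(N-1)},\;0\right)}
\]

Roughly speaking, partially observed cases contribute less misfit to χ² than fully observed ones, so this plug-in estimate shrinks as the proportion of missing data grows; the correction proposed in the paper, which rescales the misfit to that expected of a fully observed case, is not reproduced here.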


2015 ◽  
Vol 78 (4) ◽  
pp. 668-674 ◽  
Author(s):  
MATTHEW EADY ◽  
BOSOON PARK ◽  
SUN CHOI

This study was designed to evaluate hyperspectral microscope images for early and rapid detection of Salmonella serotypes Enteritidis, Heidelberg, Infantis, Kentucky, and Typhimurium at incubation times of 6, 8, 10, 12, and 24 h. Images were collected by an acousto-optical tunable filter hyperspectral microscope imaging system with a metal halide light source measuring 89 contiguous wavelengths every 4 nm between 450 and 800 nm. Pearson correlation values were calculated for incubation times of 8, 10, and 12 h and compared with data for 24 h to evaluate the change in spectral signatures from bacterial cells over time. Regions of interest were analyzed at 30% of the pixels in an average cell size. Spectral data were preprocessed by applying a global data transformation algorithm and then subjected to principal component analysis (PCA). The Mahalanobis distance was calculated from PCA score plots for analyzing serotype cluster separation. Partial least-squares regression was applied for calibration and validation of the model, and soft independent modeling of class analogy was utilized to classify serotype clusters in the training set. Pearson correlation values, ranging from 0.9869 to 0.9990, indicate very similar spectral patterns at the reduced incubation times. PCA score plots indicated cluster separation at all incubation times, with Mahalanobis distances between clusters of 2.146 to 27.071 across incubation times. Partial least-squares regression had a maximum root mean squared error of calibration of 0.0025 and a root mean squared error of validation of 0.0030. Soft independent modeling of class analogy correctly classified samples at 8 h (98.32%), 10 h (96.67%), 12 h (88.33%), and 24 h (98.67%) with the optimal number of principal components (four or five). The results of this study suggest that Salmonella serotypes can be classified by applying PCA to hyperspectral microscope imaging data from samples after only 8 h of incubation.
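The cluster-separation measure used here can be illustrated with a short sketch (not the study's code): project the preprocessed spectra with PCA and compute the Mahalanobis distance between the centroids of two serotype clusters in the score space. The pooled covariance and the function names are assumptions.

```python
# Illustrative only: PCA scores plus Mahalanobis distance between two
# class centroids, using a pooled covariance estimated from both clusters.
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.decomposition import PCA

def cluster_separation(spectra, labels, class_a, class_b, n_components=4):
    labels = np.asarray(labels)
    scores = PCA(n_components=n_components).fit_transform(spectra)
    a, b = scores[labels == class_a], scores[labels == class_b]
    pooled_cov = np.cov(np.vstack([a, b]).T)   # pooled covariance (assumption)
    return mahalanobis(a.mean(axis=0), b.mean(axis=0), np.linalg.pinv(pooled_cov))
```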


Methodology ◽  
2021 ◽  
Vol 17 (3) ◽  
pp. 189-204
Author(s):  
Cailey E. Fitzgerald ◽  
Ryne Estabrook ◽  
Daniel P. Martin ◽  
Andreas M. Brandmaier ◽  
Timo von Oertzen

Missing data are ubiquitous in psychological research. They may come about as an unwanted result of coding or computer error, participants' non-response or absence, or missing values may be intentional, as in planned missing designs. We discuss the effects of missing data on χ²-based goodness-of-fit indices in Structural Equation Modeling (SEM), specifically on the Root Mean Squared Error of Approximation (RMSEA). We use simulations to show that naive implementations of the RMSEA have a downward bias in the presence of missing data and, thus, overestimate model goodness-of-fit. Unfortunately, many state-of-the-art software packages report the biased form of RMSEA. As a consequence, the scientific community may have been accepting a much larger fraction of models with non-acceptable model fit. We propose a bias-correction for the RMSEA based on information-theoretic considerations that take into account the expected misfit of a person with fully observed data. The corrected RMSEA is asymptotically independent of the proportion of missing data for misspecified models. Importantly, results of the corrected RMSEA computation are identical to naive RMSEA if there are no missing data.


2019 ◽  
Vol 79 (3) ◽  
pp. 558-576 ◽  
Author(s):  
Alexandra De Raadt ◽  
Matthijs J. Warrens ◽  
Roel J. Bosker ◽  
Henk A. L. Kiers

Cohen’s kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen’s kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data under two missing data mechanisms, namely missingness completely at random and a form of missingness not at random. The kappa coefficient considered in Gwet (Handbook of Inter-rater Reliability, 4th ed.) and the kappa coefficient based on listwise deletion of units with missing ratings were found to have virtually no bias and mean squared error if missingness is completely at random, and small bias and mean squared error if missingness is not at random. Furthermore, the kappa coefficient that treats missing ratings as a regular category appears to be rather heavily biased and has a substantial mean squared error in many of the simulations. Because it performs well and is easy to compute, we recommend using the kappa coefficient that is based on listwise deletion of missing ratings if it can be assumed that missingness is completely at random or not at random.
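Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The recommended listwise-deletion variant simply drops units with a missing rating from either rater before computing kappa; a minimal sketch (not the authors' implementation, and assuming missing ratings are encoded as None) is given below.

```python
# Listwise-deletion kappa sketch: drop units where either rating is missing,
# then compute Cohen's kappa on the remaining complete pairs.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_listwise(rater1, rater2, missing=None):
    r1 = np.asarray(rater1, dtype=object)
    r2 = np.asarray(rater2, dtype=object)
    keep = np.array([a != missing and b != missing for a, b in zip(r1, r2)])
    return cohen_kappa_score(r1[keep], r2[keep])
```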


Forests ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 634 ◽  
Author(s):  
Piotr Pogoda ◽  
Wojciech Ochał ◽  
Stanisław Orzeł

We compare the usefulness of nonparametric and parametric methods of diameter distribution modeling. The nonparametric method was represented by a new tool, the kernel estimator of the cumulative distribution function, with bandwidths of 1 cm (KE1), 2 cm (KE2), and a bandwidth obtained automatically (KEA). The Johnson SB (JSB) function was used for the parametric method. The data set consisted of 7867 measurements made at breast height in 360 sample plots established in 36 managed black alder (Alnus glutinosa (L.) Gaertn.) stands located in southeastern Poland. Model performance was assessed using leave-one-plot-out cross-validation and goodness-of-fit measures: mean error, root mean squared error, and the Kolmogorov–Smirnov and Anderson–Darling statistics. The model based on KE1 revealed a good fit to the diameters forming the training sets. A poor fit was observed for KEA. Frequencies of diameters forming the test sets were properly fitted by KEA and poorly by KE1. KEA develops more general models that can be used for the approximation of independent data sets. Models based on KE1 adequately fit local irregularities in diameter frequency, which may be considered advantageous in some situations and a drawback in others due to the risk of model overfitting. The application of the JSB function to the training sets resulted in the worst fit among the developed models. The performance of the parametric method on the test sets varied depending on the criterion used. Similar to KEA, the JSB function gives more general models that emphasize the rough shape of the approximated distribution. Site type and stand age do not affect the fit of the nonparametric models. The JSB function shows a slightly better fit in older stands. The differences between the average values of the Kolmogorov–Smirnov (KS), Anderson–Darling (AD), and root mean squared error (RMSE) statistics calculated for models developed with the test sets were statistically nonsignificant, which indicates the similar usefulness of the investigated methods for modeling diameter distribution.
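The fixed-bandwidth kernel estimator of the cumulative distribution function (the KE1 and KE2 variants) can be sketched as below. The Gaussian kernel is an assumption; the automatic bandwidth selection behind KEA and the Johnson SB fitting are not reproduced.

```python
# Illustrative fixed-bandwidth kernel estimator of the diameter CDF:
# F_hat(x) = (1/n) * sum_i Phi((x - D_i) / h), with a Gaussian kernel.
import numpy as np
from scipy.stats import norm

def kernel_cdf(diameters, bandwidth_cm=1.0):
    d = np.asarray(diameters, dtype=float)
    def F(x):
        x = np.asarray(x, dtype=float)
        return norm.cdf((x[..., None] - d) / bandwidth_cm).mean(axis=-1)
    return F
```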


Author(s):  
Charles F. Manski ◽  
Max Tabord-Meehan

In this article, we present the wald_mse command, which computes the maximum mean squared error of a user-specified point estimator of the mean for a population of interest in the presence of missing data. As pointed out by Manski (1989, Journal of Human Resources 24: 343–360; 2007, Journal of Econometrics 139: 105–115), the presence of missing data results in the loss of point identification of the mean unless one is willing to make strong assumptions about the nature of the missing data. Despite this, decision makers may be interested in reporting a single number as their estimate of the mean as opposed to an estimate of the identified set. It is not obvious which estimator of the mean is best suited to this task, and there may not exist a universally best choice in all settings. To evaluate the performance of a given point estimator of the mean, wald_mse allows the decision maker to compute the maximum mean squared error of an arbitrary estimator under a flexible specification of the missing-data process.
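To make the setting concrete, the heavily simplified sketch below (not the wald_mse computation, which also accounts for sampling variability and a flexible missing-data process) shows Manski-style bounds on a mean when outcomes are known to lie in [y_lo, y_hi], together with the worst-case squared error of a reported point estimate over that identified interval; all names are hypothetical.

```python
# Simplified caricature: identification bounds for the mean under bounded
# outcomes, plus the worst-case squared error of a chosen point estimate.
import numpy as np

def mean_bounds(observed, n_missing, y_lo, y_hi):
    obs = np.asarray(observed, dtype=float)
    n = obs.size + n_missing
    return (obs.sum() + n_missing * y_lo) / n, (obs.sum() + n_missing * y_hi) / n

def worst_case_sq_error(point_estimate, lower, upper):
    # the unknown mean can sit anywhere in [lower, upper]; an endpoint is worst
    return max((point_estimate - lower) ** 2, (point_estimate - upper) ** 2)
```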


2018 ◽  
Vol 28 (5) ◽  
pp. 1311-1327 ◽  
Author(s):  
Faisal M Zahid ◽  
Christian Heumann

Missing data is a common issue that can cause problems in estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. In the case of a large number of covariates with missing data, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr based on sequential penalized regression models. Each variable with missing values is assumed to have a different distributional form and is imputed with its own imputation model using the ridge penalty. In the case of a large number of predictors with respect to the sample size, the use of a quadratic penalty guarantees unique estimates for the parameters and leads to better predictions than the usual Maximum Likelihood Estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides better imputed values even for a large number of covariates with small samples. The results are compared with those of the existing software packages mice, VIM and Amelia in simulation studies. The missing at random mechanism was the main assumption in the simulation study. The imputation performance of the proposed algorithm is evaluated with the mean squared imputation error and the mean absolute imputation error. The mean squared error ([Formula: see text]), parameter estimates with their standard errors and confidence intervals are also computed to compare performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. The algorithm’s performance becomes considerably better than that of the existing algorithms with an increasing number of covariates, especially when the number of predictors is close to or even greater than the sample size. Two real-life datasets are also used to examine the performance of the proposed algorithm using simulations.
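As a rough analogue of the sequential penalized-regression idea (not the mispr package itself, which also handles different distributional forms per variable), scikit-learn's IterativeImputer can be run with a ridge estimator so that each variable is imputed from the others in turn under an L2 penalty.

```python
# Sketch: iterative (sequential) imputation with a ridge-penalized regressor.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge

X_incomplete = np.array([[1.0, 2.0, np.nan],
                         [2.0, np.nan, 6.0],
                         [np.nan, 4.0, 9.0],
                         [4.0, 8.0, 12.0]])

imputer = IterativeImputer(estimator=Ridge(alpha=1.0), max_iter=10, random_state=0)
X_completed = imputer.fit_transform(X_incomplete)
```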

