Treatment of missing data determines conclusions regarding moralizing gods

Whitehouse, et al. have recently used the Seshat archaeo-historical databank to argue that beliefs in moralizing gods appear in world history only after the formation of complex “megasocieties” of around one million people. Inspection of the authors’ data, however, shows that 61% of Seshat data points on moralizing gods are missing values, mostly from smaller populations below one million people, and during the analysis the authors re-coded these data points to signify the absence of moralizing gods beliefs. When we confine the analysis only to the extant data or use various standard imputation methods, the reported finding is reversed: moralizing gods precede increases in social complexity. We suggest that the reported “megasociety threshold” for the emergence of moralizing gods is thus solely a consequence of the decision to re-code nearly two-thirds of Seshat data from unknown values to known absences of moralizing gods.
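To make the coding issue concrete, the sketch below contrasts the two treatments of unknown values on a toy list of records. The data, the use of `None` for an unknown value, and the function names are all hypothetical; nothing here is taken from Seshat or from the authors' analysis.

```python
# Toy records: True = moralizing gods present, False = absent, None = unknown.
# These values are invented purely for illustration.
records = [True, None, None, False, True, None, None, True, None, None]


def share_present_recoded(values):
    """Treat every unknown value as a known absence, then take the share present."""
    recoded = [False if v is None else v for v in values]
    return sum(recoded) / len(recoded)


def share_present_observed(values):
    """Use only the values that were actually observed."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


print(share_present_recoded(records))   # unknown coded as absent
print(share_present_observed(records))  # observed values only
```

On this toy input the two treatments give very different shares, which is the kind of sensitivity the abstract describes.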

Application of imputation methods for missing values of PM10 and O3 data: Interpolation, moving average and K-nearest neighbor methods

Environmental Health Engineering and Management ◽

10.34172/ehem.2021.25 ◽

2021 ◽

Vol 8 (3) ◽

pp. 215-226

Author(s):

Parisa Saeipourdizaj ◽

Parvin Sarbakhsh ◽

Akbar Gholampour

Keyword(s):

Missing Data ◽

Human Error ◽

Missing Values ◽

Nearest Neighbor ◽

Moving Average ◽

Classification And Regression Tree ◽

Coefficient Of Determination ◽

K Nearest Neighbor ◽

Imputation Methods ◽

Machine Failure

Background: In air quality studies, missing data are very common owing to causes such as machine failure or human error. The approach used to handle such missing data can affect the results of the analysis. The main aim of this study was to review the types of missingness mechanism and the available imputation methods, apply some of these methods to impute missing PM10 and O3 values in Tabriz, and compare their efficiency. Methods: The mean, EM algorithm, regression, classification and regression tree, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) methods were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10, 20, and 30% missing values. The efficiency of the methods was compared using the coefficient of determination (R2), mean absolute error (MAE), and root mean square error (RMSE). Results: Based on the results for all indicators, interpolation, moving average, and KNN had the best performance, in that order. PMM did not perform well either with or without spatio-temporal information. Conclusion: Because pollution data depend on preceding and following observations, methods whose computation is based on the values before and after a gap performed better than the others, so these methods are recommended for pollutant data.
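As an illustration of the kind of comparison described, the sketch below fills gaps in a synthetic series by linear interpolation and by a simple moving average, then scores both with MAE and RMSE. It is a minimal sketch, not the authors' procedure: the series, the fixed gap positions (the study drew them at random), the window size, and all function names are assumptions.

```python
import math


def linear_interpolate(series):
    """Fill None gaps by linear interpolation between the nearest observed values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:  # nothing observed, nothing to interpolate from
        return filled
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:      # gap at the start: carry the first observation back
            filled[i] = series[right]
        elif right is None:   # gap at the end: carry the last observation forward
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def moving_average_fill(series, window=2):
    """Fill None gaps with the mean of observed values within `window` steps on each side."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            lo, hi = max(0, i - window), min(len(series), i + window + 1)
            near = [x for x in series[lo:hi] if x is not None]
            if near:  # leave the gap if no observed neighbour is in range
                filled[i] = sum(near) / len(near)
    return filled


def mae(truth, estimate, idx):
    """Mean absolute error over the positions in idx."""
    return sum(abs(truth[i] - estimate[i]) for i in idx) / len(idx)


def rmse(truth, estimate, idx):
    """Root mean square error over the positions in idx."""
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in idx) / len(idx))


# Synthetic series (invented values) with a fixed set of removed positions.
truth = [40 + 10 * math.sin(i / 3) for i in range(24)]
removed = [3, 4, 9, 15, 20]
observed = [None if i in removed else v for i, v in enumerate(truth)]

for name, fill in [("interpolation", linear_interpolate),
                   ("moving average", moving_average_fill)]:
    estimate = fill(observed)
    print(name, round(mae(truth, estimate, removed), 3),
          round(rmse(truth, estimate, removed), 3))
```

Scoring only the removed positions keeps the error measures focused on the imputed values rather than on entries that were never missing.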

Missing Data: Current Practice in Football Research and Recommendations for Improvement

10.31236/osf.io/fhwcu ◽

2020 ◽

Author(s):

David N Borg ◽

Robert Nguyen ◽

Nicholas J Tierney

Keyword(s):

Missing Data ◽

Missing Values ◽

Current Practice ◽

Exploratory Analysis ◽

Imputation Methods ◽

Research Results ◽

Study Results

Missing data are often unavoidable. The reason values go missing, along with decisions about how missing data are handled (deleted or imputed), can have a profound effect on the validity and accuracy of study results. In this article, we aimed to: estimate the proportion of studies in football research that included a missing data statement, highlight several practices to avoid in relation to missing data, and provide recommendations for exploring, visualising and reporting missingness. Football-related articles published in 2019 were studied. A survey of 136 articles, sampled at random, was conducted to determine whether a missing data statement was included. As expected, the proportion of studies in football research that included a missing data statement was low, at only 11.0% (95% CI: 6.3% to 17.5%), suggesting that missingness is seldom considered by researchers. We recommend that researchers describe the number and percentage of missing values, including when there are no missing values. Exploratory analysis should be conducted to explore missing values, and visualisations describing missingness overall should be provided in the paper, or at least in the supplementary materials. Missing values should almost always be imputed, and imputation methods should be explored to ensure they are appropriately representative. Researchers should consider these recommendations, and pay greater attention to missing data and its influence on research results.

Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.339.05 ◽

2019 ◽

Vol 6 (339) ◽

pp. 73-98

Author(s):

Małgorzata Aleksandra Misztal

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Data Sets ◽

Continuous Variables ◽

Imputation Methods ◽

Study Results ◽

Almost All

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not confined to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete them. However, this leads to inefficient and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
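The evaluation loop described here (mask a complete data set, impute, score with NRMSE) can be sketched as follows. This is only an illustration under stated assumptions: plain mean imputation stands in for MICE, missForest and MIPCA; the NRMSE is normalised by the variance of the true values at the masked positions, which is one common definition and may differ from the paper's; and the data and function names are invented.

```python
import math
import random


def mask_mcar(values, proportion, seed=0):
    """Return a copy with `proportion` of entries set to None, chosen uniformly at random."""
    rng = random.Random(seed)
    n_missing = round(proportion * len(values))
    missing = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing else v for i, v in enumerate(values)]


def mean_impute(values):
    """Replace every None with the mean of the observed values (stand-in imputer)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def nrmse(truth, imputed, masked):
    """RMSE over the masked entries, normalised by the variance of their true values."""
    idx = [i for i, v in enumerate(masked) if v is None]
    true_vals = [truth[i] for i in idx]
    mse = sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx)
    mu = sum(true_vals) / len(true_vals)
    var = sum((t - mu) ** 2 for t in true_vals) / len(true_vals)
    return math.sqrt(mse / var)


# One synthetic complete variable (invented), masked at increasing proportions.
rng = random.Random(42)
truth = [rng.gauss(50, 10) for _ in range(50)]
for proportion in (0.1, 0.3, 0.5):
    masked = mask_mcar(truth, proportion)
    print(proportion, round(nrmse(truth, mean_impute(masked), masked), 3))
```

Because the true values are known before masking, the error can be computed exactly at the masked positions, which is what makes this kind of simulation study possible.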

Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis (Preprint)

10.2196/preprints.8960 ◽

2017 ◽

Cited By ~ 1

Author(s):

Brett K Beaulieu-Jones ◽

Daniel R Lavage ◽

John W Snyder ◽

Jason H Moore ◽

Sarah A Pendergrass ◽

...

Keyword(s):

Missing Data ◽

Missing Values ◽

Clinical Laboratory ◽

Real Data ◽

Missing At Random ◽

Theoretical Work ◽

Structured Data ◽

Data Types ◽

Imputation Methods ◽

Electronic Health

BACKGROUND Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. OBJECTIVE The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. METHODS We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). RESULTS Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. CONCLUSIONS The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
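Three of the missingness mechanisms named above can be simulated on toy data as sketched below. The variables (`age`, `lab`), the thresholds and the deletion probabilities are hypothetical and are not taken from the Geisinger analysis; the fourth mechanism (real data modelling) is not covered.

```python
import random


def simulate_missing(rows, mechanism, seed=0):
    """rows: list of (age, lab) pairs. Returns a list of (age, lab_or_None) pairs."""
    rng = random.Random(seed)
    out = []
    for age, lab in rows:
        if mechanism == "MCAR":
            p = 0.3                          # same chance for every row
        elif mechanism == "MAR":
            p = 0.5 if age >= 60 else 0.1    # depends on an observed variable
        elif mechanism == "MNAR":
            p = 0.5 if lab >= 100 else 0.1   # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((age, None if rng.random() < p else lab))
    return out


# Invented complete data: age is always observed, lab may be deleted.
rng = random.Random(1)
rows = [(rng.randint(20, 90), rng.gauss(100, 15)) for _ in range(1000)]

for mechanism in ("MCAR", "MAR", "MNAR"):
    data = simulate_missing(rows, mechanism)
    observed = [lab for _, lab in data if lab is not None]
    print(mechanism, len(observed), round(sum(observed) / len(observed), 2))
```

Comparing the mean of the observed `lab` values across mechanisms gives a quick sense of when deletion alone can bias a summary statistic.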

Futuristic Prediction of Missing Value Imputation Methods Using Extended ANN

International Journal of Business Analytics ◽

10.4018/ijban.292055 ◽

2022 ◽

Vol 9 (3) ◽

pp. 0-0

Keyword(s):

Data Analysis ◽

Missing Data ◽

Measurement Errors ◽

Missing Values ◽

Missing Value ◽

Hybrid Schemes ◽

Imputation Methods ◽

Research Fields ◽

Data Missing ◽

The Given

Missing data is a universal complication in most research fields and introduces uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, the inability to collect an observation, measurement errors, deleted aberrant values, or simply a lack of study. The nutrition field is no exception to the problem of missing data. Most often, the problem is handled by computing means or medians from the existing data, an approach that needs improvement. The paper proposes a hybrid scheme of MICE and ANN, known as extended ANN, to find and analyze the missing values and perform imputation in the given dataset. The proposed mechanism can efficiently analyze the blank entries and fill them by examining their neighboring records, in order to improve the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms to analyze the efficiency and accuracy of the results.

Multiple Imputation for Missing Values in Homicide Incident Data: An Evaluation Using Unique Test Data

Homicide Studies ◽

10.1177/1088767918778309 ◽

2018 ◽

Vol 22 (4) ◽

pp. 391-409

Author(s):

John M. Roberts ◽

Aki Roberts ◽

Tim Wadsworth

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Missing Values ◽

Actual Data ◽

Regression Coefficients ◽

Similar Data ◽

Missing Information ◽

Imputation Methods ◽

Unique Data ◽

Incident Reports

Incident-level homicide datasets such as the Supplementary Homicide Reports (SHR) commonly exhibit missing data. We evaluated multiple imputation methods (that produce multiple completed datasets, across which imputed values may vary) via unique data that included actual values, from police agency incident reports, of seemingly missing SHR data. This permitted evaluation under a real, not assumed or simulated, missing data mechanism. We compared analytic results based on multiply imputed and actual data; multiple imputation rather successfully recovered victim–offender relationship distributions and regression coefficients that hold in the actual data. Results are encouraging for users of multiple imputation, though it is still important to minimize the extent of missing information in SHR and similar data.
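The abstract does not say how estimates were combined across the completed datasets. Rubin's rules are the standard way to pool a coefficient from multiply imputed data, so the sketch below assumes them; the input numbers and the function name are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one coefficient across m completed datasets using Rubin's rules.

    estimates: the coefficient estimated in each completed dataset
    variances: its squared standard error in each completed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled point estimate
    within = sum(variances) / m                                  # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Invented results from five completed datasets.
estimates = [0.42, 0.47, 0.39, 0.45, 0.44]
variances = [0.010, 0.012, 0.011, 0.010, 0.013]
print(pool_rubin(estimates, variances))
```

The between-imputation term is what carries the extra uncertainty due to the missing values, so the pooled standard error is never smaller than the one based on the average within-imputation variance alone.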

CBRL and CBRC: Novel Algorithms for Improving Missing Value Imputation Accuracy Based on Bayesian Ridge Regression

Symmetry ◽

10.3390/sym12101594 ◽

2020 ◽

Vol 12 (10) ◽

pp. 1594

Author(s):

Samih M. Mostafa ◽

Abdelrahman S. Eladimy ◽

Safwat Hamad ◽

Hirofumi Amano

Keyword(s):

Missing Data ◽

Missing Values ◽

Mean Absolute Error ◽

Imputation Accuracy ◽

Absolute Error ◽

Coefficient Of Determination ◽

Mean Square ◽

Critical Problem ◽

Imputation Methods ◽

Novel Algorithms

In most scientific studies such as data analysis, the existence of missing data is a critical problem, and selecting the appropriate approach to deal with missing data is a challenge. In this paper, the authors perform a fair comparative study of some practical imputation methods used for handling missing values against two proposed imputation algorithms. The proposed algorithms depend on the Bayesian Ridge technique under two different feature selection conditions. The proposed algorithms differ from the existing approaches in that they accumulate the imputed features; those imputed features are incorporated within the Bayesian Ridge equation for predicting the missing values in the next incomplete selected feature. The authors applied the proposed algorithms to eight datasets with different amounts of missing values created from different missingness mechanisms. The performance was measured in terms of imputation time, root-mean-square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). The results showed that the performance varies depending on the percentage of missing values, the size of the dataset, and the missingness mechanism. In addition, the performance of the proposed methods is slightly better than that of the existing methods.

Filling the Missing Data of Air Pollutant Concentration Using Single Imputation Methods

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.754-755.923 ◽

2015 ◽

Vol 754-755 ◽

pp. 923-932 ◽

Cited By ~ 2

Author(s):

Norazian Mohamed Noor ◽

A.S. Yahaya ◽

N.A. Ramli ◽

Mohd Mustafa Al Bakri Abdullah

Keyword(s):

Missing Data ◽

Missing Values ◽

Mean Squared Error ◽

Absolute Error ◽

Linear Interpolation ◽

Peninsular Malaysia ◽

Air Pollutant ◽

Imputation Methods ◽

Single Imputation ◽

Averaging Time

Hourly PM10 concentrations measured at eight monitoring stations in peninsular Malaysia in 2006 were used to simulate missing data. The gap lengths of the simulated missing values were limited to 12 hours, since actual gaps tend to be short. Two percentages of simulated missing gaps were generated: 5% and 15%. A number of single imputation methods (linear interpolation (LI), nearest neighbour interpolation (NN), mean above below (MAB), daily mean (DM), mean 12-hour (12M), mean 6-hour (6M), row mean (RM) and previous year (PY)) were used to fill in the simulated missing data. In addition, multiple imputation (MI) was conducted for comparison with the single imputation methods. Performance was evaluated using four statistical criteria, namely mean absolute error, root mean squared error, prediction accuracy and index of agreement. The results show that 6M performs comparably well to LI, which suggests that a smaller averaging time gives better prediction. The other single imputation methods predict the missing data well, except for PY. RM and MI perform moderately, with performance increasing at the higher fraction of missing gaps, whereas LR is the worst method for both simulated missing data percentages.
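Two of the ingredients named above can be sketched briefly. The `mean_above_below` function reflects one reading of the method's name (fill a gap with the mean of the observations just before and just after it), which is an assumption rather than the paper's stated definition; `index_of_agreement` uses Willmott's formulation, also assumed here. The example series is invented.

```python
def mean_above_below(series):
    """Fill each run of None with the mean of the observations bordering the run."""
    filled = list(series)
    i, n = 0, len(series)
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        j = i
        while j < n and series[j] is None:   # find the end of this gap
            j += 1
        neighbours = []
        if i > 0:
            neighbours.append(series[i - 1])  # observation above the gap
        if j < n:
            neighbours.append(series[j])      # observation below the gap
        if neighbours:                        # an all-missing series is left untouched
            value = sum(neighbours) / len(neighbours)
            for k in range(i, j):
                filled[k] = value
        i = j
    return filled


def index_of_agreement(observed, predicted):
    """Willmott's index of agreement: 1 is a perfect match, lower is worse."""
    o_bar = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    den = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2
              for o, p in zip(observed, predicted))
    return 1 - num / den


# Invented hourly values with a two-hour gap.
truth = [52.0, 55.0, 61.0, 58.0, 50.0, 47.0]
with_gap = [52.0, 55.0, None, None, 50.0, 47.0]
filled = mean_above_below(with_gap)
print(filled)
print(index_of_agreement(truth, filled))
```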

Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

10.1101/260281 ◽

2018 ◽

Cited By ~ 2

Author(s):

Kieu Trinh Do ◽

Simone Wahl ◽

Johannes Raffler ◽

Sophie Molnos ◽

Michael Laimighofer ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Statistical Power ◽

Missing Values ◽

Biological Evaluation ◽

List Type ◽

Robust Performance ◽

Metabolomics Data ◽

Imputation Methods ◽

Biochemical Pathways

BACKGROUND Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. METHODS We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established genetically metabolic quantitative trait loci. RESULTS Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes. KEY MESSAGES Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects. Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets. Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and correctly estimate effects of genetic variants on metabolite levels. KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
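KNN imputation on observations can be sketched in a few lines: for each missing entry, find the rows (samples) most similar on the columns both rows have observed, and average their values for the missing column. This is a minimal sketch under assumptions, not the study's pipeline: it omits the variable pre-selection step, uses a toy `k` rather than K = 10, and the distance definition, data and function name are all hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(rows):
                if other_i == i or other[j] is None:
                    continue  # a donor row must observe the column being filled
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                # Root mean squared difference over shared columns, so rows with
                # different numbers of shared columns stay comparable.
                dist = math.sqrt(sum((row[c] - other[c]) ** 2 for c in shared)
                                 / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the entry missing if no donor row exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Invented samples (rows) by metabolites (columns); the last sample has a gap.
samples = [
    [1.00, 10.0],
    [1.10, 12.0],
    [5.00, 50.0],
    [1.05, None],
]
print(knn_impute(samples, k=2)[3])
```

Distances are computed on the original incomplete rows rather than on partially filled ones, so the order in which entries are imputed does not affect the result.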

Imputation Methods for Missing Data for a Proposed VASA Dataset

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a5204.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 1950-1953

Keyword(s):

Missing Data ◽

Mean Square Error ◽

Missing Values ◽

Evaluation Criteria ◽

Principal Component ◽

Mean Square ◽

K Nearest Neighbors ◽

Imputation Methods ◽

Initial Dataset ◽

Value Decomposition

Preprocessing is the preparation of raw data before applying the actual statistical method. Data preprocessing is one of the most vital steps in the data mining process, and it deals with the preparation and transformation of the initial dataset. It is important because analyzing data that have not been properly preprocessed can lead to inaccurate and meaningless results. Almost every study has missing data, which must be handled by some method during data analysis. Missing values need to be taken into account to provide an efficient and valid analysis. Imputation of missing values is one of the processes in data cleaning. Here, four different imputation methods are compared: Mean, Singular Value Decomposition (SVD), K-Nearest Neighbors (KNN), and Bayesian Principal Component Analysis (BPCA). The comparison was performed on the real VASA dataset, based on performance evaluation criteria such as Mean Square Error (MSE) and Root Mean Square Error (RMSE). BPCA was the best imputation method and deserves further consideration in practice.
