Challenging the Raunkiaeran shortfall and the consequences of using imputed databases

2016
Author(s):  
Lucas Jardim ◽  
Luis Mauricio Bini ◽  
José Alexandre Felizola Diniz-Filho ◽  
Fabricio Villalobos

Summary
Given the prevalence of missing data on species' traits (the Raunkiaeran shortfall) and its importance for theoretical and empirical investigations, several methods have been proposed to fill sparse databases. Despite its advantages, imputation of missing data can introduce biases. Here, we evaluate the bias in descriptive statistics, model parameters, and phylogenetic signal estimation from imputed databases under different missingness and imputation scenarios.
We simulated coalescent phylogenies and traits under Brownian Motion and different Ornstein-Uhlenbeck evolutionary models. Missing values were created under three scenarios: missing completely at random; missing at random but phylogenetically structured; and missing at random but correlated with another variable. We considered four methods for handling missing data: deletion of missing values, imputation based on the observed mean trait value, Phylogenetic Eigenvector Maps, and Multiple Imputation by Chained Equations. Finally, we assessed estimation errors in descriptive statistics (mean, variance), regression coefficients, Moran's correlogram, and Blomberg's K for the imputed traits.
We found that the percentage of missing data, the missingness mechanism, the strength of the Ornstein-Uhlenbeck process, and the handling method all shaped estimation errors. When data were missing completely at random, descriptive statistics were well estimated, but Moran's correlogram and Blomberg's K were not, depending on the handling method. Handling methods performed worst when data were missing at random but phylogenetically structured; in this case, adding phylogenetic information improved the estimates. Although imputation error was correlated with estimation error, the relationship was not linear: estimation errors grew increasingly large as imputation error rose.
Imputed trait databases can bias ecological and evolutionary analyses. We advise researchers to share their raw data along with any imputed database, flagging imputed values and documenting the imputation process. Users can then assess the pattern of missing data and choose the most suitable method for their problem. We also suggest developing phylogenetic methods that account for imputation uncertainty and phylogenetic autocorrelation while preserving the phylogenetic signal of the original data.
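One simulation step of the design above can be sketched in R (a minimal sketch assuming the ape and phytools packages; the 30% MCAR rate is illustrative): evolve a trait under Brownian Motion on a coalescent tree, delete tip values completely at random, and compare Blomberg's K under the deletion handling method.

    library(ape)       # rcoal(): coalescent phylogenies
    library(phytools)  # fastBM(), phylosig()

    set.seed(42)
    tree  <- rcoal(100)            # coalescent phylogeny with 100 tips
    trait <- fastBM(tree)          # trait evolved under Brownian Motion

    # MCAR scenario: drop 30% of tip values at random
    lost       <- sample(names(trait), size = 30)
    trait_mcar <- trait[!names(trait) %in% lost]
    tree_mcar  <- drop.tip(tree, lost)   # the "delete missing values" method

    phylosig(tree, trait, method = "K")            # K on complete data
    phylosig(tree_mcar, trait_mcar, method = "K")  # K after deletion

Repeating this over many trees and missingness levels, and swapping deletion for mean imputation, Phylogenetic Eigenvector Maps, or MICE, reproduces the comparison framework described in the summary.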

2012
Vol 57 (1)
Author(s):  
HO MING KANG ◽  
FADHILAH YUSOF ◽  
ISMAIL MOHAMAD

This paper presents a study on the estimation of missing data. Data samples with different missingness mechanisms, namely missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), are simulated accordingly. The expectation-maximization (EM) algorithm and mean imputation (MI) are applied to these data sets, and their performances are compared using the mean absolute error (MAE) and root mean square error (RMSE). The results show that EM estimates the missing data with smaller errors than mean imputation (MI) under all three missingness mechanisms. However, the graphical results show that EM fails to estimate the missing values in the missing quadrants when the data are MNAR.
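A minimal sketch of the comparison (assuming Schafer's norm package, whose em.norm()/imp.norm() routines implement EM-based estimation and imputation under a multivariate normal model; the bivariate data here are simulated purely for illustration):

    library(norm)   # prelim.norm(), em.norm(), imp.norm()
    library(MASS)   # mvrnorm()

    set.seed(1)
    full <- mvrnorm(500, mu = c(0, 0),
                    Sigma = matrix(c(1, 0.6, 0.6, 1), 2))
    x <- full
    x[sample(length(x), 0.2 * length(x))] <- NA   # 20% MCAR

    # Mean imputation (MI)
    x_mi <- apply(x, 2, function(col) {
      col[is.na(col)] <- mean(col, na.rm = TRUE); col })

    # EM-based imputation
    s        <- prelim.norm(x)
    thetahat <- em.norm(s)       # ML estimates via EM
    rngseed(1234)
    x_em     <- imp.norm(s, thetahat, x)

    # MAE and RMSE on the deleted entries only
    mask <- is.na(x)
    mae  <- function(imp) mean(abs(imp[mask] - full[mask]))
    rmse <- function(imp) sqrt(mean((imp[mask] - full[mask])^2))
    c(MAE_MI = mae(x_mi), MAE_EM = mae(x_em),
      RMSE_MI = rmse(x_mi), RMSE_EM = rmse(x_em))

MAR and MNAR samples can be produced by making the deletion probability depend on the observed (MAR) or on the deleted (MNAR) values before applying the same two estimators.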


Author(s):  
Roderick J. Little

I review assumptions about the missing-data mechanism that underlie methods for the statistical analysis of data with missing values. I describe Rubin's original definition of missing at random (MAR), its motivation and criticisms, and his sufficient conditions for ignoring the missingness mechanism in likelihood-based, Bayesian, and frequentist inference. Related definitions, including missing completely at random, always MAR, always missing completely at random, and partially MAR, are also covered. I present a formal argument for weakening Rubin's sufficient conditions for frequentist maximum likelihood inference with precision based on the observed information. Some simple examples of MAR are described, together with an example where the missingness mechanism can be ignored even though MAR does not hold. Alternative approaches to statistical inference based on the likelihood function are reviewed, along with non-likelihood frequentist approaches, including weighted generalized estimating equations. Connections with the causal inference literature are also discussed. Finally, alternatives to Rubin's MAR definition are discussed, including informative missingness, informative censoring, and coarsening at random. The intent is to provide a relatively nontechnical discussion, although some of the underlying issues are challenging and touch on fundamental questions of statistical inference.
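For reference, the hierarchy reviewed above is usually stated as follows (the standard formulation, not specific to this article). Write the complete data as Y = (Y_obs, Y_mis), with missingness indicator M and mechanism parameter phi:

    MCAR: f(M | Y, phi) = f(M | phi)
    MAR:  f(M | Y, phi) = f(M | Y_obs, phi)   for all Y_mis
    MNAR: neither condition holds

The mechanism is ignorable for likelihood-based and Bayesian inference when MAR holds and the data parameter theta is distinct from phi, so that inference about theta may be based on the observed-data likelihood

    L(theta | Y_obs) ∝ ∫ f(Y_obs, Y_mis | theta) dY_mis.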


2020
Vol 28 (108)
pp. 599-621
Author(s):  
Maria Eugénia Ferrão ◽  
Paula Prata ◽  
Maria Teresa Gonzaga Alves

Abstract
Almost all quantitative studies in educational assessment, evaluation, and educational research are based on incomplete data sets, a problem that has persisted for years without a single solution. The use of big identifiable data poses new challenges in dealing with missing values. In the first part of this paper, we present the state of the art of the topic in the Brazilian education scientific literature and how researchers have dealt with missing data since the turn of the century. Next, we use open-access software to analyze real-world data, the 2017 Prova Brasil, for several federation units to document how the naïve assumption of missing completely at random may substantially affect statistical conclusions, researcher interpretations, and subsequent implications for policy and practice. We conclude with straightforward suggestions for any education researcher on applying R routines to conduct the hypothesis test of missing completely at random and, if the null hypothesis is rejected, how to implement multiple imputation, which appears to be one of the most appropriate methods for handling missing data.
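The workflow suggested in the conclusion can be summarized in a few lines of R (a sketch assuming the naniar and mice packages; the data frame `dat` and the variable names are hypothetical stand-ins for Prova Brasil variables):

    library(naniar)  # mcar_test(): Little's (1988) test of MCAR
    library(mice)    # multiple imputation by chained equations

    mcar_test(dat)   # a small p-value rejects MCAR

    # If MCAR is rejected, multiply impute and pool with Rubin's rules
    imp  <- mice(dat, m = 5, method = "pmm", seed = 2017)
    fits <- with(imp, lm(score ~ ses + gender))
    pool(fits)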


Author(s):  
Mohamed Reda Abonazel

This paper reviews two important problems in regression analysis, outliers and missing data, along with methods for handling them. Two applications are introduced to illustrate and study these methods using R code, providing practical guidance for researchers dealing with these problems in regression modeling with R. Finally, we conducted a Monte Carlo simulation study to compare different methods for handling missing data in the regression model. The simulation results indicate that, under our simulation factors, the k-nearest neighbors method is the best method for estimating missing values in regression models.
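A hedged sketch of the winning method (assuming the VIM package; `dat` and the model variables are hypothetical): impute with k-nearest neighbors, then fit the regression on the completed data.

    library(VIM)   # kNN(): k-nearest-neighbours imputation (Gower distance)

    dat_imp <- kNN(dat, k = 5, imp_var = FALSE)  # fill NAs, drop flag columns
    fit     <- lm(y ~ x1 + x2, data = dat_imp)
    summary(fit)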


Author(s):  
Brett K Beaulieu-Jones ◽  
Daniel R Lavage ◽  
John W Snyder ◽  
Jason H Moore ◽  
Sarah A Pendergrass ◽  
...  

BACKGROUND: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.
OBJECTIVE: The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered.
METHODS: We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling).
RESULTS: Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation.
CONCLUSIONS: The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
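For illustration, a hedged sketch of one of the low-error methods named above, softImpute, which treats the patient-by-laboratory matrix as approximately low rank (the object `lab_results` and the tuning values are hypothetical):

    library(softImpute)  # low-rank matrix completion

    labs <- as.matrix(lab_results)   # patients x lab measures, NAs for missing
    fit  <- softImpute(labs, rank.max = 10, lambda = 1, type = "als")
    labs_imputed <- complete(labs, fit)   # fill missing entries from the fit

Note the caveat in the results: softImpute yields a single completed matrix, so unlike the MICE variants it does not by itself support multiple imputation.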


2020
Vol 4 (Supplement_1)
pp. 509-509
Author(s):  
Peiyi Lu ◽  
Mack Shelley

Abstract
Studies using data from longitudinal health surveys of older adults have usually assumed the data were missing completely at random (MCAR) or missing at random (MAR), and subsequent analyses have accordingly used multiple imputation or likelihood-based methods to handle missing data. However, little existing research actually examines whether the data meet the MCAR/MAR assumptions before performing data analyses. This study first summarized the statistical methods commonly used to test the missingness mechanism and discussed their conditions of application. Then, using two-wave longitudinal data from the Health and Retirement Study (HRS; wave 2014-2015 and wave 2016-2017; N = 18,747), this study applied different approaches to test the missingness mechanism of several demographic and health variables. These approaches included Little's test, the logistic regression method, nonparametric tests, the false discovery rate, and others. Results indicated the data did not meet the MCAR assumption even though they had a very low rate of missing values. Demographic variables provided good auxiliary information for health variables. Health measures (e.g., self-reported health, activities of daily living, depressive symptoms) met the MAR assumptions. Older respondents may drop out of or die during the longitudinal survey, but attrition did not significantly affect the MAR assumption. Our findings support the MAR assumption for the demographic and health variables in the HRS and therefore provide statistical justification for HRS researchers to use imputation or likelihood-based methods to deal with missing data. However, researchers are strongly encouraged to test the missingness mechanism of the specific variables/data they choose when using a new dataset.
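A minimal sketch of the logistic-regression approach listed above: regress the missingness indicator of a health variable on observed covariates (the data frame and variable names are hypothetical). Observed covariates that predict missingness are evidence against MCAR, while MAR, with those covariates as auxiliaries, remains tenable.

    # Model the probability that self-reported health is missing
    hrs$miss_srh <- as.integer(is.na(hrs$self_reported_health))
    fit <- glm(miss_srh ~ age + gender + education,
               family = binomial, data = hrs)
    summary(fit)   # significant predictors argue against MCAR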


Author(s):  
Daniele Bottigliengo ◽  
Giulia Lorenzoni ◽  
Honoria Ocagli ◽  
Matteo Martinato ◽  
Paola Berchialla ◽  
...  

(1) Background: Propensity score methods have gained popularity in non-interventional clinical studies. As often occurs in observational datasets, some values of baseline covariates are missing for some patients. The present study aims to compare the performance of popular statistical methods for dealing with missing data in propensity score analysis. (2) Methods: We considered methods that account for missing data during the estimation process and methods based on the imputation of missing values, such as multiple imputation. The methods were applied to the dataset of an ongoing prospective registry for the treatment of unprotected left main coronary artery disease. Performance was assessed in terms of the overall balance of baseline covariates. (3) Results: Methods that explicitly deal with missing data were superior to classical complete-case analysis. The best balance was observed when propensity scores were estimated with a method that accounts for missing data using a stochastic approximation of the expectation-maximization algorithm. (4) Conclusions: If a missing at random mechanism is plausible, methods that use the incomplete data to estimate the propensity score, or that impute the missing values, should be preferred. Sensitivity analyses are encouraged to evaluate the implications of the methods used to handle missing data and estimate the propensity score.
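As one hedged sketch of the imputation-based route described above (using the mice package; the treatment and covariate names are hypothetical, and treatment is assumed fully observed), the propensity score can be estimated within each imputed data set and averaged across imputations:

    library(mice)

    imp <- mice(baseline, m = 20, seed = 7)   # impute baseline covariates

    # Fit the propensity model in each completed data set, then average the
    # fitted scores across imputations (the "within" approach).
    ps <- rowMeans(sapply(1:20, function(i) {
      d <- complete(imp, i)
      fitted(glm(treated ~ age + lvef + diabetes,
                 family = binomial, data = d))
    }))

Balance of baseline covariates should then be checked on the weighted or matched sample and, as the authors advise, sensitivity analyses run across the candidate missing-data methods.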


2018
Author(s):  
Kieu Trinh Do ◽  
Simone Wahl ◽  
Johannes Raffler ◽  
Sophie Molnos ◽  
Michael Laimighofer ◽  
...  

Abstract
BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values and of strategies to handle them has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or at random, for instance as a consequence of sample preparation.
METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined each method's ability to reconstruct biochemical pathways from data-driven correlation networks and to increase statistical power while preserving the strength of established metabolic quantitative trait loci.
RESULTS: Run day-dependent, LOD-based missingness accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.
CONCLUSION: Missing data in untargeted MS-based metabolomics arise for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.
Key messages:
- Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects.
- Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets.
- Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and to correctly estimate the effects of genetic variants on metabolite levels.
- KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
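As a concrete illustration of the recommended approach, the sketch below implements KNN imputation on observations with variable pre-selection in R using the VIM package; the data frame `metab` and the metabolite names are hypothetical, and the pre-selection rule (top 10 most correlated metabolites) is a simplification of the published scheme.

    library(VIM)   # kNN(): k-nearest-neighbours imputation

    # For one target metabolite: pre-select the 10 metabolites most correlated
    # with it, then impute from the K = 10 nearest observations measured in
    # that reduced variable space.
    target <- "met_001"
    cors   <- abs(cor(metab, use = "pairwise.complete.obs"))[target, ]
    preds  <- names(sort(cors[names(cors) != target], decreasing = TRUE))[1:10]
    metab_imp <- kNN(metab, variable = target, dist_var = preds, k = 10,
                     imp_var = FALSE)

In practice this step runs over every metabolite with missing values; pre-selection restricts the distance computation to informative variables, which is what makes the method computationally tractable at cohort scale.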


2021
Author(s):  
Trenton J. Davis ◽  
Tarek R. Firzli ◽  
Emily A. Higgins Keppler ◽  
Matt Richardson ◽  
Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected during data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types, as opposed to missing completely at random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR data, compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
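The within-replicate strategy evaluated above can be sketched in base R; `peaks` and `replicate_group` are hypothetical names, and the half-minimum rule stands in for whichever per-replicate imputation method is chosen (this sketch does not use the MetabImpute API; consult the package for the authors' implementation).

    # Impute each numeric peak column inside each replicate group separately,
    # here with the half-minimum rule often used for left-censored data.
    impute_half_min <- function(v) {
      v[is.na(v)] <- min(v, na.rm = TRUE) / 2   # assumes >= 1 observed value
      v
    }
    imputed <- do.call(rbind, lapply(split(peaks, peaks$replicate_group),
      function(block) {
        num <- vapply(block, is.numeric, logical(1))
        block[num] <- lapply(block[num], impute_half_min)
        block
      }))

Because values are filled only from measurements of the same biological replicate, technical variation between replicates is not smuggled into the imputed values, which is the mechanism behind the reproducibility gain reported above.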


2020
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract
Background: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data throughout the human genome. This naturally entails the management of missing values, which can complicate the analysis of the data. Several general and specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performance under different missing data mechanisms, (completely) at random or not, and under different representations of DNA methylation levels (β- and M-values).
Results: We make an extensive analysis of the imputation performance of seven imputation methods on simulated missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) methylation data. We further consider imputation performance on the two popular representations of methylation levels, β- and M-values. Overall, β-values enable better imputation performance than M-values. Imputation accuracy is lower for mid-range β-values, while it is generally higher for values at the extremes of the β-value range. The distribution of MAR values is on average denser in the mid-range than the expected β-value distribution; as a consequence, MAR values are on average harder to impute.
Conclusions: The results of the analysis provide guidelines for the most suitable imputation approaches for DNA methylation data under different representations of DNA methylation levels and different missing data mechanisms.
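Since the two representations discussed above are deterministic transforms of each other, imputation can be carried out on whichever scale performs better (here, β-values) and mapped back; a two-line R helper using the standard conversion formulas:

    beta_to_m <- function(beta) log2(beta / (1 - beta))  # M-value from beta
    m_to_beta <- function(m) 2^m / (2^m + 1)             # beta from M-value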

