scholarly journals Missing data: the impact of what is not there

2020 ◽  
Vol 183 (4) ◽  
pp. E7-E9
Author(s):  
Rolf H H Groenwold ◽  
Olaf M Dekkers

The validity of clinical research is potentially threatened by missing data. Any variable measured in a study can have missing values, including the exposure, the outcome, and confounders. When missing values are ignored in the analysis, only those subjects with complete records will be included in the analysis. This may lead to biased results and loss of power. We explain why missing data may lead to bias and discuss a commonly used classification of missing data.

2021 ◽  
Author(s):  
Markus Deppner ◽  
Bedartha Goswami

<p>The impact of the El Niño Southern Oscillation (ENSO) on rivers are well known, but most existing studies involving streamflow data are severely limited by data coverage. Time series of gauging stations fade in and out over time, which makes hydrological large scale and long time analysis or studies of rarely occurring extreme events challenging. Here, we use a machine learning approach to infer missing streamflow data based on temporal correlations of stations with missing values to others with data. By using 346 stations, from the “Global Streamflow Indices and Metadata archive” (GSIM), that initially cover the 40 year timespan in conjunction with Gaussian processes we were able to extend our data by estimating missing data for an additional 646 stations, allowing us to include a total of 992 stations. We then investigate the impact of the 6 strongest El Niño (EN) events on rivers in South America between 1960 and 2000. Our analysis shows a strong correlation between ENSO events and extreme river dynamics in the southeast of Brazil, Carribean South America and parts of the Amazon basin. Furthermore we see a peak in the number of stations showing maximum river discharge all over Brazil during the EN of 1982/83 which has been linked to severe floods in the east of Brazil, parts of Uruguay and Paraguay. However EN events in other years with similar intensity did not evoke floods with such magnitude and therefore the additional drivers of the 1982/83  floods need further investigation. By using machine learning methods to infer data for gauging stations with missing data we were able to extend our data by almost three-fold, revealing a possible heavier and spatially larger impact of the 1982/83 EN on South America's hydrology than indicated in literature.</p>


2020 ◽  
Vol 07 (02) ◽  
pp. 161-177
Author(s):  
Oyekale Abel Alade ◽  
Ali Selamat ◽  
Roselina Sallehuddin

One major characteristic of data is completeness. Missing data is a significant problem in medical datasets. It leads to incorrect classification of patients and is dangerous to the health management of patients. Many factors lead to the missingness of values in databases in medical datasets. In this paper, we propose the need to examine the causes of missing data in a medical dataset to ensure that the right imputation method is used in solving the problem. The mechanism of missingness in datasets was studied to know the missing pattern of datasets and determine a suitable imputation technique to generate complete datasets. The pattern shows that the missingness of the dataset used in this study is not a monotone missing pattern. Also, single imputation techniques underestimate variance and ignore relationships among the variables; therefore, we used multiple imputations technique that runs in five iterations for the imputation of each missing value. The whole missing values in the dataset were 100% regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show improvement in the accuracy of the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset with different classifiers like support vector machine (SVM), radial basis function (RBF), and ELMs.


2018 ◽  
Author(s):  
Mark A. Eckert ◽  
Kenneth I. Vaden ◽  
Mulugeta Gebregziabher ◽  

AbstractChildren with reading disability exhibit varied deficits in reading and cognitive abilities that contribute to their reading comprehension problems. Some children exhibit primary deficits in phonological processing, while others can exhibit deficits in oral language and executive functions that affect comprehension. This behavioral heterogeneity is problematic when missing data prevent the characterization of different reading profiles, which often occurs in retrospective data sharing initiatives without coordinated data collection. Here we show that reading profiles can be reliably identified based on Random Forest classification of incomplete behavioral datasets, after the missForest method is used to multiply impute missing values. Results from simulation analyses showed that reading profiles could be accurately classified across degrees of missingness (e.g., ~5% classification error for 30% missingness across the sample). The application of missForest to a real multi-site dataset (n = 924) showed that reading disability profiles significantly and consistently differed in reading and cognitive abilities for cases with and without missing data. The results of validation analyses indicated that the reading profiles (cases with and without missing data) exhibited significant differences for an independent set of behavioral variables that were not used to classify reading profiles. Together, the results show how multiple imputation can be applied to the classification of cases with missing data and can increase the integrity of results from multi-site open access datasets.


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e9993
Author(s):  
Christian Niederwanger ◽  
Thomas Varga ◽  
Tobias Hell ◽  
Daniel Stuerzel ◽  
Jennifer Prem ◽  
...  

Background Scores can assess the severity and course of disease and predict outcome in an objective manner. This information is needed for proper risk assessment and stratification. Furthermore, scoring systems support optimal patient care, resource management and are gaining in importance in terms of artificial intelligence. Objective This study evaluated and compared the prognostic ability of various common pediatric scoring systems (PRISM, PRISM III, PRISM IV, PIM, PIM2, PIM3, PELOD, PELOD 2) in order to determine which is the most applicable score for pediatric sepsis patients in terms of timing of disease survey and insensitivity to missing data. Methods We retrospectively examined data from 398 patients under 18 years of age, who were diagnosed with sepsis. Scores were assessed at ICU admission and re-evaluated on the day of peak C-reactive protein. The scores were compared for their ability to predict mortality in this specific patient population and for their impairment due to missing data. Results PIM (AUC 0.76 (0.68–0.76)), PIM2 (AUC 0.78 (0.72–0.78)) and PIM3 (AUC 0.76 (0.68–0.76)) scores together with PRSIM III (AUC 0.75 (0.68–0.75)) and PELOD 2 (AUC 0.75 (0.66–0.75)) are the most suitable scores for determining patient prognosis at ICU admission. Once sepsis is pronounced, PELOD 2 (AUC 0.84 (0.77–0.91)) and PRISM IV (AUC 0.8 (0.72–0.88)) become significantly better in their performance and count among the best prognostic scores for use at this time together with PRISM III (AUC 0.81 (0.73–0.89)). PELOD 2 is good for monitoring and, like the PIM scores, is also largely insensitive to missing values. Conclusion Overall, PIM scores show comparatively good performance, are stable as far as timing of the disease survey is concerned, and they are also relatively stable in terms of missing parameters. PELOD 2 is best suitable for monitoring clinical course.


2011 ◽  
Vol 26 (S2) ◽  
pp. 572-572
Author(s):  
N. Resseguier ◽  
H. Verdoux ◽  
F. Clavel-Chapelon ◽  
X. Paoletti

IntroductionThe CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases.ObjectivesTo explore reasons for not completing items of the CES-D scale and to perform sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses.Methods71412 women included in the French E3N cohort returned in 2005 a questionnaire containing the CES-D scale. 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete cases analysis, single imputation, multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions.ResultsThe interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons of nonresponse were identified. MAR and MNAR hypotheses remained plausible and were explored.Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, to 10 and to 20 missing values, respectively. The estimates were robust after applying various scenarios of MNAR data for the sensitivity analysis.ConclusionsThe CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under MAR assumption allows to reliably handle missing values.


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Kamran Mehrabani-Zeinabad ◽  
Marziyeh Doostfatemeh ◽  
Seyyed Mohammad Taghi Ayatollahi

Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on the dataset conditions such as sample size, missing percent, and missing mechanism. Therefore, the better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” approach to solve its inefficiency in handling the missingness problem. Implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m are not investigated in classification of incomplete datasets, this simulation-based study aimed to provide such resource. The results indicate that BART.m can be used even for datasets with 90 missing present and more importantly, it diagnoses the irrelevant variables and removes them by its own. BART.m outperforms common models for classification with incomplete data, according to accuracy and computational time. Based on the revealed properties, it can be said that BART.m is a high accuracy model in classification of incomplete datasets which avoids any assumptions and preprocess steps.


Author(s):  
Yuzhe Liu ◽  
Vanathi Gopalakrishnan

Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.


Sensors ◽  
2019 ◽  
Vol 19 (14) ◽  
pp. 3163 ◽  
Author(s):  
Davide Morelli ◽  
Alessio Rossi ◽  
Massimo Cairo ◽  
David A. Clifton

Wearable physiological monitors have become increasingly popular, often worn during people’s daily life, collecting data 24 hours a day, 7 days a week. In the last decade, these devices have attracted the attention of the scientific community as they allow us to automatically extract information about user physiology (e.g., heart rate, sleep quality and physical activity) enabling inference on their health. However, the biggest issue about the data recorded by wearable devices is the missing values due to motion and mechanical artifacts induced by external stimuli during data acquisition. This missing data could negatively affect the assessment of heart rate (HR) response and estimation of heart rate variability (HRV), that could in turn provide misleading insights concerning the health status of the individual. In this study, we focus on healthy subjects with normal heart activity and investigate the effects of missing variation of the timing between beats (RR-intervals) caused by motion artifacts on HRV features estimation by randomly introducing missing values within a five min time windows of RR-intervals obtained from the nsr2db PhysioNet dataset by using Gilbert burst method. We then evaluate several strategies for estimating HRV in the presence of missing values by interpolating periods of missing values, covering the range of techniques often deployed in the literature, via linear, quadratic, cubic, and cubic spline functions. We thereby compare the HRV features obtained by handling missing data in RR-interval time series against HRV features obtained from the same data without missing values. Finally, we assess the difference between the use of interpolation methods on time (i.e., the timestamp when the heartbeats happen) and on duration (i.e., the duration of the heartbeats), in order to identify the best methodology to handle the missing RR-intervals. The main novel finding of this study is that the interpolation of missing data on time produces more reliable HRV estimations when compared to interpolation on duration. Hence, we can conclude that interpolation on duration modifies the power spectrum of the RR signal, negatively affecting the estimation of the HRV features as the amount of missing values increases. We can conclude that interpolation in time is the optimal method among those considered for handling data with large amounts of missing values, such as data from wearable sensors.


2021 ◽  
Vol 11 (12) ◽  
pp. 1356
Author(s):  
Carlos Traynor ◽  
Tarjinder Sahota ◽  
Helen Tomkinson ◽  
Ignacio Gonzalez-Garcia ◽  
Neil Evans ◽  
...  

Missing data is a universal problem in analysing Real-World Evidence (RWE) datasets. In RWE datasets, there is a need to understand which features best correlate with clinical outcomes. In this context, the missing status of several biomarkers may appear as gaps in the dataset that hide meaningful values for analysis. Imputation methods are general strategies that replace missing values with plausible values. Using the Flatiron NSCLC dataset, including more than 35,000 subjects, we compare the imputation performance of six such methods on missing data: predictive mean matching, expectation-maximisation, factorial analysis, random forest, generative adversarial networks and multivariate imputations with tabular networks. We also conduct extensive synthetic data experiments with structural causal models. Statistical learning from incomplete datasets should select an appropriate imputation algorithm accounting for the nature of missingness, the impact of missing data, and the distribution shift induced by the imputation algorithm. For our synthetic data experiments, tabular networks had the best overall performance. Methods using neural networks are promising for complex datasets with non-linearities. However, conventional methods such as predictive mean matching work well for the Flatiron NSCLC biomarker dataset.


2021 ◽  
pp. 001316442110220
Author(s):  
David Goretzko

Determining the number of factors in exploratory factor analysis is arguably the most crucial decision a researcher faces when conducting the analysis. While several simulation studies exist that compare various so-called factor retention criteria under different data conditions, little is known about the impact of missing data on this process. Hence, in this study, we evaluated the performance of different factor retention criteria—the Factor Forest, parallel analysis based on a principal component analysis as well as parallel analysis based on the common factor model and the comparison data approach—in combination with different missing data methods, namely an expectation-maximization algorithm called Amelia, predictive mean matching, and random forest imputation within the multiple imputations by chained equations (MICE) framework as well as pairwise deletion with regard to their accuracy in determining the number of factors when data are missing. Data were simulated for different sample sizes, numbers of factors, numbers of manifest variables (indicators), between-factor correlations, missing data mechanisms and proportions of missing values. In the majority of conditions and for all factor retention criteria except the comparison data approach, the missing data mechanism had little impact on the accuracy and pairwise deletion performed comparably well as the more sophisticated imputation methods. In some conditions, especially small-sample cases and when comparison data were used to determine the number of factors, random forest imputation was preferable to other missing data methods, though. Accordingly, depending on data characteristics and the selected factor retention criterion, choosing an appropriate missing data method is crucial to obtain a valid estimate of the number of factors to extract.


Sign in / Sign up

Export Citation Format

Share Document