Using the CES-D scale in a large cohort study and dealing with missing data: Application to the French E3N cohort

IntroductionThe CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases.ObjectivesTo explore reasons for not completing items of the CES-D scale and to perform sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses.Methods71412 women included in the French E3N cohort returned in 2005 a questionnaire containing the CES-D scale. 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete cases analysis, single imputation, multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions.ResultsThe interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons of nonresponse were identified. MAR and MNAR hypotheses remained plausible and were explored.Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, to 10 and to 20 missing values, respectively. The estimates were robust after applying various scenarios of MNAR data for the sensitivity analysis.ConclusionsThe CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under MAR assumption allows to reliably handle missing values.

Download Full-text

A review of the use of controlled multiple imputation in randomised controlled trials with missing outcome data

BMC Medical Research Methodology ◽

10.1186/s12874-021-01261-6 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Ping-Tee Tan ◽

Suzie Cro ◽

Eleanor Van Vogt ◽

Matyas Szigeti ◽

Victoria R. Cornelius

Keyword(s):

Sensitivity Analysis ◽

Missing Data ◽

Multiple Imputation ◽

Randomised Controlled Trials ◽

Missing At Random ◽

Sensitivity Analyses ◽

Controlled Trials ◽

Primary Analysis ◽

Randomised Controlled ◽

The Impact

Abstract Background Missing data are common in randomised controlled trials (RCTs) and can bias results if not handled appropriately. A statistically valid analysis under the primary missing-data assumptions should be conducted, followed by sensitivity analysis under alternative justified assumptions to assess the robustness of results. Controlled Multiple Imputation (MI) procedures, including delta-based and reference-based approaches, have been developed for analysis under missing-not-at-random assumptions. However, it is unclear how often these methods are used, how they are reported, and what their impact is on trial results. This review evaluates the current use and reporting of MI and controlled MI in RCTs. Methods A targeted review of phase II-IV RCTs (non-cluster randomised) published in two leading general medical journals (The Lancet and New England Journal of Medicine) between January 2014 and December 2019 using MI. Data was extracted on imputation methods, analysis status, and reporting of results. Results of primary and sensitivity analyses for trials using controlled MI analyses were compared. Results A total of 118 RCTs (9% of published RCTs) used some form of MI. MI under missing-at-random was used in 110 trials; this was for primary analysis in 43/118 (36%), and in sensitivity analysis for 70/118 (59%) (3 used in both). Sixteen studies performed controlled MI (1.3% of published RCTs), either with a delta-based (n = 9) or reference-based approach (n = 7). Controlled MI was mostly used in sensitivity analysis (n = 14/16). Two trials used controlled MI for primary analysis, including one reporting no sensitivity analysis whilst the other reported similar results without imputation. Of the 14 trials using controlled MI in sensitivity analysis, 12 yielded comparable results to the primary analysis whereas 2 demonstrated contradicting results. Only 5/110 (5%) trials using missing-at-random MI and 5/16 (31%) trials using controlled MI reported complete details on MI methods. Conclusions Controlled MI enabled the impact of accessible contextually relevant missing data assumptions to be examined on trial results. The use of controlled MI is increasing but is still infrequent and poorly reported where used. There is a need for improved reporting on the implementation of MI analyses and choice of controlled MI parameters.

Download Full-text

P427 A hybrid approach of handling missing data in inflammatory bowel disease (IBD) trials: results from VISIBLE 1 and VARSITY

Journal of Crohn s and Colitis ◽

10.1093/ecco-jcc/jjz203.556 ◽

2020 ◽

Vol 14 (Supplement_1) ◽

pp. S388-S389

Author(s):

J Chen ◽

S Hunter ◽

K Kisfalvi ◽

R A Lirio

Keyword(s):

Sensitivity Analysis ◽

Missing Data ◽

Statistical Power ◽

Hybrid Approach ◽

Missing At Random ◽

P Value ◽

Two Phase ◽

Treatment Difference ◽

Mayo Score ◽

The Impact

Abstract Background Missing data is common in IBD trials. Depending on the volume and nature of missing data, it can reduce statistical power for detecting treatment difference, introduce potential bias and invalidate conclusions. Non-responder imputation (NRI), where patients (patients) with missing data are considered treatment failures, is widely used to handle missing data for dichotomous efficacy endpoints in IBD trials. However, it does not consider the mechanisms leading to missing data and can potentially underestimate the treatment effect. We proposed a hybrid (HI) approach combining NRI and multiple imputation (MI) as an alternative to NRI in the analyses of two phase 3 trials of vedolizumab (VDZ) in patients with moderate-to-severe UC – VISIBLE 11 and VARSITY2. Methods VISIBLE 1 and VARSITY assessed efficacy using dichotomous endpoints based on complete Mayo score. Full methodologies reported previously.1,2 Our proposed HI approach is aimed at imputing missing Mayo scores, instead of imputing the missing dichotomous efficacy endpoint. To assess the impact of dropouts for different missing data mechanisms (categorised as ‘missing not at random [MNAR]’ and ‘missing at random [MAR]’, HI was implemented as a potential sensitivity analysis, where dropouts owing to safety or lack of efficacy were imputed using NRI (assuming MNAR) and other missing data were imputed using MI (assuming MAR). For MI, each component of the Mayo score was imputed via a multivariate stepwise approach using a fully conditional specification ordinal logistic method. Missing baseline scores were imputed using baseline characteristics data. Missing scores from each subsequent visit were imputed using all previous visits in a stepwise fashion. Fifty imputation datasets were computed for each component of Mayo score. The complete Mayo score and relevant efficacy endpoints were derived subsequently. The analysis was performed within each imputed dataset to determine treatment difference, 95% CI and p-value, which were then combined via Rubin’s rules3. Results Tables 1 and 2 show a comparison of efficacy in the two studies using the primary NRI analysis vs. the alternative HI approach for handling missing data. Conclusion HI and NRI approaches can provide consistent efficacy analyses in IBD trials. The HI approach can serve as a useful sensitivity analysis to assess the impact of dropouts under different missing data mechanisms and evaluate the robustness of efficacy conclusions. Reference

Download Full-text

Missing not at random in end of life care studies: multiple imputation and sensitivity analysis on data from the ACTION study

BMC Medical Research Methodology ◽

10.1186/s12874-020-01180-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Giulia Carreras ◽

◽

Guido Miccinesi ◽

Andrew Wilcock ◽

Nancy Preston ◽

...

Keyword(s):

Sensitivity Analysis ◽

Multiple Imputation ◽

End Of Life ◽

End Of Life Care ◽

Missing Values ◽

Controlled Trial ◽

Missing At Random ◽

Life Care ◽

Missing Not At Random ◽

Cluster Randomized

Abstract Background Missing data are common in end-of-life care studies, but there is still relatively little exploration of which is the best method to deal with them, and, in particular, if the missing at random (MAR) assumption is valid or missing not at random (MNAR) mechanisms should be assumed. In this paper we investigated this issue through a sensitivity analysis within the ACTION study, a multicenter cluster randomized controlled trial testing advance care planning in patients with advanced lung or colorectal cancer. Methods Multiple imputation procedures under MAR and MNAR assumptions were implemented. Possible violation of the MAR assumption was addressed with reference to variables measuring quality of life and symptoms. The MNAR model assumed that patients with worse health were more likely to have missing questionnaires, making a distinction between single missing items, which were assumed to satisfy the MAR assumption, and missing values due to completely missing questionnaire for which a MNAR mechanism was hypothesized. We explored the sensitivity to possible departures from MAR on gender differences between key indicators and on simple correlations. Results Up to 39% of follow-up data were missing. Results under MAR reflected that missingness was related to poorer health status. Correlations between variables, although very small, changed according to the imputation method, as well as the differences in scores by gender, indicating a certain sensitivity of the results to the violation of the MAR assumption. Conclusions The findings confirmed the importance of undertaking this kind of analysis in end-of-life care studies.

Download Full-text

Handling missing data in modelling quality of clinician-prescribed routine care: Sensitivity analysis of departure from missing at random assumption

Statistical Methods in Medical Research ◽

10.1177/0962280220918279 ◽

2020 ◽

Vol 29 (10) ◽

pp. 3076-3092 ◽

Cited By ~ 1

Author(s):

Susan Gachau ◽

Matteo Quartagno ◽

Edmund Njeru Njagi ◽

Nelson Owuor ◽

Mike English ◽

...

Keyword(s):

Sensitivity Analysis ◽

Missing Data ◽

Multiple Imputation ◽

Missing At Random ◽

Parameter Estimates ◽

Analysis Model ◽

Major Drawback ◽

Missing Not At Random ◽

Prior Distributions ◽

Random Mechanism

Missing information is a major drawback in analyzing data collected in many routine health care settings. Multiple imputation assuming a missing at random mechanism is a popular method to handle missing data. The missing at random assumption cannot be confirmed from the observed data alone, hence the need for sensitivity analysis to assess robustness of inference. However, sensitivity analysis is rarely conducted and reported in practice. We analyzed routine paediatric data collected during a cluster randomized trial conducted in Kenyan hospitals. We imputed missing patient and clinician-level variables assuming the missing at random mechanism. We also imputed missing clinician-level variables assuming a missing not at random mechanism. We incorporated opinions from 15 clinical experts in the form of prior distributions and shift parameters in the delta adjustment method. An interaction between trial intervention arm and follow-up time, hospital, clinician and patient-level factors were included in a proportional odds random-effects analysis model. We performed these analyses using R functions derived from the jomo package. Parameter estimates from multiple imputation under the missing at random mechanism were similar to multiple imputation estimates assuming the missing not at random mechanism. Our inferences were insensitive to departures from the missing at random assumption using either the prior distributions or shift parameters sensitivity analysis approach.

Download Full-text

Kernel weighted least square approach for imputing missing values of metabolomics data

Scientific Reports ◽

10.1038/s41598-021-90654-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nishith Kumar ◽

Md. Aminul Hoque ◽

Masahiro Sugimoto

Keyword(s):

Missing Data ◽

Large Scale ◽

Missing Values ◽

Kernel Weight ◽

Least Square ◽

Data Matrix ◽

Data Imputation ◽

Metabolomics Data ◽

Missing Value ◽

Missing Data Imputation

AbstractMass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contain missing cells in the data matrix as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, all conventional existing techniques only solve the missing value problems. They do not relieve the problems of outliers. Therefore, outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based proposed missing data imputation technique that resolves the problems of missing values and outliers. We evaluated the performance of the proposed method and other conventional and recently developed missing imputation techniques using both artificially generated data and experimentally measured data analysis in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.

Download Full-text

A data-driven missing value imputation approach for longitudinal datasets

Artificial Intelligence Review ◽

10.1007/s10462-021-09963-5 ◽

2021 ◽

Author(s):

Caio Ribeiro ◽

Alex A. Freitas

Keyword(s):

Missing Data ◽

Longitudinal Data ◽

Missing Values ◽

Error Rates ◽

Imputation Method ◽

Data Driven ◽

Missing Value ◽

Missing Value Imputation ◽

Human Ageing ◽

Imputation Approach

AbstractLongitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.

Download Full-text

Impact of strong El Niño events on river discharge in South America

10.5194/egusphere-egu21-10383 ◽

2021 ◽

Author(s):

Markus Deppner ◽

Bedartha Goswami

Keyword(s):

Machine Learning ◽

Missing Data ◽

South America ◽

River Discharge ◽

Amazon Basin ◽

Missing Values ◽

Southern Oscillation ◽

Enso Events ◽

Streamflow Data ◽

The Impact

<p>The impact of the El Ni&#241;o Southern Oscillation (ENSO) on rivers are well known, but most existing studies involving streamflow data are severely limited by data coverage. Time series of gauging stations fade in and out over time, which makes hydrological large scale and long time analysis or studies of rarely occurring extreme events challenging. Here, we use a machine learning approach to infer missing streamflow data based on temporal correlations of stations with missing values to others with data. By using 346 stations, from the &#8220;Global Streamflow Indices and Metadata archive&#8221; (GSIM), that initially cover the 40 year timespan in conjunction with Gaussian processes we were able to extend our data by estimating missing data for an additional 646 stations, allowing us to include a total of 992 stations. We then investigate the impact of the 6 strongest El Ni&#241;o (EN) events on rivers in South America between 1960 and 2000. Our analysis shows a strong correlation between ENSO events and extreme river dynamics in the southeast of Brazil, Carribean South America and parts of the Amazon basin. Furthermore we see a peak in the number of stations showing maximum river discharge all over Brazil during the EN of 1982/83 which has been linked to severe floods in the east of Brazil, parts of Uruguay and Paraguay. However EN events in other years with similar intensity did not evoke floods with such magnitude and therefore the additional drivers of the 1982/83&#160; floods need further investigation. By using machine learning methods to infer data for gauging stations with missing data we were able to extend our data by almost three-fold, revealing a possible heavier and spatially larger impact of the 1982/83 EN on South America's hydrology than indicated in literature.</p>

Download Full-text

Primary Care Consultations after Hospitalisation for Pneumonia: A Large Population-based Cohort Study

British Journal of General Practice ◽

10.3399/bjgp.2020.0890 ◽

2020 ◽

pp. BJGP.2020.0890

Author(s):

Vadsala Baskaran ◽

Fiona Pearce ◽

Rowan H Harwood ◽

Tricia McKeever ◽

Wei Shen Lim

Keyword(s):

Primary Care ◽

Cohort Study ◽

Large Population ◽

Significant Proportion ◽

Antibiotic Use ◽

Population Based ◽

Practice Research ◽

Clinical Practice Research Datalink ◽

Respiratory Disorder ◽

The Impact

Background: Up to 70% of patients report ongoing symptoms four weeks after hospitalisation for pneumonia, and the impact on primary care is poorly understood. Aim: To investigate the frequency of primary care consultations after hospitalisation for pneumonia, and the reasons for consultation. Design: Population-based cohort study. Setting: UK primary care database of anonymised medical records (Clinical Practice Research Datalink, CPRD) linked to Hospital Episode Statistics (HES), England. Methods: Adults with the first ICD-10 code for pneumonia (J12-J18) recorded in HES between July 2002-June 2017 were included. Primary care consultation within 30 days of discharge was identified as the recording of any medical Read code (excluding administration-related codes) in CPRD. Competing-risks regression analyses were conducted to determine the predictors of consultation and antibiotic use at consultation; death and readmission were competing events. Reasons for consultation were examined. Results: Of 56,396 adults, 55.9% (n=31,542) consulted primary care within 30 days of discharge. The rate of consultation was highest within 7 days (4.7 per 100 person-days). The strongest predictor for consultation was a higher number of primary care consultations in the year prior to index admission (adjusted sHR 8.98, 95% CI 6.42-12.55). The commonest reason for consultation was for a respiratory disorder (40.7%, n=12,840), 12% for pneumonia specifically. At consultation, 31.1% (n=9,823) received further antibiotics. Penicillins (41.6%, n=5,753) and macrolides (21.9%, n=3,029) were the commonest antibiotics prescribed. Conclusion: Following hospitalisation for pneumonia, a significant proportion of patients consulted primary care within 30 days, highlighting the morbidity experienced by patients during recovery from pneumonia.

Download Full-text

Weighted multiple imputation of ethnicity data that are missing not at random in primary care databases

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.54 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Tra My Pham ◽

Irene Petersen ◽

James Carpenter ◽

Tim Morris

Keyword(s):

Primary Care ◽

Missing Data ◽

Multiple Imputation ◽

Simulation Study ◽

Case Analysis ◽

Missing At Random ◽

Complete Case ◽

Missing Not At Random ◽

Health Records ◽

Ethnicity Data

ABSTRACT BackgroundEthnicity is an important factor to be considered in health research because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and hence is available in large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected for clinical purposes, a large amount of data that are relevant for research including ethnicity is often missing. A popular approach for missing data is multiple imputation (MI). However, the conventional MI method assuming data are missing at random does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. This might be due to the fact that ethnicity data in primary care are likely to be missing not at random. ObjectivesI propose a new MI method, termed ‘weighted multiple imputation’, to deal with data that are missing not at random in categorical variables.MethodsWeighted MI combines MI and probability weights which are calculated using external data sources. Census summary statistics for ethnicity can be used to form weights in weighted MI such that the correct marginal ethnic breakdown is recovered in THIN. I conducted a simulation study to examine weighted MI when ethnicity data are missing not at random. In this simulation study which resembled a THIN dataset, ethnicity was an independent variable in a survival model alongside other covariates. Weighted MI was compared to the conventional MI and other traditional missing data methods including complete case analysis and single imputation.ResultsWhile a small bias was still present in ethnicity coefficient estimates under weighted MI, it was less severe compared to MI assuming missing at random. Complete case analysis and single imputation were inadequate to handle data that are missing not at random in ethnicity.ConclusionsAlthough not a total cure, weighted MI represents a pragmatic approach that has potential applications not only in ethnicity but also in other incomplete categorical health indicators in electronic health records.

Download Full-text

Comparison of Missing Data Infilling Mechanisms for Recovering a Real-World Single Station Streamflow Observation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18168375 ◽

2021 ◽

Vol 18 (16) ◽

pp. 8375

Author(s):

Thelma Dede Baddoo ◽

Zhijia Li ◽

Samuel Nii Odai ◽

Kenneth Rodolphe Chabi Boni ◽

Isaac Kwesi Nooni ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Real World ◽

Missing Values ◽

Total Error ◽

Extensive Study ◽

Error Measurement ◽

Missing Data Imputation ◽

Single Station ◽

Real World Datasets

Reconstructing missing streamflow data can be challenging when additional data are not available, and missing data imputation of real-world datasets to investigate how to ascertain the accuracy of imputation algorithms for these datasets are lacking. This study investigated the necessary complexity of missing data reconstruction schemes to obtain the relevant results for a real-world single station streamflow observation to facilitate its further use. This investigation was implemented by applying different missing data mechanisms spanning from univariate algorithms to multiple imputation methods accustomed to multivariate data taking time as an explicit variable. The performance accuracy of these schemes was assessed using the total error measurement (TEM) and a recommended localized error measurement (LEM) in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones which provide the best results are usually time and computationally intensive. Also, multiple imputation algorithms which consider the surrounding observed values and/or which can understand the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM would be especially useful when the missing data are in specific portions of the dataset or where very large gaps of ‘missingness’ occur. Finally, proper handling of missing values of real-world hydroclimatic datasets depends on imputing and extensive study of the particular dataset to be imputed.

Download Full-text