Other Issues in Statistics I

Author(s):  
Tamara Jorquiera ◽  
Hang Lee ◽  
Felipe Fregni ◽  
Andre Brunoni

This chapter discusses the problem of incomplete or missing data. It examines the three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). It discusses how to reduce the occurrence of missing data through trial design and improved data collection, and presents strategies for handling it at the analysis stage: not replacing the missing data (complete case analysis), replacing each missing value with a single value (single imputation), or replacing each missing observation with multiple values (multiple imputation). It then discusses sensitivity analysis, which measures how the results change under different methods of handling missing data and helps justify the choice of the particular method applied. Finally, it reviews covariate adjustment as another topic in statistics.
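As a rough illustration of the first two strategies described above, the short sketch below contrasts complete case analysis with single (mean) imputation on a toy dataset in Python; the DataFrame, column names and the choice of mean imputation are assumptions for illustration, not an example from the chapter.

```python
import numpy as np
import pandas as pd

# Toy dataset with a partially missing measurement (illustrative values only).
df = pd.DataFrame({
    "age": [54, 61, 47, 58, 65],
    "sbp": [132.0, np.nan, 121.0, 140.0, np.nan],  # systolic blood pressure
})

# Complete case analysis: drop every row that has any missing value.
complete_cases = df.dropna()
print("Complete case mean SBP:", complete_cases["sbp"].mean())

# Single imputation: replace each missing value with one value (here, the observed mean).
single_imputed = df.assign(sbp=df["sbp"].fillna(df["sbp"].mean()))
print("Mean-imputed mean SBP:", single_imputed["sbp"].mean())
```

Mean imputation leaves the point estimate of the mean unchanged here but artificially shrinks its standard error, which is the usual motivation for moving to multiple imputation and for checking the choice with a sensitivity analysis.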

Author(s):  
Tra My Pham ◽  
Irene Petersen ◽  
James Carpenter ◽  
Tim Morris

Abstract Background: Ethnicity is an important factor to be considered in health research because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and hence is available in large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected for clinical purposes, a large amount of data that are relevant for research, including ethnicity, is often missing. A popular approach for missing data is multiple imputation (MI). However, the conventional MI method assuming data are missing at random does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. This might be due to the fact that ethnicity data in primary care are likely to be missing not at random. Objectives: I propose a new MI method, termed 'weighted multiple imputation', to deal with data that are missing not at random in categorical variables. Methods: Weighted MI combines MI and probability weights which are calculated using external data sources. Census summary statistics for ethnicity can be used to form weights in weighted MI such that the correct marginal ethnic breakdown is recovered in THIN. I conducted a simulation study to examine weighted MI when ethnicity data are missing not at random. In this simulation study, which resembled a THIN dataset, ethnicity was an independent variable in a survival model alongside other covariates. Weighted MI was compared to the conventional MI and other traditional missing data methods including complete case analysis and single imputation. Results: While a small bias was still present in ethnicity coefficient estimates under weighted MI, it was less severe compared to MI assuming missing at random. Complete case analysis and single imputation were inadequate to handle data that are missing not at random in ethnicity. Conclusions: Although not a total cure, weighted MI represents a pragmatic approach that has potential applications not only in ethnicity but also in other incomplete categorical health indicators in electronic health records.
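To convey the general flavour of combining imputation with external marginal information (a loose sketch of the reweighting idea, not the thesis's weighted MI algorithm; the data, model and census proportions below are assumptions), one can reweight the predicted class probabilities of a categorical variable towards external margins before drawing imputations:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical records: ethnicity is partially missing; age and deprivation score are observed.
df = pd.DataFrame({
    "age": rng.normal(50, 15, n),
    "townsend": rng.normal(0, 1, n),
    "ethnicity": rng.choice(["White", "Asian", "Black", "Other"], n, p=[0.80, 0.10, 0.06, 0.04]),
})
df.loc[rng.random(n) < 0.4, "ethnicity"] = np.nan  # inject missingness

# Assumed external margins (illustrative numbers only, not real census figures).
census = {"White": 0.86, "Asian": 0.075, "Black": 0.034, "Other": 0.031}

obs = df.dropna(subset=["ethnicity"])
mis = df[df["ethnicity"].isna()]
model = LogisticRegression(max_iter=1000).fit(obs[["age", "townsend"]], obs["ethnicity"])
classes = model.classes_
probs = model.predict_proba(mis[["age", "townsend"]])

# Reweight the predicted class probabilities towards the external margins, then renormalise.
observed_margin = np.array([(obs["ethnicity"] == c).mean() for c in classes])
target_margin = np.array([census[c] for c in classes])
probs = probs * (target_margin / observed_margin)
probs = probs / probs.sum(axis=1, keepdims=True)

# Draw several completed datasets (multiple imputation of the categorical variable).
imputations = []
for _ in range(5):
    draws = [rng.choice(classes, p=p) for p in probs]
    completed = df.copy()
    completed.loc[mis.index, "ethnicity"] = draws
    imputations.append(completed)
```

The reweighting here recovers the target margin only approximately; the point is that external summary statistics can pull the imputed ethnic breakdown towards the population distribution when the missing-at-random assumption is implausible.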


2021 ◽  
Author(s):  
Trenton J. Davis ◽  
Tarek R. Firzli ◽  
Emily A. Higgins Keppler ◽  
Matt Richardson ◽  
Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected during data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types as opposed to missing completely at random (MCAR). Gibbs sampler imputation and random forest gave the best results when imputing MAR and MNAR data compared with single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
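A minimal sketch of the within-replicate idea in Python (not the MetabImpute implementation; the peak table, grouping column and the half-minimum rule are assumptions used only for illustration):

```python
import numpy as np
import pandas as pd

# Toy peak table: rows are injections, grouped into biological replicates (illustrative values).
peaks = pd.DataFrame({
    "replicate_group": ["A", "A", "A", "B", "B", "B"],
    "peak_1": [10.2, np.nan, 9.8, 55.1, 54.3, np.nan],
    "peak_2": [np.nan, np.nan, np.nan, 7.7, 8.1, 7.9],
})

def half_minimum(series: pd.Series) -> pd.Series:
    """Within one replicate group, fill missing values with half of the group minimum.
    If a peak is missing in the entire group, leave it missing (treated as truly absent)."""
    if series.notna().any():
        return series.fillna(series.min() / 2.0)
    return series

imputed = peaks.groupby("replicate_group", group_keys=False)[["peak_1", "peak_2"]].apply(
    lambda group: group.apply(half_minimum)
)
print(imputed)
```

Imputing within replicate groups prevents a feature that is genuinely absent in one biological group from being filled in with intensities borrowed from another group, which is what tends to improve the reproducibility of peak quantification.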


2020 ◽  
Vol 18 (2) ◽  
pp. 2-6
Author(s):  
Thomas R. Knapp

Rubin (1976, and elsewhere) claimed that there are three kinds of “missingness”: missing completely at random; missing at random; and missing not at random. He gave examples of each. The article that now follows takes an opposing view by arguing that almost all missing data are missing not at random.


2011 ◽  
Vol 21 (3) ◽  
pp. 243-256 ◽  
Author(s):  
Rhian M Daniel ◽  
Michael G Kenward ◽  
Simon N Cousens ◽  
Bianca L De Stavola

Estimating causal effects from incomplete data requires additional and inherently untestable assumptions regarding the mechanism giving rise to the missing data. We show that using causal diagrams to represent these additional assumptions both complements and clarifies some of the central issues in missing data theory, such as Rubin's classification of missingness mechanisms (as missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR)) and the circumstances in which causal effects can be estimated without bias by analysing only the subjects with complete data. In doing so, we formally extend the back-door criterion of Pearl and others for use in incomplete data examples. These ideas are illustrated with an example drawn from an occupational cohort study of the effect of cosmic radiation on skin cancer incidence.
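As a loose complement to the graphical argument (the m-graph below and the use of networkx's d-separation helper are assumptions for illustration, not the paper's occupational cohort example), one can check whether the outcome is d-separated from its missingness indicator given the analysis variables, which is the kind of graphical condition under which a complete case analysis can be unbiased:

```python
import networkx as nx

# m-graph (assumed for illustration): C confounds exposure E and outcome Y,
# and R_Y is the missingness indicator for Y, driven here only by E and C.
g = nx.DiGraph()
g.add_edges_from([("C", "E"), ("C", "Y"), ("E", "Y"), ("E", "R_Y"), ("C", "R_Y")])

# networkx renamed d_separated to is_d_separator in recent releases; take whichever exists.
d_sep = getattr(nx, "is_d_separator", None) or getattr(nx, "d_separated")

# If Y is d-separated from R_Y given {E, C}, missingness carries no extra information
# about Y once E and C are conditioned on, the kind of condition under which a
# complete case analysis adjusting for E and C can be unbiased.
print(d_sep(g, {"Y"}, {"R_Y"}, {"E", "C"}))   # expected: True

# If Y drives its own missingness (MNAR), the separation fails.
g.add_edge("Y", "R_Y")
print(d_sep(g, {"Y"}, {"R_Y"}, {"E", "C"}))   # expected: False
```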


Missing data are inescapable in clinical research, and individuals with missing data may differ from those with complete data with respect to the outcome of interest and prognosis in general. Missing data are of three types: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). In medical studies, missing data are to a great extent MCAR. Missing data can create serious difficulties in the estimation and interpretation of results and can undermine the validity of the conclusions. Various strategies have been developed for handling missing data. These include complete case analyses, the missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. When the MCAR assumption holds, some of these methods can provide unbiased, though often less precise, estimates. Multiple imputation is an alternative method for handling missing data that accounts for the uncertainty associated with the missing values. It is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on the information in the available data. The method affects not only the coefficient estimates for variables with missing data, but also the estimates for other variables with no missing data.
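For a concrete, generic illustration of multiple imputation under the MAR assumption with pooling by Rubin's rules (a sketch on simulated data; the variable names, the model and the use of statsmodels' MICE are assumptions, not tied to any study above):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(1)
n = 500

# Simulated example: x2 is partially missing, with missingness depending on x1 (an MAR mechanism).
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
df.loc[rng.random(n) < 0.3 * (x1 > 0), "x2"] = np.nan  # missing more often when x1 > 0

# Chained-equations imputation: each cycle re-imputes x2 from the other variables.
imp = mice.MICEData(df)

# Fit the analysis model in each imputed dataset and pool the estimates with Rubin's rules.
results = mice.MICE("y ~ x1 + x2", sm.OLS, imp).fit(n_burnin=10, n_imputations=20)
print(results.summary())
```

The pooled summary combines within- and between-imputation variability, which is what distinguishes multiple imputation from single value imputation.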


2020 ◽  
Vol 14 (Supplement_1) ◽  
pp. S388-S389
Author(s):  
J Chen ◽  
S Hunter ◽  
K Kisfalvi ◽  
R A Lirio

Abstract Background: Missing data is common in IBD trials. Depending on the volume and nature of missing data, it can reduce statistical power for detecting treatment differences, introduce potential bias and invalidate conclusions. Non-responder imputation (NRI), where patients with missing data are considered treatment failures, is widely used to handle missing data for dichotomous efficacy endpoints in IBD trials. However, it does not consider the mechanisms leading to missing data and can potentially underestimate the treatment effect. We proposed a hybrid imputation (HI) approach combining NRI and multiple imputation (MI) as an alternative to NRI in the analyses of two phase 3 trials of vedolizumab (VDZ) in patients with moderate-to-severe UC, VISIBLE 1 and VARSITY. Methods: VISIBLE 1 and VARSITY assessed efficacy using dichotomous endpoints based on the complete Mayo score; full methodologies have been reported previously.1,2 Our proposed HI approach imputes the missing Mayo scores rather than the missing dichotomous efficacy endpoint. To assess the impact of dropouts under different missing data mechanisms (categorised as missing not at random [MNAR] and missing at random [MAR]), HI was implemented as a potential sensitivity analysis, where dropouts owing to safety or lack of efficacy were imputed using NRI (assuming MNAR) and other missing data were imputed using MI (assuming MAR). For MI, each component of the Mayo score was imputed via a multivariate stepwise approach using a fully conditional specification ordinal logistic method. Missing baseline scores were imputed using baseline characteristics data. Missing scores from each subsequent visit were imputed using all previous visits in a stepwise fashion. Fifty imputed datasets were generated for each component of the Mayo score, and the complete Mayo score and relevant efficacy endpoints were derived subsequently. The analysis was performed within each imputed dataset to determine the treatment difference, 95% CI and p-value, which were then combined via Rubin's rules.3 Results: Tables 1 and 2 show a comparison of efficacy in the two studies using the primary NRI analysis vs. the alternative HI approach for handling missing data. Conclusion: HI and NRI approaches can provide consistent efficacy analyses in IBD trials. The HI approach can serve as a useful sensitivity analysis to assess the impact of dropouts under different missing data mechanisms and evaluate the robustness of efficacy conclusions.
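A heavily simplified sketch of the hybrid idea (not the trials' actual analysis code; the column names, the remission threshold and the use of scikit-learn's IterativeImputer as a stand-in for the fully conditional specification step are all assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n = 200

# Toy trial data (assumed): the week-52 Mayo score is missing for all dropouts.
df = pd.DataFrame({
    "baseline_mayo": rng.integers(6, 13, n).astype(float),
    "week6_mayo": rng.integers(2, 13, n).astype(float),
    "week52_mayo": rng.integers(0, 13, n).astype(float),
    "dropout_reason": rng.choice(
        ["none", "adverse_event", "lack_of_efficacy", "other"], n, p=[0.7, 0.1, 0.1, 0.1]
    ),
})
df.loc[df["dropout_reason"] != "none", "week52_mayo"] = np.nan

# NRI component (assumed MNAR): dropouts for safety or lack of efficacy are treatment failures.
nri_mask = df["dropout_reason"].isin(["adverse_event", "lack_of_efficacy"])

# MI component (assumed MAR): impute the remaining missing scores from the observed visits.
score_cols = ["baseline_mayo", "week6_mayo", "week52_mayo"]
responder_rates = []
for m in range(20):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = df.copy()
    completed[score_cols] = imputer.fit_transform(df[score_cols])
    # Derive the dichotomous endpoint (remission threshold assumed for illustration),
    # then force the NRI cases to non-response regardless of their imputed score.
    responder = completed["week52_mayo"] <= 2
    responder[nri_mask] = False
    responder_rates.append(responder.mean())

print("Responder rate averaged over imputations:", np.mean(responder_rates))
```

In a real analysis the treatment comparison would be performed within each imputed dataset and the results combined with Rubin's rules, as described above.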


2020 ◽  
Author(s):  
Suzie Cro ◽  
Tim P Morris ◽  
Brennan C Kahan ◽  
Victoria R Cornelius ◽  
James R Carpenter

Abstract Background: The coronavirus pandemic (Covid-19) presents a variety of challenges for ongoing clinical trials, including an inevitably higher rate of missing outcome data, with new and non-standard reasons for missingness. International drug trial guidelines recommend trialists review plans for handling missing data in the conduct and statistical analysis, but clear recommendations are lacking. Methods: We present a four-step strategy for handling missing outcome data in the analysis of randomised trials that are ongoing during a pandemic. We consider handling missing data arising due to (i) participant infection, (ii) treatment disruptions and (iii) loss to follow-up. We consider both settings where treatment effects for a 'pandemic-free world' and a 'world including a pandemic' are of interest. Results: In any trial, investigators should: (1) clarify the treatment estimand of interest with respect to the occurrence of the pandemic; (2) establish what data are missing for the chosen estimand; (3) perform the primary analysis under the most plausible missing data assumptions; and (4) perform sensitivity analysis under alternative plausible assumptions. To obtain an estimate of the treatment effect in a 'pandemic-free world', participant data that are clinically affected by the pandemic (directly due to infection or indirectly via treatment disruptions) are not relevant and can be set to missing. For the primary analysis, a missing-at-random assumption that conditions on all observed data that are expected to be associated with both the outcome and missingness may be most plausible. For the treatment effect in the 'world including a pandemic', all participant data are relevant and should be included in the analysis. For the primary analysis, a missing-at-random assumption – potentially incorporating a pandemic time-period indicator and participant infection status – or a missing-not-at-random assumption with a poorer response may be most relevant, depending on the setting. In all scenarios, sensitivity analysis under credible missing-not-at-random assumptions should be used to evaluate the robustness of results. We highlight controlled multiple imputation as an accessible tool for conducting sensitivity analyses. Conclusions: Missing data problems will be exacerbated for trials active during the Covid-19 pandemic. This four-step strategy will facilitate clear thinking about the appropriate analysis for relevant questions of interest.
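Controlled multiple imputation for sensitivity analysis is often implemented as a delta adjustment: impute under MAR, then shift the imputed values by a range of offsets to see when the conclusion changes. The sketch below illustrates that generic idea on simulated data; the variable names, the simple linear model and the chosen deltas are assumptions, not the strategy paper's worked example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 400

# Assumed trial data: continuous outcome with roughly 20% missingness after randomisation.
treat = rng.integers(0, 2, n)
baseline = rng.normal(size=n)
outcome = 0.5 * treat + baseline + rng.normal(size=n)
df = pd.DataFrame({"treat": treat, "baseline": baseline, "outcome": outcome})
df.loc[rng.random(n) < 0.2, "outcome"] = np.nan

# Tipping-point style controlled MI: impute under MAR, then penalise imputed outcomes
# in the active arm by delta (an assumed missing-not-at-random departure).
adjust = df["outcome"].isna() & (df["treat"] == 1)
for delta in [0.0, 0.25, 0.5, 1.0]:
    effects = []
    for m in range(10):
        imputer = IterativeImputer(sample_posterior=True, random_state=m)
        completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        completed.loc[adjust, "outcome"] -= delta
        X = sm.add_constant(completed[["treat", "baseline"]])
        effects.append(sm.OLS(completed["outcome"], X).fit().params["treat"])
    print(f"delta={delta}: mean estimated treatment effect {np.mean(effects):.3f}")
```

Scanning across deltas shows how far the missing-not-at-random departure has to go before the estimated treatment effect changes materially, which is the robustness check the strategy recommends.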


2019 ◽  
Vol 2019 ◽  
pp. 1-10
Author(s):  
Amal Almohisen ◽  
Robin Henderson ◽  
Arwa M. Alshingiti

In any longitudinal study, dropout before the final timepoint can rarely be avoided. The chosen dropout model is commonly one of these types: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR), and Shared Parameter (SP). In this paper we estimate the parameters of the longitudinal model for simulated data and real data using the Linear Mixed Effects (LME) method. We investigate the consequences of misspecifying the missingness mechanism by deriving the so-called least false values. These are the values to which the parameter estimates converge when the assumed missingness mechanism is wrong. Knowledge of the least false values allows us to conduct a sensitivity analysis, which is illustrated. This method provides an alternative to a local misspecification sensitivity procedure, which has been developed for likelihood-based analysis. We compare the results obtained by the proposed method with those found using the local misspecification method. We apply the local misspecification and least false methods to estimate the bias and sensitivity of parameter estimates for a clinical trial example.
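To see empirically what a least false value looks like (a generic simulation sketch under assumed parameters, not the paper's derivation), one can generate longitudinal data with MNAR dropout and fit an LME that implicitly treats the dropout as ignorable; the slope estimate then converges to a least false value rather than the true slope:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_subjects, n_visits = 300, 5

# Simulate longitudinal outcomes with subject-specific intercepts and a common slope of 0.5.
rows = []
for i in range(n_subjects):
    intercept = 2.0 + rng.normal(0, 1)
    for t in range(n_visits):
        y = intercept + 0.5 * t + rng.normal(0, 1)
        # MNAR dropout: the visit whose (unobserved) value triggers dropout is never recorded.
        if t > 0 and y > 4.0:
            break
        rows.append({"id": i, "time": t, "y": y})
observed = pd.DataFrame(rows)

# An LME fitted to the observed data implicitly treats dropout as ignorable (MAR);
# under this MNAR mechanism its estimates converge to "least false" values instead.
fit = sm.MixedLM.from_formula("y ~ time", data=observed, groups=observed["id"]).fit()
print("Estimated slope:", fit.params["time"], "(true slope 0.5)")
```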


2020 ◽  
Author(s):  
Matthew Sperrin ◽  
Glen P. Martin

Abstract Background: Within routinely collected health data, missing data for an individual might provide useful information in itself. This occurs, for example, in the case of electronic health records, where the presence or absence of data is informative. While the naive use of missing indicators to try to exploit such information can introduce bias, its use in conjunction with multiple imputation may unlock the potential value of missingness to reduce bias in causal effect estimation, particularly in missing not at random scenarios and where missingness might be associated with unmeasured confounders. Methods: We conducted a simulation study to determine when the use of a missing indicator, combined with multiple imputation, would reduce bias for causal effect estimation, under a range of scenarios including unmeasured variables, missing not at random, and missing at random mechanisms. We use directed acyclic graphs and structural models to elucidate a variety of causal structures of interest. We handled missing data using complete case analysis, and multiple imputation with and without missing indicator terms. Results: We find that multiple imputation combined with a missing indicator gives minimal bias for causal effect estimation in most scenarios. In particular the approach: (1) does not introduce bias in missing (completely) at random scenarios; (2) reduces bias in missing not at random scenarios where the missing mechanism depends on the missing variable itself; and (3) may reduce or increase bias when unmeasured confounding is present. Conclusion: In the presence of missing data, careful use of missing indicators, combined with multiple imputation, can improve causal effect estimation when missingness is informative, and is not detrimental when missingness is at random.
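One generic way to combine a missing indicator with multiple imputation (a sketch of the idea under an assumed causal structure, not the authors' simulation code; variable names and the imputer are assumptions) is to record the indicator before imputing and then include it in the outcome model alongside the imputed covariate:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
n = 1000

# Assumed causal structure: confounder L (partially missing), exposure A, outcome Y.
L = rng.normal(size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-L)))
Y = 1.0 * A + 1.5 * L + rng.normal(size=n)
df = pd.DataFrame({"A": A, "L": L, "Y": Y})

# Informative missingness: L is more likely to be missing when L itself is large (MNAR-like).
df.loc[rng.random(n) < 1 / (1 + np.exp(-df["L"])), "L"] = np.nan

# Record the missing indicator BEFORE imputation so it can enter the outcome model.
df["R_L"] = df["L"].isna().astype(int)

effects = []
for m in range(20):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = df.copy()
    completed[["A", "L", "Y"]] = imputer.fit_transform(df[["A", "L", "Y"]])
    X = sm.add_constant(completed[["A", "L", "R_L"]])  # missing indicator included
    effects.append(sm.OLS(completed["Y"], X).fit().params["A"])

print("Average estimated effect of A (true value 1.0):", np.mean(effects))
```

Only the point estimates are averaged here; a full analysis would also combine the variances across imputations using Rubin's rules.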


2011 ◽  
Vol 26 (S2) ◽  
pp. 572-572
Author(s):  
N. Resseguier ◽  
H. Verdoux ◽  
F. Clavel-Chapelon ◽  
X. Paoletti

Introduction: The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases. Objectives: To explore reasons for not completing items of the CES-D scale and to perform a sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses. Methods: 71,412 women included in the French E3N cohort returned a questionnaire containing the CES-D scale in 2005. 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete case analysis, single imputation, and multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions. Results: The interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons for nonresponse were identified. The MAR and MNAR hypotheses remained plausible and were explored. Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under the MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, up to 10 and up to 20 missing values, respectively. The estimates were robust after applying various MNAR scenarios in the sensitivity analysis. Conclusions: The CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under the MAR assumption allows missing values to be handled reliably.

