An empirical evaluation of the impact of missing data on treatment effect

Missing not at random (MNAR) modeling for non-ignorable missing responses usually assumes that the latent variable distribution is a bivariate normal distribution. Such an assumption is rarely verified and often employed as a standard in practice. Recent studies for “complete” item responses (i.e., no missing data) have shown that ignoring the nonnormal distribution of a unidimensional latent variable, especially skewed or bimodal, can yield biased estimates and misleading conclusion. However, dealing with the bivariate nonnormal latent variable distribution with present MNAR data has not been looked into. This article proposes to extend unidimensional empirical histogram and Davidian curve methods to simultaneously deal with nonnormal latent variable distribution and MNAR data. A simulation study is carried out to demonstrate the consequence of ignoring bivariate nonnormal distribution on parameter estimates, followed by an empirical analysis of “don’t know” item responses. The results presented in this article show that examining the assumption of bivariate nonnormal latent variable distribution should be considered as a routine for MNAR data to minimize the impact of nonnormality on parameter estimates.

Download Full-text

Estimating and reporting treatment effects in clinical trials for weight management: using estimands to interpret effects of intercurrent events and missing data

International Journal of Obesity ◽

10.1038/s41366-020-00733-x ◽

2021 ◽

Cited By ~ 1

Author(s):

Sean Wharton ◽

Arne Astrup ◽

Lars Endahl ◽

Michael E. J. Lean ◽

Altynai Satylganova ◽

...

Keyword(s):

Clinical Trials ◽

Missing Data ◽

Weight Management ◽

Treatment Effect ◽

Repeated Measures ◽

Treatment Effects ◽

Randomized Clinical Trials ◽

Weight Losses ◽

The Difference ◽

Treatment Of Obesity

AbstractIn the approval process for new weight management therapies, regulators typically require estimates of effect size. Usually, as with other drug evaluations, the placebo-adjusted treatment effect (i.e., the difference between weight losses with pharmacotherapy and placebo, when given as an adjunct to lifestyle intervention) is provided from data in randomized clinical trials (RCTs). At first glance, this may seem appropriate and straightforward. However, weight loss is not a simple direct drug effect, but is also mediated by other factors such as changes in diet and physical activity. Interpreting observed differences between treatment arms in weight management RCTs can be challenging; intercurrent events that occur after treatment initiation may affect the interpretation of results at the end of treatment. Utilizing estimands helps to address these uncertainties and improve transparency in clinical trial reporting by better matching the treatment-effect estimates to the scientific and/or clinical questions of interest. Estimands aim to provide an indication of trial outcomes that might be expected in the same patients under different conditions. This article reviews how intercurrent events during weight management trials can influence placebo-adjusted treatment effects, depending on how they are accounted for and how missing data are handled. The most appropriate method for statistical analysis is also discussed, including assessment of the last observation carried forward approach, and more recent methods, such as multiple imputation and mixed models for repeated measures. The use of each of these approaches, and that of estimands, is discussed in the context of the SCALE phase 3a and 3b RCTs evaluating the effect of liraglutide 3.0 mg for the treatment of obesity.

Download Full-text

Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Applied Sciences ◽

10.3390/app11020796 ◽

2021 ◽

Vol 11 (2) ◽

pp. 796

Author(s):

Alhanoof Althnian ◽

Duaa AlSaeed ◽

Heyam Al-Baity ◽

Amani Samha ◽

Alanoud Bin Dris ◽

...

Keyword(s):

Empirical Evaluation ◽

Classification Performance ◽

Support Vector ◽

Robust Model ◽

Original Distribution ◽

C4.5 Decision Tree ◽

Dataset Size ◽

Overall Performance ◽

Medical Domain ◽

The Impact

Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely-used models in the medical field, including support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), adaboost (AB), and naïve Bayes (NB) on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyze the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depend on how much a dataset represents the original distribution rather than its size. Moreover, we found that the most robust model for limited medical data is AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a robust machine learning model to limited dataset does not necessary imply that it provides the best performance compared to other models.

Download Full-text

A Framework for Analyzing the Impact of Missing Data in Predictive Models

Proceedings of the 29th ACM International Conference on Information & Knowledge Management ◽

10.1145/3340531.3412129 ◽

2020 ◽

Author(s):

Fabiola Santore ◽

Eduardo C. de Almeida ◽

Wagner H. Bonat ◽

Eduardo H. M. Pena ◽

Luiz Eduardo S. de Oliveira

Keyword(s):

Missing Data ◽

Predictive Models ◽

The Impact

Download Full-text

Maximum Likelihood Estimation of the VAR(1) Model Parameters with Missing Observations

Mathematical Problems in Engineering ◽

10.1155/2013/848120 ◽

2013 ◽

Vol 2013 ◽

pp. 1-13 ◽

Cited By ~ 2

Author(s):

Helena Mouriño ◽

Maria Isabel Barão

Keyword(s):

Stochastic Process ◽

Missing Data ◽

Maximum Likelihood ◽

Moving Average ◽

Practical Importance ◽

Likelihood Estimation ◽

Model Parameters ◽

Missing Observations ◽

Data Set ◽

The Impact

Missing-data problems are extremely common in practice. To achieve reliable inferential results, we need to take into account this feature of the data. Suppose that the univariate data set under analysis has missing observations. This paper examines the impact of selecting an auxiliary complete data set—whose underlying stochastic process is to some extent interdependent with the former—to improve the efficiency of the estimators for the relevant parameters of the model. The Vector AutoRegressive (VAR) Model has revealed to be an extremely useful tool in capturing the dynamics of bivariate time series. We propose maximum likelihood estimators for the parameters of the VAR(1) Model based on monotone missing data pattern. Estimators’ precision is also derived. Afterwards, we compare the bivariate modelling scheme with its univariate counterpart. More precisely, the univariate data set with missing observations will be modelled by an AutoRegressive Moving Average (ARMA(2,1)) Model. We will also analyse the behaviour of the AutoRegressive Model of order one, AR(1), due to its practical importance. We focus on the mean value of the main stochastic process. By simulation studies, we conclude that the estimator based on the VAR(1) Model is preferable to those derived from the univariate context.

Download Full-text

When does the use of individual patient data in network meta-analysis make a difference? A simulation study

BMC Medical Research Methodology ◽

10.1186/s12874-020-01198-2 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Steve Kanters ◽

Mohammad Ehsanul Karim ◽

Kristian Thorlund ◽

Aslam Anis ◽

Nick Bansback

Keyword(s):

Simulation Study ◽

Treatment Effect ◽

Mean Squared Error ◽

Individual Patient Data ◽

Meta Analysis ◽

Model Performance ◽

Patient Data ◽

Meta Regression ◽

Treatment Comparisons ◽

The Impact

Abstract Background The use of individual patient data (IPD) in network meta-analyses (NMA) is rapidly growing. This study aimed to determine, through simulations, the impact of select factors on the validity and precision of NMA estimates when combining IPD and aggregate data (AgD) relative to using AgD only. Methods Three analysis strategies were compared via simulations: 1) AgD NMA without adjustments (AgD-NMA); 2) AgD NMA with meta-regression (AgD-NMA-MR); and 3) IPD-AgD NMA with meta-regression (IPD-NMA). We compared 108 parameter permutations: number of network nodes (3, 5 or 10); proportion of treatment comparisons informed by IPD (low, medium or high); equal size trials (2-armed with 200 patients per arm) or larger IPD trials (500 patients per arm); sparse or well-populated networks; and type of effect-modification (none, constant across treatment comparisons, or exchangeable). Data were generated over 200 simulations for each combination of parameters, each using linear regression with Normal distributions. To assess model performance and estimate validity, the mean squared error (MSE) and bias of treatment-effect and covariate estimates were collected. Standard errors (SE) and percentiles were used to compare estimate precision. Results Overall, IPD-NMA performed best in terms of validity and precision. The median MSE was lower in the IPD-NMA in 88 of 108 scenarios (similar results otherwise). On average, the IPD-NMA median MSE was 0.54 times the median using AgD-NMA-MR. Similarly, the SEs of the IPD-NMA treatment-effect estimates were 1/5 the size of AgD-NMA-MR SEs. The magnitude of superior validity and precision of using IPD-NMA varied across scenarios and was associated with the amount of IPD. Using IPD in small or sparse networks consistently led to improved validity and precision; however, in large/dense networks IPD tended to have negligible impact if too few IPD were included. Similar results also apply to the meta-regression coefficient estimates. Conclusions Our simulation study suggests that the use of IPD in NMA will considerably improve the validity and precision of estimates of treatment effect and regression coefficients in the most NMA IPD data-scenarios. However, IPD may not add meaningful validity and precision to NMAs of large and dense treatment networks when negligible IPD are used.

Download Full-text

The impact of auction choice on revenue in treasury bill auctions – An empirical evaluation

International Journal of Industrial Organization ◽

10.1016/j.ijindorg.2017.05.005 ◽

2017 ◽

Vol 53 ◽

pp. 215-239 ◽

Cited By ~ 6

Author(s):

Daniel Marszalec

Keyword(s):

Empirical Evaluation ◽

Treasury Bill ◽

The Impact

Download Full-text

No evidence that frailty modifies the positive impact of antihypertensive treatment in very elderly people: an investigation of the impact of frailty upon treatment effect in the HYpertension in the Very Elderly Trial (HYVET) study, a double-blind, placebo-controlled study of antihypertensives in people with hypertension aged 80 and over

BMC Medicine ◽

10.1186/s12916-015-0328-1 ◽

2015 ◽

Vol 13 (1) ◽

Cited By ~ 133

Author(s):

Jane Warwick ◽

Emanuela Falaschetti ◽

Kenneth Rockwood ◽

Arnold Mitnitski ◽

Lutgarde Thijs ◽

...

Keyword(s):

Treatment Effect ◽

Antihypertensive Treatment ◽

Positive Impact ◽

Controlled Study ◽

80 And Over ◽

Very Elderly ◽

Aged 80 And Over ◽

Double Blind ◽

Double Blind Placebo ◽

The Impact

Download Full-text

Impact of strong El Niño events on river discharge in South America

10.5194/egusphere-egu21-10383 ◽

2021 ◽

Author(s):

Markus Deppner ◽

Bedartha Goswami

Keyword(s):

Machine Learning ◽

Missing Data ◽

South America ◽

River Discharge ◽

Amazon Basin ◽

Missing Values ◽

Southern Oscillation ◽

Enso Events ◽

Streamflow Data ◽

The Impact

<p>The impact of the El Ni&#241;o Southern Oscillation (ENSO) on rivers are well known, but most existing studies involving streamflow data are severely limited by data coverage. Time series of gauging stations fade in and out over time, which makes hydrological large scale and long time analysis or studies of rarely occurring extreme events challenging. Here, we use a machine learning approach to infer missing streamflow data based on temporal correlations of stations with missing values to others with data. By using 346 stations, from the &#8220;Global Streamflow Indices and Metadata archive&#8221; (GSIM), that initially cover the 40 year timespan in conjunction with Gaussian processes we were able to extend our data by estimating missing data for an additional 646 stations, allowing us to include a total of 992 stations. We then investigate the impact of the 6 strongest El Ni&#241;o (EN) events on rivers in South America between 1960 and 2000. Our analysis shows a strong correlation between ENSO events and extreme river dynamics in the southeast of Brazil, Carribean South America and parts of the Amazon basin. Furthermore we see a peak in the number of stations showing maximum river discharge all over Brazil during the EN of 1982/83 which has been linked to severe floods in the east of Brazil, parts of Uruguay and Paraguay. However EN events in other years with similar intensity did not evoke floods with such magnitude and therefore the additional drivers of the 1982/83&#160; floods need further investigation. By using machine learning methods to infer data for gauging stations with missing data we were able to extend our data by almost three-fold, revealing a possible heavier and spatially larger impact of the 1982/83 EN on South America's hydrology than indicated in literature.</p>

Download Full-text

Bounding Treatment Effects with Contaminated and Censored Data: Assessing the Impact of Early Childbearing on Children

The B E Journal of Economic Analysis & Policy ◽

10.1515/1538-0637.1119 ◽

2005 ◽

Vol 5 (1) ◽

Author(s):

Charles H Mullin

Keyword(s):

Censored Data ◽

Treatment Effect ◽

Instrumental Variable ◽

Treatment Effects ◽

Child Outcomes ◽

Well Being ◽

Teenage Mothers ◽

Age At First Birth ◽

First Birth ◽

The Impact

AbstractEmpirical researchers commonly invoke instrumental variable (IV) assumptions to identify treatment effects. This paper considers what can be learned under two specific violations of those assumptions: contaminated and corrupted data. Either of these violations prevents point identification, but sharp bounds of the treatment effect remain feasible. In an applied example, random miscarriages are an IV for women’s age at first birth. However, the inability to separate random miscarriages from behaviorally induced miscarriages (those caused by smoking and drinking) results in a contaminated sample. Furthermore, censored child outcomes produce a corrupted sample. Despite these limitations, the bounds demonstrate that delaying the age at first birth for the current population of non-black teenage mothers reduces their first-born child’s well-being.

Download Full-text