A selection modelling approach to analysing missing data of liver cirrhosis patients

2016 ◽  
Vol 53 (2) ◽  
pp. 83-103 ◽  
Author(s):  
Dilip C. Nath ◽  
Ramesh K. Vishwakarma ◽  
Atanu Bhattacharjee

Abstract: Methods for dealing with missing data in clinical trials have received increased attention from regulators and practitioners in the pharmaceutical industry over the last few years. Consideration of missing data in a study is important, as they can lead to substantial bias and reduce overall statistical power. This problem may be caused by patients dropping out before completion of the study. The new guidelines of the International Conference on Harmonization place great emphasis on the importance of carefully choosing primary analysis methods based on clearly formulated assumptions regarding the missingness mechanism. The reason for dropout or withdrawal may be either related to the trial (e.g. adverse event, death, unpleasant study procedures, lack of improvement) or unrelated to the trial (e.g. moving away, unrelated disease). We applied selection models to liver cirrhosis patient data to analyse treatment efficacy, comparing surgery combined with HFLPC (human fetal liver progenitor cell) infusion in patients consenting to participate against surgery alone. We found that comparing treatment conditions while ignoring missing values potentially leads to biased conclusions.
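For readers unfamiliar with selection models, the sketch below illustrates the general idea with a Heckman-type two-step correction on simulated data; it is only a minimal illustration, not the model fitted in the study, and all variable names and data are hypothetical.

```python
# Two-step (Heckman-type) selection-model sketch for an outcome that is
# missing not at random. Toy data; variable names are hypothetical and do
# not come from the cirrhosis study.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                                   # baseline covariate
treat = rng.integers(0, 2, size=n)                       # 1 = surgery + infusion, 0 = surgery alone
y = 1.0 + 0.5 * treat + 0.8 * x + rng.normal(size=n)     # outcome, partly unobserved
observed = (0.3 * y + rng.normal(size=n)) > 0            # dropout depends on the outcome itself

# Step 1: probit model for the probability of being observed, then the inverse Mills ratio
Z = sm.add_constant(np.column_stack([x, treat]))
probit_res = sm.Probit(observed.astype(int), Z).fit(disp=False)
lin = Z @ probit_res.params
inv_mills = norm.pdf(lin) / norm.cdf(lin)

# Step 2: outcome regression on observed cases with the selection correction term
X_corr = sm.add_constant(np.column_stack([x, treat, inv_mills]))
ols_corrected = sm.OLS(y[observed], X_corr[observed]).fit()

# Naive complete-case analysis that ignores the missingness mechanism
X_naive = sm.add_constant(np.column_stack([x, treat]))
ols_naive = sm.OLS(y[observed], X_naive[observed]).fit()

print("selection-corrected treatment effect:", round(ols_corrected.params[2], 3))
print("naive complete-case treatment effect:", round(ols_naive.params[2], 3))
```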

Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the available software, using these modern missing-data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article, the first of two parts, begins by introducing Rubin’s classical definition of missing-data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. It then presents a selection of visualization tools available in different R packages for describing and exploring missing-data structures.
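As a language-agnostic companion to the mechanisms discussed above (the article itself works with R packages), the following toy Python sketch generates MCAR, MAR and MNAR missingness in a simulated dataset; all names and parameters are made up.

```python
# Toy illustration of Rubin's missing-data mechanisms. The simulated
# variables are hypothetical and only mirror the definitions cited above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
income = rng.normal(50, 10, n)                              # fully observed auxiliary variable
satisfaction = 3 + 0.05 * income + rng.normal(0, 1, n)      # variable that will get missing values
df = pd.DataFrame({"income": income, "satisfaction": satisfaction})

# MCAR: missingness is independent of everything
mcar = df.copy()
mcar.loc[rng.random(n) < 0.3, "satisfaction"] = np.nan

# MAR: missingness depends only on the observed variable (income)
mar = df.copy()
p_mar = 1 / (1 + np.exp(-(income - 50) / 5))
mar.loc[rng.random(n) < p_mar, "satisfaction"] = np.nan

# MNAR: missingness depends on the (partially unobserved) value itself
mnar = df.copy()
p_mnar = 1 / (1 + np.exp(-(satisfaction - satisfaction.mean())))
mnar.loc[rng.random(n) < p_mnar, "satisfaction"] = np.nan

for name, d in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "mean of observed satisfaction:", d["satisfaction"].mean().round(2))
```

In this simulation the observed-case mean stays close to the full-data mean under MCAR but is visibly shifted under MAR and MNAR, which is why listwise deletion can distort estimates.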


2018 ◽  
Author(s):  
Kieu Trinh Do ◽  
Simone Wahl ◽  
Johannes Raffler ◽  
Sophie Molnos ◽  
Michael Laimighofer ◽  
...  

Abstract
BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, a systematic assessment of the various sources of missing values, and of strategies to handle them, has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation.
METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment on serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks and to increase statistical power while preserving the strength of established genetic associations with metabolite levels (metabolic quantitative trait loci).
RESULTS: Run day-dependent, LOD-based missingness accounted for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.
CONCLUSION: Missing data in untargeted MS-based metabolomics occur for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.
Key messages: Untargeted MS-based metabolomics data show missing values due to both batch-specific LOD-based and non-LOD-based effects. Statistical evaluation of multiple imputation methods was conducted on both simulated and real datasets. Biological evaluation on real data assessed the ability of imputation methods to preserve statistical inference of biochemical pathways and to correctly estimate the effects of genetic variants on metabolite levels. KNN-based imputation on observations with variable pre-selection and K = 10 showed robust performance for all data scenarios across all evaluation schemes.
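A minimal sketch of the recommended strategy, KNN imputation on observations with correlation-based variable pre-selection and k = 10, is given below; the pre-selection rule, the number of pre-selected variables and the toy data are assumptions for illustration, not the authors' exact implementation.

```python
# KNN imputation on observations with correlation-based variable pre-selection.
# The threshold (top 20 correlated metabolites) and the random data are assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute_with_preselection(data: pd.DataFrame, k: int = 10, top_m: int = 20) -> pd.DataFrame:
    imputed = data.copy()
    corr = data.corr().abs()                       # pairwise correlations on observed values
    for col in data.columns[data.isna().any()]:
        # pre-select the variables most correlated with the target column
        partners = corr[col].drop(col).nlargest(top_m).index.tolist()
        block = data[[col] + partners]
        filled = KNNImputer(n_neighbors=k).fit_transform(block)
        imputed[col] = filled[:, 0]                # target column is first in the block
    return imputed

# Random matrix standing in for log-scaled metabolite abundances
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 50)), columns=[f"met_{i}" for i in range(50)])
X_missing = X.mask(rng.random(X.shape) < 0.1)
X_imputed = knn_impute_with_preselection(X_missing)
print("remaining NaNs:", X_imputed.isna().sum().sum())
```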


2019 ◽  
Vol 3 (Supplement_1) ◽  
pp. S972-S972
Author(s):  
Chen Kan ◽  
Won Hwa Kim ◽  
Ling Xu ◽  
Noelle L Fields

Abstract Background: Questionnaires are widely used to evaluate cognitive function, depression, and loneliness in persons with dementia (PWDs). Successful assessment and treatment of dementia hinge on effective analysis of PWDs’ answers. However, many studies, especially pilot studies, have small sample sizes. Further, most of them contain missing data because PWDs skip some study sessions due to their clinical condition. Conventional imputation strategies are not well suited, as the insufficient number of samples introduces bias. Method: A novel machine learning framework was developed, based on harmonic analysis on graphs, to robustly handle missing values. Participants were first embedded as nodes in a graph, with edges derived from their similarities based on demographic information, activities of daily living, etc. Questionnaire scores with missing values were then regarded as a function on the nodes and estimated through spectral analysis of the graph with a smoothness constraint. The proposed approach was evaluated using data from our pilot study of dementia subjects (N=15) with 15% of the data missing. Result: A few complete variables (binary or ordinal) were available for all participants. For each such variable, we randomly removed 5 scores to mimic missing values. With our approach, we could recover the removed values with 90% accuracy on average. We were also able to impute the actual missing values in the dataset within reasonable ranges. Conclusion: Our proposed approach imputes missing values with high accuracy despite the small sample size, and it can significantly boost the statistical power of small-scale studies with missing data.
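The harmonic-analysis idea can be illustrated compactly: treat the questionnaire score as a signal on a participant-similarity graph and fill in missing nodes by minimizing a Laplacian smoothness term. The sketch below is a simplified stand-in for the proposed framework; the Gaussian similarity kernel and the toy data are assumptions.

```python
# Graph-signal imputation with a smoothness (Laplacian) constraint, in the
# spirit of the harmonic-analysis framework described above.
import numpy as np

rng = np.random.default_rng(3)
n = 15
features = rng.normal(size=(n, 4))            # demographics / activities of daily living
score = features @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(0, 0.2, n)

# Similarity graph between participants and its combinatorial Laplacian
d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Hide a few scores, then recover them by harmonic interpolation:
# minimise x^T L x subject to x matching the observed scores.
missing = rng.choice(n, size=4, replace=False)
observed = np.setdiff1d(np.arange(n), missing)
recovered = np.linalg.solve(L[np.ix_(missing, missing)],
                            -L[np.ix_(missing, observed)] @ score[observed])

print("true values:     ", np.round(score[missing], 2))
print("recovered values:", np.round(recovered, 2))
```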


2021 ◽  
Author(s):  
Zeeshan Hamid ◽  
Kip D. Zimmerman ◽  
Hector Guillen-Ahlers ◽  
Cun Li ◽  
Peter Nathanielsz ◽  
...  

Introduction: Reliable and effective label-free quantification (LFQ) analyses depend not only on the method of data acquisition in the mass spectrometer, but also on the downstream data processing, including software tools, query database, data normalization and imputation. In non-human primates (NHP), LFQ is challenging because the query databases for NHP are limited, since the genomes of these species are not comprehensively annotated. This invariably results in limited discovery of proteins and associated post-translational modifications (PTMs) and a higher fraction of missing data points. While identification of fewer proteins and PTMs due to database limitations can negatively impact the discovery of important and meaningful biological information, missing data also limit downstream analyses (e.g., multivariate analyses), decrease statistical power, bias statistical inference, and make biological interpretation of the data more challenging. In this study we attempted to address both issues: first, we used the MetaMorpheus proteomics search engine to counter the limits of NHP query databases and maximize the discovery of proteins and associated PTMs, and second, we evaluated different imputation methods for accurate data inference.
Results: Using the MetaMorpheus proteomics search engine we obtained quantitative data for 1,622 proteins and 10,634 peptides, including 58 different PTMs (biological, metal and artifacts), across a diverse age range of NHP brain frontal cortex. However, among the 1,622 proteins identified, only 293 proteins were quantified across all samples with no missing values, emphasizing the importance of implementing an accurate and statistically valid imputation method to fill in missing data. In our imputation analysis we demonstrate that single-imputation methods that borrow information from correlated proteins, such as generalized ridge regression (GRR), random forest (RF), local least squares (LLS), and Bayesian principal component analysis (BPCA), are able to estimate missing protein abundance values with great accuracy.
Conclusions: Overall, this study offers a detailed comparative analysis of LFQ data generated in NHP and proposes strategies for improved LFQ in NHP proteomics data.
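To make the imputation comparison concrete, the sketch below masks a fraction of a simulated protein-abundance matrix and scores a few single-imputation methods by NRMSE; the ridge-based iterative imputer is only a stand-in for the correlated-protein methods named above (GRR, LLS, BPCA), not the authors' pipeline.

```python
# Mask-and-score comparison of single-imputation methods on a simulated
# protein-abundance matrix with correlated columns.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(4)
n_samples, n_proteins = 60, 100
latent = rng.normal(size=(n_samples, 5))                  # shared "biological" factors
X = latent @ rng.normal(size=(5, n_proteins)) + rng.normal(0, 0.3, (n_samples, n_proteins))

mask = rng.random(X.shape) < 0.10                         # hold out 10% of the entries
X_missing = np.where(mask, np.nan, X)

methods = {
    "mean": SimpleImputer(strategy="mean"),
    "knn (k=10)": KNNImputer(n_neighbors=10),
    "iterative ridge": IterativeImputer(max_iter=10, random_state=0),
}
for name, imputer in methods.items():
    X_hat = imputer.fit_transform(X_missing)
    nrmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2)) / X[mask].std()
    print(f"{name:16s} NRMSE = {nrmse:.3f}")
```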


2021 ◽  
Author(s):  
Corinne Jamoul ◽  
Laurence Collette ◽  
Elisabeth Coart ◽  
Koenraad D’Hollander ◽  
Tomasz Burzykowski ◽  
...  

Abstract Missing data may lead to a loss of statistical power and introduce bias in clinical trials. The ongoing Covid-19 pandemic has had a profound impact on patient health care and on the conduct of cancer clinical trials. Restricted access to sites, medication and evaluations brings challenges to the analysis of clinical trials due to missing data. Although several endpoints may be affected, progression-free survival (PFS) is of major concern, given its frequent use as primary endpoint in advanced cancer and the fact that missed radiographic assessments are to be expected. If patients with progression have a delayed radiographic assessment due to the pandemic, there is controversy over whether to censor at the last visit prior to a shutdown period or to ascribe the progression date to the day the assessment is eventually performed after the end of the shutdown. The recent introduction of the estimand framework creates an opportunity to define more precisely the target of estimation and to ensure alignment between the scientific question and the statistical analysis. Two basic approaches can be considered for handling tumor scans missed because of the pandemic: a “treatment policy” strategy, which consists of ascribing events to the time they are observed, and a “hypothetical” approach of censoring patients with events during the shutdown period at the last assessment prior to that period. In this article, we show through simulations how these two approaches may affect the overall power of a study and bias the estimated treatment effect and the median PFS estimates. As a general rule, we suggest that the treatment policy approach, which conforms with the intent-to-treat principle, should be the primary analysis in order to avoid unnecessary loss of power and minimize bias in median PFS estimates.
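The contrast between the two strategies can be reproduced with a small simulation. The sketch below is not the article's simulation study; the visit schedule, shutdown window and event-time distribution are illustrative assumptions.

```python
# Toy comparison of the "treatment policy" and "hypothetical" strategies for
# tumour scans missed during a shutdown period.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(5)
n = 400
true_pfs = rng.exponential(scale=12.0, size=n)           # true progression times (months)
visit_grid = np.arange(2, 37, 2)                         # scans every 2 months
shutdown = (6.0, 10.0)                                   # no scans during months 6-10

def first_scan(t, grid):
    later = grid[grid >= t]
    return later[0] if later.size else np.inf

normal_detect = np.array([first_scan(t, visit_grid) for t in true_pfs])
open_grid = visit_grid[(visit_grid < shutdown[0]) | (visit_grid > shutdown[1])]
actual_detect = np.array([first_scan(t, open_grid) for t in true_pfs])
delayed = (normal_detect >= shutdown[0]) & (normal_detect <= shutdown[1])

# Treatment policy: ascribe the event to the (possibly delayed) scan date
tp_time = np.where(np.isfinite(actual_detect), actual_detect, visit_grid[-1])
tp_event = np.isfinite(actual_detect)

# Hypothetical: censor delayed events at the last scan before the shutdown
last_before = open_grid[open_grid < shutdown[0]].max()
hy_time = np.where(delayed, last_before, tp_time)
hy_event = np.where(delayed, False, tp_event)

for label, t, e in [("treatment policy", tp_time, tp_event), ("hypothetical", hy_time, hy_event)]:
    km = KaplanMeierFitter().fit(t, e)
    print(f"{label:17s} median PFS = {km.median_survival_time_:.1f} months")
```

Because the hypothetical approach turns delayed events into early censorings, it discards events, which is the source of the loss of power and the potential bias in median PFS discussed above.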


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rahi Jain ◽  
Wei Xu

Abstract Background: Developing statistical and machine learning methods for studies with missing information is a ubiquitous challenge in real-world biological research. The strategies in the literature rely on either removing the samples with missing values, as in complete case analysis (CCA), or imputing the missing information, as in predictive mean matching (PMM) implemented, for example, in MICE. Limitations of these strategies include information loss and uncertainty about how closely the imputed values approximate the true missing values. Further, when medical data arrive piecemeal, these strategies must wait for the data collection process to finish before a complete dataset is available for statistical modelling. Method and results: This study proposes a dynamic model updating (DMU) approach, a different strategy for developing statistical models with missing data. DMU uses only the information available in the dataset to build the statistical models: it segments the original dataset into small complete datasets using hierarchical clustering and then fits a Bayesian regression on each of them. Predictor estimates are updated using the posterior estimates from each dataset. The performance of DMU is evaluated using both simulated data and real studies, and it performs better than, or on par with, approaches such as CCA and PMM. Conclusion: The DMU approach provides an alternative to the existing strategies of information elimination and imputation when processing datasets with missing values. While the study applied the approach to continuous cross-sectional data, it can be extended to longitudinal, categorical and time-to-event biological data.
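As a loose sketch of the updating idea (not the authors' algorithm), the code below fits a Bayesian regression on each small complete dataset as it arrives and pools the coefficient posteriors by precision weighting; the hierarchical clustering step and the exact updating rule are omitted, and everything is simulated.

```python
# Dynamic-updating sketch: fit a Bayesian regression on each small complete
# dataset and pool the coefficient posteriors by precision weighting.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(6)
true_beta = np.array([1.5, -2.0, 0.5])

def pool(means, variances):
    """Precision-weighted combination of per-dataset posterior coefficient means."""
    w = 1.0 / np.asarray(variances)
    return (w * np.asarray(means)).sum(axis=0) / w.sum(axis=0)

means, variances = [], []
for batch in range(5):                          # five small complete datasets arriving over time
    X = rng.normal(size=(40, 3))
    y = X @ true_beta + rng.normal(0, 1, 40)
    fit = BayesianRidge().fit(X, y)
    means.append(fit.coef_)
    variances.append(np.diag(fit.sigma_))       # posterior variances of the coefficients
    print(f"after batch {batch + 1}: pooled estimate = {np.round(pool(means, variances), 2)}")
```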


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in a data set measuring air quality, using a multiple imputation (MI) approach. Missingness under the MCAR, MAR, and NMAR mechanisms is applied to the data set, at five missing-data levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method related to the random forest approach. Air quality data sets were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that imputation under the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach achieves a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods considered and can thus be considered appropriate for analyzing air quality data.
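For illustration, a missForest-style imputation can be approximated in Python with an iterative imputer that uses random forests as the per-variable learner; the column names, the simulated relationships and the 20% missing rate below are assumptions, not the missForest implementation used in the paper.

```python
# missForest-style imputation via an iterative imputer with random-forest learners.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 365
temp = rng.normal(30, 5, n)
df = pd.DataFrame({
    "air_temp": temp,                                   # control variable (complete)
    "rel_humidity": rng.uniform(20, 90, n),             # control variable (complete)
    "log_NO2": 0.05 * temp + rng.normal(0, 0.3, n),     # pollutant, log scale
    "log_PM10": 0.02 * temp + rng.normal(4, 0.5, n),    # pollutant, log scale
})

df_missing = df.copy()
for col in ["log_NO2", "log_PM10"]:
    df_missing.loc[rng.random(n) < 0.2, col] = np.nan   # ~20% missing per pollutant

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=5, random_state=0,
)
df_imputed = pd.DataFrame(imputer.fit_transform(df_missing), columns=df.columns)

holdout = df_missing.isna()
rmse = np.sqrt(((df_imputed - df).where(holdout) ** 2).mean())
print(rmse[["log_NO2", "log_PM10"]].round(3))           # imputation RMSE per pollutant
```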


Author(s):  
Maria Lucia Parrella ◽  
Giuseppina Albano ◽  
Cira Perna ◽  
Michele La Rocca

Abstract: Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection and spatio-temporal relationships. To account for the uncertainty in point forecasts, prediction intervals may be of interest. In particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain missing paths of a time component in a spatio-temporal framework, with global probability $1-\alpha$. In many applications, requiring coverage of the whole missing sample path might appear too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that are only required to contain all but a small number k of the elements of the missing paths with probability $1-\alpha$. A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction regions and to compare them with some alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios.
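The calibration of a joint prediction region from bootstrap replicates, and its comparison with a Bonferroni-style pointwise correction, can be sketched as follows; the replicates are simulated directly here rather than produced by the proposed resampling scheme, so this is only an illustration of the coverage-calibration step.

```python
# Calibrating a joint (simultaneous) prediction region for a missing path from
# bootstrap replicates, with a Bonferroni-style pointwise band for comparison.
import numpy as np

rng = np.random.default_rng(8)
B, k, alpha = 2000, 6, 0.10                              # replicates, path length, 1 - coverage
paths = np.cumsum(rng.normal(0, 1, (B, k)), axis=1)      # stand-in bootstrap paths

center = paths.mean(axis=0)
scale = paths.std(axis=0)

# Calibrate c so that (1 - alpha) of bootstrap paths lie entirely in center +/- c * scale
max_dev = np.abs((paths - center) / scale).max(axis=1)
c_joint = np.quantile(max_dev, 1 - alpha)
joint_lo, joint_hi = center - c_joint * scale, center + c_joint * scale

# Bonferroni-style band: pointwise quantiles at level alpha / k
bon_lo = np.quantile(paths, alpha / (2 * k), axis=0)
bon_hi = np.quantile(paths, 1 - alpha / (2 * k), axis=0)

def joint_coverage(lo, hi):
    return np.mean(((paths >= lo) & (paths <= hi)).all(axis=1))

print("joint coverage of calibrated band :", joint_coverage(joint_lo, joint_hi))
print("joint coverage of Bonferroni band :", joint_coverage(bon_lo, bon_hi))
```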


Agriculture ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 727
Author(s):  
Yingpeng Fu ◽  
Hongjian Liao ◽  
Longlong Lv

UNSODA, a free international soil database, is very popular and has been used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean imputation and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) of the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased significantly after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in bulk density (BD) using RF, SVR, ANN, mean, and MI imputation, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better suited for imputing the missing data in UNSODA.
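The quantitative part of such an evaluation, masking known values at several rates and scoring each imputer by RMSE and MAE, can be sketched as follows; the soil-property columns, the imputers and the missing rates are illustrative assumptions, not the UNSODA workflow itself.

```python
# Missing-value simulation for imputation-method evaluation: mask a complete
# block of soil properties at several rates and score each imputer.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(9)
n = 300
complete = pd.DataFrame({
    "pH": rng.normal(6.5, 0.8, n),
    "OMC": rng.lognormal(0.5, 0.4, n),
    "PO": rng.uniform(0.3, 0.6, n),
    "PD": rng.normal(2.65, 0.05, n),
})

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "RF": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        max_iter=5, random_state=0),
}

for rate in (0.05, 0.10, 0.20, 0.30, 0.40):
    mask = rng.random(complete.shape) < rate
    holey = complete.mask(mask)
    for name, imp in imputers.items():
        filled = pd.DataFrame(imp.fit_transform(holey), columns=complete.columns)
        err = (filled - complete).where(mask)            # errors only at masked cells
        rmse = float(np.sqrt((err ** 2).mean().mean()))
        mae = float(err.abs().mean().mean())
        print(f"missing rate {rate:.0%}  {name:4s}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```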


Author(s):  
Michiel J. van Esdonk ◽  
Jasper Stevens

Abstract: The quantitative description of individual observations in non-linear mixed effects models over time is complicated when the studied biomarker has a pulsatile release (e.g. insulin, growth hormone, luteinizing hormone). Unfortunately, standard non-linear mixed effects population pharmacodynamic models such as turnover and precursor response models (with or without a cosinor component) are unable to quantify these complex secretion profiles over time. In this study, the statistical power of standard methodology, such as six post-dose measurements or the area under the curve from 0 to 12 h post-dose, applied to simulated dense concentration–time profiles of growth hormone, was compared with a deconvolution-analysis-informed modelling approach in different simulated scenarios. The statistical power of the deconvolution-analysis-informed approach was determined with a Monte-Carlo Mapped Power analysis. Due to the high level of intra- and inter-individual variability in growth hormone concentrations over time, regardless of the simulated effect size, only the deconvolution-analysis-informed approach reached a statistical power of more than 80% with a sample size of less than 200 subjects per cohort. Furthermore, this deconvolution-analysis-informed modelling approach improved the description of the observations at the individual level and enabled the quantification of a drug effect for use in subsequent clinical trial simulations.
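Monte-Carlo power estimation in general follows a simple recipe: simulate many trials at a fixed effect size and sample size, analyse each, and count the fraction of significant results. The sketch below uses a plain two-sample t-test as the analysis step, so it is only a simplified stand-in for the Monte-Carlo Mapped Power analysis applied to the non-linear mixed effects models above.

```python
# Generic Monte-Carlo power estimation: simulate trials, test each, count significant results.
import numpy as np
from scipy import stats

def mc_power(n_per_arm, effect, sd, n_sim=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        placebo = rng.normal(0.0, sd, n_per_arm)
        active = rng.normal(effect, sd, n_per_arm)
        _, p = stats.ttest_ind(active, placebo)
        hits += p < alpha
    return hits / n_sim

# Large between-subject variability (sd) relative to the effect keeps power low
# unless the sample size is large, mirroring the qualitative point made above.
for n in (50, 100, 200):
    print(n, "subjects per arm -> power", round(mc_power(n, effect=0.3, sd=1.0), 2))
```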

