Alternative expectation approaches for expectation-maximization missing data imputations in Cox regression

Author(s):  
Fatih Sağlam ◽  
Tuba Şanlı ◽  
Mehmet Ali Cengiz ◽  
Yüksel Terzi
2012 ◽  
Vol 30 (26) ◽  
pp. 3297-3303 ◽  
Author(s):  
Joseph G. Ibrahim ◽  
Haitao Chu ◽  
Ming-Hui Chen

Missing data are a prevailing problem in any type of data analysis. A participant variable is considered missing if the value of the variable (outcome or covariate) for the participant is not observed. In this article, various issues in analyzing studies with missing data are discussed. In particular, we focus on missing response and/or covariate data for studies with discrete, continuous, or time-to-event end points in which generalized linear models, models for longitudinal data such as generalized linear mixed effects models, or Cox regression models are used. We discuss various classifications of missing data that may arise in a study and demonstrate in several situations that the commonly used method of throwing out all participants with any missing data may lead to incorrect results and conclusions. The methods described are applied to data from an Eastern Cooperative Oncology Group phase II clinical trial of liver cancer and a phase III clinical trial of advanced non–small-cell lung cancer. Although the main area of application discussed here is cancer, the issues and methods we discuss apply to any type of study.
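The claim that discarding participants with missing values can distort results may be illustrated with a toy example. The sketch below is not taken from the article; the data, the noise-free linear relationship, and the helper `fit_line` are all hypothetical. The outcome `y` is made missing whenever the covariate `x` exceeds 60, so the complete cases are not representative of the full sample.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical toy data: y depends on x, and y is unobserved when x > 60,
# so missingness depends only on the fully observed covariate.
xs = list(range(100))
true_ys = [2 * x + 1 for x in xs]
obs_ys = [y if x <= 60 else None for x, y in zip(xs, true_ys)]

# Complete-case analysis: drop every participant with a missing outcome.
complete = [(x, y) for x, y in zip(xs, obs_ys) if y is not None]
cc_mean = mean(y for _, y in complete)

# Regression imputation: fit on complete cases, predict the missing outcomes.
a, b = fit_line([x for x, _ in complete], [y for _, y in complete])
imputed = [y if y is not None else a + b * x for x, y in zip(xs, obs_ys)]
imputed_mean = mean(imputed)

full_mean = mean(true_ys)
print(full_mean, cc_mean, imputed_mean)
```

In this contrived setting the complete-case mean of the outcome falls well below the full-data mean, while imputing from the covariate recovers it. Real data are noisy, and a single regression imputation understates uncertainty, so this is only a demonstration of the bias, not a recommended analysis.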


2018 ◽  
Author(s):  
Seyed Mahmood Taghavi-Shahri ◽  
Alessandro Fassò ◽  
Behzad Mahaki ◽  
Heresh Amini

Land use regression (LUR) has been widely applied in epidemiologic research for exposure assessment. In this study, for the first time, we aimed to develop a spatiotemporal LUR model using Distributed Space Time Expectation Maximization (D-STEM). This spatiotemporal LUR model was examined using daily particulate matter ≤ 2.5 μm (PM2.5) within the megacity of Tehran, the capital of Iran. Moreover, D-STEM missing data imputation was compared with mean substitution in each monitoring station, which is equivalent to ignoring missing data and is common in LUR studies that employ regulatory monitoring stations' data. Missing data amounted to 28% of the total number of observations in Tehran in 2015. The annual mean PM2.5 concentration was 33 μg/m3. The spatiotemporal R-squared of the final D-STEM daily LUR model was 78%, and the leave-one-out cross-validation (LOOCV) R-squared was 66%. Spatial R-squared and LOOCV R-squared were 89% and 72%, respectively. Temporal R-squared and LOOCV R-squared were 99.5% and 99.3%, respectively. Mean absolute error decreased by 26% when missing data were imputed with the final D-STEM LUR model instead of mean substitution. This study demonstrates the competence of the D-STEM software in spatiotemporal missing data imputation, estimation of temporal trend, and mapping of small-scale (20 × 20 meters) within-city spatial variations in the LUR context. The estimated PM2.5 concentration maps could be used in future studies on short- and/or long-term health effects. Overall, we suggest using D-STEM capabilities in the growing number of LUR studies that employ data from regulatory network monitoring stations.

Highlights
- First Land Use Regression using D-STEM, a recently introduced statistical software
- Assess D-STEM in spatiotemporal modeling, mapping, and missing data imputation
- Estimate high resolution (20 × 20 m) daily maps for exposure assessment in a megacity
- Provide both short- and long-term exposure assessment for epidemiological studies
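The mean-substitution baseline and the mean absolute error used to compare imputation methods can be sketched as follows. This is an illustrative sketch only: the station names, the concentration values, and the helpers `impute_station_mean` and `mean_absolute_error` are hypothetical, and the code does not use or reproduce D-STEM.

```python
from statistics import mean

def impute_station_mean(series):
    """Replace each missing value (None) with the mean of that station's observed values."""
    station_mean = mean(v for v in series if v is not None)
    return [station_mean if v is None else v for v in series]

def mean_absolute_error(truth, estimate, held_out):
    """Mean absolute error over the positions that were held out as missing."""
    return mean(abs(t - e) for t, e, m in zip(truth, estimate, held_out) if m)

# Hypothetical daily PM2.5 values (μg/m3) for two stations, with one day per
# station artificially hidden so the imputation error can be measured.
truth = {"A": [30.0, 34.0, 38.0, 42.0], "B": [20.0, 22.0, 27.0, 31.0]}
held_out = {"A": [False, False, True, False], "B": [False, True, False, False]}

observed = {
    s: [None if m else v for v, m in zip(truth[s], held_out[s])] for s in truth
}
imputed = {s: impute_station_mean(observed[s]) for s in observed}
mae = {s: mean_absolute_error(truth[s], imputed[s], held_out[s]) for s in truth}
print(mae)
```

Hiding known values and scoring the imputations against them is one common way to compare imputation methods; a model-based imputation would replace `impute_station_mean` while the error calculation stays the same.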


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Peyman Almasinejad ◽  
Amin Golabpour ◽  
Mohammad Reza Mollakhalili Meybodi ◽  
Kamal Mirzaie ◽  
Ahmad Khosravi

Missing data occur in all research, especially in medical studies. Data are missing when part of the research data has not been reported. This makes the sample incompatible with the population and leads to misguided conclusions. Missing data are common in research, and their extent determines how misinterpreted the conclusions will be. All methods of parameter estimation and prediction models are based on the assumption that the data are complete. Extensive missing data will result in false predictions and increased bias. In the present study, a novel method is proposed for the imputation of missing medical data. The method determines which algorithm is suitable for imputing the missing data. To do so, a multiobjective particle swarm optimization algorithm was used. The algorithm imputes the missing data in such a way that, if a prediction model is applied to the data, both specificity and sensitivity will be optimized. Our proposed model was evaluated using real data of gastric cancer and acute T-cell leukemia (ATLL). First, the model was used to impute the missing data. Then, the missing data were imputed using deletion, average, expectation maximization, MICE, and missForest methods. Finally, the prediction model was applied to both imputed datasets. The accuracy of the prediction model for the first and the second imputation methods was 0.5 and 16.5, respectively. The novel imputation method was more accurate than similar algorithms such as expectation maximization and MICE.
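For readers unfamiliar with the two objectives named above, sensitivity and specificity can be computed from binary labels and predictions as in the sketch below. The function name and the example labels are hypothetical; the code shows only the two metrics, not the particle swarm optimization or the authors' prediction model.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1 = positive, 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and predictions from a model fitted on an imputed dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

A multiobjective search would evaluate candidate imputations by such a pair of scores and keep those for which neither score can be improved without lowering the other.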


2015 ◽  
Vol 62 (2) ◽  
pp. 1231-1240 ◽  
Author(s):  
Kangkang Zhang ◽  
Ruben Gonzalez ◽  
Biao Huang ◽  
Guoli Ji

Author(s):  
NITESH KUMAR ◽  
ONDŘEJ KUŽELKA ◽  
LUC DE RAEDT

Relational autocompletion is the problem of automatically filling out some missing values in multi-relational data. We tackle this problem within the probabilistic logic programming framework of Distributional Clauses (DCs), which supports both discrete and continuous probability distributions. Within this framework, we introduce DiceML, an approach to learn both the structure and the parameters of DC programs from relational data (with possibly missing data). To realize this, DiceML integrates statistical modeling and DCs with rule learning. The distinguishing features of DiceML are that it (1) tackles autocompletion in relational data, (2) learns DCs extended with statistical models, (3) deals with both discrete and continuous distributions, (4) can exploit background knowledge, and (5) uses an expectation–maximization-based (EM) algorithm to cope with missing data. The empirical results show the promise of the approach, even when there is missing data.

