The Fitting Optimization Path Analysis on Scale Missing Data: Based on the 507 Patients of Poststroke Depression Measured by SDS

2022 ◽  
Vol 2022 ◽  
pp. 1-8
Author(s):  
Xiaoying Lv ◽  
Ruonan Zhao ◽  
Tongsheng Su ◽  
Liyun He ◽  
Rui Song ◽  
...  

Objective. To explore the optimal fitting path for missing scale data so that the fitted data come close to the real situation of the patients' data. Methods. Based on the complete data set of the SDS of 507 patients with stroke, simulation data sets were constructed with R software under the three missing mechanisms, Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), each with missing rates of 5%, 10%, 15%, 20%, 25%, 30%, 35%, and 40%. Mean substitution (MS), random forest regression (RFR), and predictive mean matching (PMM) were used to fit the data. Root mean square error (RMSE), the width of the 95% confidence interval (95% CI), and the Spearman correlation coefficient (SCC) were used to evaluate the fitting effect and determine the optimal fitting path. Results. When dealing with missing data in scales, the optimal fitting path is as follows. ① Under the MCAR mechanism, when the missing proportion is less than 20%, the MS method is the most convenient; when it is greater than 20%, the RFR algorithm is the best fitting method. ② Under the MAR mechanism, when the missing proportion is less than 35%, the MS method is the most convenient; when it is greater than 35%, RFR preserves correlations better. ③ Under the MNAR mechanism, RFR is the best fitting method, especially when the missing proportion is greater than 30%. In practice, complete-case deletion is the most commonly used method when the missing proportion is small, but the RFR algorithm can greatly expand the usable sample and save clinical research costs when the missing proportion is less than 30%. The best way to handle missing data should be chosen according to the missing mechanism and proportion of the actual data, balancing the statistical capability of the research team, the effectiveness of the method, and the readers' ease of understanding.
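The study's comparison was done in R on the actual SDS data; the sketch below is only a Python analogue on synthetic Likert-style item scores, showing the evaluation loop the abstract describes: mask cells completely at random at a given rate, fit with mean substitution and a (simplified, one-pass) random-forest regression, and score the fits by RMSE and Spearman correlation on the masked cells. All names and the data-generating process are assumptions, not the authors' code.

```python
# Illustrative sketch only; synthetic data standing in for the SDS scores.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, n_items = 507, 20
latent = rng.normal(size=n)
# Correlated 1-4 item scores, loosely mimicking a depression scale.
X = np.clip(np.round(2.5 + latent[:, None]
                     + rng.normal(scale=0.8, size=(n, n_items))), 1, 4)
complete = pd.DataFrame(X, columns=[f"item{i}" for i in range(n_items)])

def mcar_mask(df, rate, rng):
    """Delete each cell independently with probability `rate` (MCAR)."""
    return df.mask(rng.random(df.shape) < rate)

def rf_impute(df):
    """One random forest per column, predictors crudely mean-filled
    (a simplification of iterative RF imputation such as missForest)."""
    out = df.copy()
    for col in df.columns:
        miss = df[col].isna()
        if not miss.any():
            continue
        predictors = df.drop(columns=col).fillna(df.mean())
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(predictors[~miss], df.loc[~miss, col])
        out.loc[miss, col] = rf.predict(predictors[miss])
    return out

for rate in (0.05, 0.20, 0.40):
    holed = mcar_mask(complete, rate, rng)
    for name, imputed in (("MS", holed.fillna(holed.mean())),
                          ("RFR", rf_impute(holed))):
        cells = holed.isna().to_numpy()
        err = imputed.to_numpy()[cells] - complete.to_numpy()[cells]
        rmse = np.sqrt((err ** 2).mean())
        scc = spearmanr(imputed.to_numpy()[cells],
                        complete.to_numpy()[cells]).correlation
        print(f"rate={rate:.0%} {name}: RMSE={rmse:.2f} SCC={scc:.2f}")
```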

1997 ◽  
Vol 22 (4) ◽  
pp. 407-424 ◽  
Author(s):  
Alan L. Gross

The posterior distribution of the bivariate correlation (ρxy) is analytically derived given a data set consisting of N1 cases measured on both x and y, N2 cases measured only on x, and N3 cases measured only on y. The posterior distribution is shown to be a function of the subsample sizes, the sample correlation (rxy) computed from the N1 complete cases, a set of four statistics which measure the extent to which the missing data are not missing completely at random, and the specified prior distribution for ρxy. A sampling study suggests that in small (N = 20) and moderate (N = 50) sized samples, posterior Bayesian interval estimates will dominate maximum likelihood based estimates in terms of coverage probability and expected interval width when the prior distribution for ρxy is simply uniform on (0, 1). The advantage of the Bayesian method is less consistent when more informative priors based on beta densities are employed.
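Gross derives the exact posterior, which is not reproduced here. As a loose illustration of the idea of a gridded posterior for a correlation under a Uniform(0, 1) prior, the sketch below uses only the N1 complete cases and the Fisher-z approximation atanh(r) ~ Normal(atanh(ρ), 1/(N1 − 3)); it ignores the x-only and y-only cases, so it is an assumption-laden stand-in, not the paper's result.

```python
# Grid approximation of p(rho | r, N1) under a Uniform(0, 1) prior,
# using the Fisher-z likelihood as a stand-in for the exact density.
import numpy as np
from scipy.stats import norm

def grid_posterior(r, n1, grid=np.linspace(0.001, 0.999, 999)):
    """Posterior of rho on a grid, normalized numerically."""
    like = norm.pdf(np.arctanh(r), loc=np.arctanh(grid),
                    scale=1.0 / np.sqrt(n1 - 3))
    dx = grid[1] - grid[0]
    return grid, like / (like.sum() * dx)

grid, post = grid_posterior(r=0.40, n1=20)
cdf = np.cumsum(post) * (grid[1] - grid[0])
lo = grid[np.searchsorted(cdf, 0.025)]
hi = grid[np.searchsorted(cdf, 0.975)]
print(f"approximate 95% posterior interval for rho: ({lo:.2f}, {hi:.2f})")
```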


2021 ◽  
Author(s):  
Trenton J. Davis ◽  
Tarek R. Firzli ◽  
Emily A. Higgins Keppler ◽  
Matt Richardson ◽  
Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected when conducting data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results conclude that, in these two data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types as opposed to missing completely at random (MCAR). Gibbs sampler imputation and random forest gave the best results when imputing MAR and MNAR data, compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
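MetabImpute is an R package and its API is not shown in the abstract; the sketch below is a Python analogue of the within-replicate idea only: each feature is imputed inside each replicate group (here with the group half-minimum, one of the single-value methods the study compares) rather than across the whole peak table. Column names and the toy table are invented.

```python
# Hedged sketch of within-replicate imputation on a tiny peak table.
import numpy as np
import pandas as pd

peaks = pd.DataFrame({
    "sample": ["A1", "A2", "A3", "B1", "B2", "B3"],
    "group":  ["A", "A", "A", "B", "B", "B"],    # biological replicates
    "feat_1": [10.2, np.nan, 9.8, 3.1, 2.9, np.nan],
    "feat_2": [np.nan, np.nan, np.nan, 7.5, 8.1, 7.9],
})

def within_replicate_halfmin(df, group_col="group"):
    """Impute each feature inside each replicate group with half its minimum.
    A feature entirely missing within a group (feat_2 in group A) stays NaN,
    treating it as plausibly absent rather than unmeasured."""
    feats = df.columns.difference([group_col, "sample"])
    out = df.copy()
    for g, idx in df.groupby(group_col).groups.items():
        block = out.loc[idx, feats]
        out.loc[idx, feats] = block.fillna(block.min() / 2)
    return out

print(within_replicate_halfmin(peaks))
```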


2020 ◽  
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract Background: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data throughout the human genome. This naturally entails missing-value management, which can complicate the analysis of the data. Several general and specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performance under different missing data mechanisms, (completely) at random or not, and different representations of DNA methylation levels (β- and M-value). Results: We make an extensive analysis of the imputation performance of seven imputation methods on simulated missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) methylation data. We further consider imputation performance on the popular β- and M-value representations of methylation levels. Overall, β-values enable better imputation performance than M-values. Imputation accuracy is lower for mid-range β-values, while it is generally higher for values at the extremes of the β-value range. The MAR value distribution is on average denser in the mid-range in comparison to the expected β-value distribution; as a consequence, MAR values are on average harder to impute. Conclusions: The results of the analysis provide guidelines for the most suitable imputation approaches for DNA methylation data under different representations of DNA methylation levels and different missing data mechanisms.
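The two representations the study compares are related by the standard logit transform from the methylation literature (Du et al., 2010): M = log2(β / (1 − β)). A short sketch of the conversion in both directions; the clipping epsilon is an assumption to keep the logit finite at the extremes.

```python
# Convert between beta-values in (0, 1) and M-values in (-inf, inf).
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Logit2 transform: beta -> M."""
    b = np.clip(beta, eps, 1 - eps)
    return np.log2(b / (1 - b))

def m_to_beta(m):
    """Inverse transform: M -> beta."""
    return 2.0 ** m / (2.0 ** m + 1)

beta = np.array([0.05, 0.5, 0.95])
m = beta_to_m(beta)
print(m)              # roughly [-4.25, 0.0, 4.25]
print(m_to_beta(m))   # recovers the original beta-values
```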


2020 ◽  
Vol 18 (2) ◽  
pp. 2-6
Author(s):  
Thomas R. Knapp

Rubin (1976, and elsewhere) claimed that there are three kinds of “missingness”: missing completely at random; missing at random; and missing not at random. He gave examples of each. The article that now follows takes an opposing view by arguing that almost all missing data are missing not at random.
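As a toy illustration of Rubin's taxonomy (an assumption-laden sketch, not Knapp's examples), the snippet below makes y go missing in three ways: depending on nothing (MCAR), on an observed covariate x (MAR), or on y itself (MNAR), and shows how the observed mean drifts under the latter two.

```python
# Three missingness mechanisms applied to the same variable y.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

masks = {
    "MCAR": rng.random(n) < 0.3,                   # independent of the data
    "MAR":  rng.random(n) < 1 / (1 + np.exp(-x)),  # depends on observed x
    "MNAR": rng.random(n) < 1 / (1 + np.exp(-y)),  # depends on y itself
}
for name, miss in masks.items():
    print(f"{name}: mean of observed y = {y[~miss].mean():+.3f} "
          f"(true mean is about 0)")
```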


2020 ◽  
Vol 80 (5) ◽  
pp. 932-954 ◽  
Author(s):  
Jiaying Xiao ◽  
Okan Bulut

Large amounts of missing data can distort item parameter estimation and lead to biased ability estimates in educational assessments. Therefore, missing responses should be handled properly before estimating any parameters. In this study, two Monte Carlo simulation studies were conducted to compare the performance of four methods for handling missing data when estimating ability parameters. The methods were full-information maximum likelihood (FIML), zero replacement, and multiple imputation by chained equations utilizing classification and regression trees (MICE-CART) and random forest imputation (MICE-RFI). For the two imputation methods, missing responses were treated as a valid response category to enhance the accuracy of imputations. Bias, root mean square error, and the correlation between true and estimated ability parameters were used to evaluate the accuracy of ability estimates for each method. Results indicated that FIML outperformed the other methods under most conditions. Zero replacement yielded accurate ability estimates when missing proportions were very high. The performances of MICE-CART and MICE-RFI were quite similar, but these two methods appeared to be affected differently by the missing data mechanism. As the number of items increased and missing proportions decreased, all the methods performed better. In addition, the information on missingness could improve the performance of MICE-RFI and MICE-CART when the data set is sparse and the missing data mechanism is missing at random.
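The sketch below is not the authors' code: it contrasts zero replacement with a MICE-like iterative imputer (sklearn's IterativeImputer) on a simulated 0/1 response matrix, and approximates "missingness as its own category" by appending response indicator columns, which is our assumption about how to carry that information.

```python
# Zero replacement vs. MICE-like imputation on simulated item responses.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n_persons, n_items = 500, 20
theta = rng.normal(size=(n_persons, 1))            # true abilities
b = rng.normal(size=(1, n_items))                  # item difficulties
resp = (rng.random((n_persons, n_items)) <
        1 / (1 + np.exp(-(theta - b)))).astype(float)
resp[rng.random(resp.shape) < 0.2] = np.nan        # MCAR nonresponse

zero_filled = np.nan_to_num(resp, nan=0.0)         # "missing = incorrect"

indicators = np.isnan(resp).astype(float)          # missingness as information
augmented = np.hstack([resp, indicators])
imputed = IterativeImputer(max_iter=5, random_state=0).fit_transform(augmented)
mice_filled = np.clip(np.round(imputed[:, :n_items]), 0, 1)

for name, filled in (("zero", zero_filled), ("MICE-like", mice_filled)):
    score = filled.sum(axis=1)
    r = np.corrcoef(theta.ravel(), score)[0, 1]
    print(f"{name}: corr(total score, true theta) = {r:.3f}")
```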


2007 ◽  
Vol 7 (3) ◽  
pp. 325-338 ◽  
Author(s):  
J. Scott Granberg-Rademacker

This article compares three approaches to handling missing data at the state level under three distinct conditions. Using Monte Carlo simulation experiments, I compare the results from a linear model using listwise deletion (LD), Markov chain Monte Carlo with the Gibbs sampler algorithm (MCMC), and multiple imputation by chained equations (MICE) as approaches to dealing with different severity levels of missing data: missing completely at random (MCAR), missing at random (MAR), and nonignorable missingness (NI). I compare the results from each of these approaches under each missing data condition to the results from the fully observed dataset. The MICE algorithm performs best under most missing data conditions; MCMC provides the most stable parameter estimates across conditions, but often produces estimates that are moderately biased; and LD performs worst under most conditions. I conclude with recommendations for handling missing data in state-level analysis.
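A minimal sketch of the LD-versus-MICE contrast, under assumptions: synthetic MAR data and sklearn's IterativeImputer standing in for MICE. When missingness in x depends on a correlated, fully observed z, listwise deletion biases the mean of x, while imputation that uses z largely recovers it.

```python
# Listwise deletion vs. MICE-like imputation under MAR missingness.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 20_000
z = rng.normal(size=n)                     # fully observed covariate
x = z + rng.normal(size=n)                 # partially missing variable

x_obs = x.copy()
x_obs[rng.random(n) < 1 / (1 + np.exp(-2 * z))] = np.nan   # MAR via z

ld_mean = np.nanmean(x_obs)                # listwise-deletion estimate
filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(
    np.column_stack([x_obs, z]))
mi_mean = filled[:, 0].mean()

print(f"listwise deletion mean of x: {ld_mean:+.3f}   (biased)")
print(f"imputed mean of x:           {mi_mean:+.3f}   (close to 0)")
```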


2020 ◽  
Vol 9 (2) ◽  
pp. 755-763
Author(s):  
Shamihah Muhammad Ghazali ◽  
Norshahida Shaadan ◽  
Zainura Idrus

Missing values occur in many data sets across research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated, with missing values replaced using an imputation method. The pattern of missing values must therefore be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, including the missing data mechanism (MCAR, MAR, and MNAR), the distribution pattern of missingness in terms of percentage, and the gap size. This paper presents the application of several data visualisation tools from five R packages: visdat, VIM, ggplot2, Amelia, and UpSetR. As an illustration, based on an air quality data set in Malaysia, several graphics were produced and discussed to show the contribution of the visualisation tools in providing input and insight on the pattern of data missingness. The results show that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness but long gaps of consecutive missing values.
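The paper works with the R packages named above; the sketch below is a Python stand-in for two of the quantities it visualises: the per-series missing percentage and the "gap size", i.e. the lengths of consecutive runs of missing values in a monitoring time series. The synthetic PM10 series is invented.

```python
# Missing percentage, gap sizes, and a one-row missingness map.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
pm10 = pd.Series(rng.normal(50, 10, size=365))
pm10.iloc[100:130] = np.nan                     # one long outage (gap of 30)
pm10.iloc[rng.integers(0, 365, 10)] = np.nan    # scattered single misses

def gap_sizes(s):
    """Lengths of consecutive-NaN runs in a series."""
    na = s.isna().to_numpy().astype(int)
    run_id = np.cumsum(np.diff(na, prepend=na[0]) != 0)
    return np.unique(run_id[na == 1], return_counts=True)[1].tolist()

print(f"missing: {pm10.isna().mean():.1%}")
print("gap sizes:", sorted(gap_sizes(pm10), reverse=True))

plt.imshow(pm10.isna().to_numpy()[None, :], aspect="auto", cmap="gray_r")
plt.yticks([]); plt.xlabel("day")
plt.title("missingness map (dark = missing)")
plt.show()
```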


2016 ◽  
Author(s):  
Lucas Jardim ◽  
Luis Mauricio Bini ◽  
José Alexandre Felizola Diniz-Filho ◽  
Fabricio Villalobos

Summary: Given the prevalence of missing data on species' traits (the Raunkiaeran shortfall) and its importance for theoretical and empirical investigations, several methods have been proposed to fill sparse databases. Despite its advantages, imputation of missing data can introduce biases. Here, we evaluate the bias in descriptive statistics, model parameters, and phylogenetic signal estimation from imputed databases under different missing and imputing scenarios.

We simulated coalescent phylogenies and traits under Brownian motion and different Ornstein-Uhlenbeck evolutionary models. Missing values were created using three scenarios: missing completely at random, missing at random but phylogenetically structured, and missing at random but correlated with some other variable. We considered four methods for handling missing data: deleting missing values, imputation based on the observed mean trait value, Phylogenetic Eigenvector Maps, and Multiple Imputation by Chained Equations. Finally, we assessed estimation errors of descriptive statistics (mean, variance), the regression coefficient, Moran's correlogram, and Blomberg's K for imputed traits.

We found that the percentage of missing data, the missing mechanism, the Ornstein-Uhlenbeck strength, and the handling method were important in determining estimation errors. When data were missing completely at random, descriptive statistics were well estimated, but Moran's correlogram and Blomberg's K were not, depending on the handling method. Handling methods performed worse when data were missing at random but phylogenetically structured; in this case, adding phylogenetic information provided better estimates. Although the error caused by imputation was correlated with estimation errors, the relationship is not linear, with estimation errors getting larger as the imputation error increases.

Imputed trait databases can bias ecological and evolutionary analyses. We advise researchers to share their raw data along with their imputed database, flagging imputed data and providing information on the imputation process. Users can and should consider the pattern of missing data and then look for the best method to overcome this problem. In addition, we suggest the development of phylogenetic methods that consider imputation uncertainty and phylogenetic autocorrelation and that preserve the level of phylogenetic signal of the original data.
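A small sketch of one of the paper's scenarios (missing at random but correlated with another variable), without the phylogenetic machinery, which stays in the authors' R-based tools. Under this mechanism both deletion and mean imputation bias the trait mean, and mean imputation additionally shrinks the variance; all variables here are synthetic.

```python
# Bias in descriptive statistics when missingness tracks a correlate.
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
body_size = rng.normal(size=n)                   # auxiliary variable
trait = 0.8 * body_size + rng.normal(scale=0.6, size=n)

# Larger-bodied species are better sampled; small ones go missing.
miss = rng.random(n) < 1 / (1 + np.exp(2 * body_size))
observed = trait[~miss]

mean_imputed = trait.copy()
mean_imputed[miss] = observed.mean()             # fill with observed mean

print(f"true      mean {trait.mean():+.3f}  var {trait.var():.3f}")
print(f"deletion  mean {observed.mean():+.3f}  var {observed.var():.3f}")
print(f"mean-imp  mean {mean_imputed.mean():+.3f}  var {mean_imputed.var():.3f}")
```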


2018 ◽  
Vol 79 (3) ◽  
pp. 495-511 ◽  
Author(s):  
Dee Duygu Cetin-Berber ◽  
Halil Ibrahim Sari ◽  
Anne Corinne Huggins-Manley

Routing examinees to modules based on their ability level is a very important aspect of computerized adaptive multistage testing. However, the presence of missing responses may complicate estimation of examinee ability, which may result in misrouting of individuals. Therefore, missing responses should be handled carefully. This study investigated four missing data methods in computerized adaptive multistage testing: two imputation techniques, full-information maximum likelihood, and scoring missing responses as incorrect. These methods were examined under the missing completely at random, missing at random, and missing not at random frameworks, as well as other testing conditions. Comparisons were made to baseline conditions where no missing data were present. The results showed that the imputation and full-information maximum likelihood methods outperformed incorrect scoring in terms of average bias, average root mean square error, and the correlation between estimated and true thetas.
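A toy sketch of why the handling choice matters for routing, with an invented cutoff and simulation rather than the study's design: scoring omitted stage-1 responses as incorrect depresses total scores and pushes examinees toward the easier module, while a simple proportion-correct imputation stays closer to complete-data routing.

```python
# Misrouting under "missing = incorrect" vs. a simple imputation.
import numpy as np

rng = np.random.default_rng(6)
n, n_items = 1_000, 10
theta = rng.normal(size=n)
b = rng.normal(size=(1, n_items))
p = 1 / (1 + np.exp(-(theta[:, None] - b)))
complete = (rng.random((n, n_items)) < p).astype(float)
resp = complete.copy()
resp[rng.random((n, n_items)) < 0.25] = np.nan     # omitted responses

def route(scores, cut=5):
    """Hypothetical stage-1 routing rule; the cutoff is invented."""
    return np.where(scores >= cut, "hard", "easy")

true_route = route(complete.sum(axis=1))
as_incorrect = route(np.nan_to_num(resp, nan=0.0).sum(axis=1))
prop_correct = np.nanmean(resp, axis=1)            # per-person proportion
imputed = np.where(np.isnan(resp), prop_correct[:, None], resp)
as_imputed = route(imputed.sum(axis=1))

for name, routed in (("missing-as-incorrect", as_incorrect),
                     ("proportion-imputed", as_imputed)):
    agree = np.mean(routed == true_route)
    print(f"{name}: {agree:.1%} agreement with complete-data routing")
```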


2020 ◽  
Vol 35 (4) ◽  
pp. 589-614
Author(s):  
Melanie-Angela Neuilly ◽  
Ming-Li Hsieh ◽  
Alex Kigerl ◽  
Zachary K. Hamilton

Research on homicide missing data conventionally posits a Missing At Random pattern, despite the relationship between missing data and clearance. The latter, however, cannot be satisfactorily modeled using variables traditionally available in homicide datasets. For this reason, it has been argued that missingness in homicide data follows a nonignorable pattern instead. Hence, the use of multiple imputation strategies, as recommended in the field for ignorable patterns, would pose a threat to the validity of results obtained in this way. This study examines missing data mechanisms using a set of primary data collected in New Jersey. After comparing Listwise Deletion, Multiple Imputation, Propensity Score Matching, and Log-Multiplicative Association Models, our findings underscore that data in homicide datasets are indeed Missing Not At Random.
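A hedged sketch of a common first diagnostic, not the study's Log-Multiplicative Association Models: regress a missingness indicator on observed covariates such as clearance. An association rules out MCAR; MNAR itself cannot be verified from observed data alone and must be argued substantively, as this study does. The toy mechanism and rates below are invented.

```python
# Does missingness in a field (e.g., offender age) track case clearance?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 3_000
cleared = rng.random(n) < 0.6                      # case cleared by police
# Toy mechanism: age is usually known only when the case is cleared.
missing_age = np.where(cleared, rng.random(n) < 0.1, rng.random(n) < 0.7)

X = cleared.astype(float)[:, None]
model = LogisticRegression().fit(X, missing_age.astype(int))
print(f"coef for 'cleared' predicting missingness: {model.coef_[0][0]:+.2f}")
# A strongly negative coefficient: missingness tracks clearance, so the
# data are at least not MCAR; MAR vs. MNAR hinges on unobserved attributes.
```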

