Missing Completely at Random
Recently Published Documents

TOTAL DOCUMENTS: 55 (five years: 19)
H-INDEX: 16 (five years: 1)
2022 · Vol 2022 · pp. 1-8
Author(s): Xiaoying Lv, Ruonan Zhao, Tongsheng Su, Liyun He, Rui Song, ...

Objective. To explore the optimal fitting path for missing scale data so that the fitted data closely approximate the real situation of the patients’ data. Methods. Based on the complete data set of the SDS of 507 patients with stroke, simulation data sets were constructed in R under the three missing mechanisms, Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), with missing rates of 5%, 10%, 15%, 20%, 25%, 30%, 35%, and 40%. Mean substitution (MS), random forest regression (RFR), and predictive mean matching (PMM) were used to fit the data. Root mean square error (RMSE), the width of the 95% confidence interval (95% CI), and the Spearman correlation coefficient (SCC) were used to evaluate the fitting effect and determine the optimal fitting path. Results. When dealing with missing scale data, the optimal fitting path is as follows: ① under the MCAR mechanism, the MS method is the most convenient when the missing proportion is less than 20%; when it is greater than 20%, the RFR algorithm is the best fitting method. ② Under the MAR mechanism, the MS method is the most convenient when the missing proportion is less than 35%; when it is greater than 35%, RFR shows better correlation. ③ Under the MNAR mechanism, RFR is the best fitting method, especially when the missing proportion is greater than 30%. In practice, complete-case deletion is the most commonly used approach when the missing proportion is small, but the RFR algorithm can greatly expand the usable sample and save clinical research costs when the missing proportion is less than 30%. The best way to handle missing data should be chosen according to the missing mechanism and proportion of the actual data, balancing the statistical capability of the research team, the effectiveness of the method, and the readers’ understanding.
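For readers who want to reproduce this kind of simulation, the R sketch below shows one way to amputate a complete scale data set under MCAR and compare the three imputation methods with RMSE and Spearman correlation. It is not the authors' code; the data frame `sds` and the 20% missing rate are illustrative assumptions.

```r
library(mice)        # ampute(), mice() with "mean" and "pmm" methods
library(missForest)  # random-forest-based imputation

# `sds` is a hypothetical complete numeric data frame of SDS item scores
amp      <- ampute(sds, prop = 0.20, mech = "MCAR")   # 20% MCAR amputation
sds_miss <- amp$amp
mask     <- is.na(sds_miss)

# mean substitution (MS), predictive mean matching (PMM), random forest regression (RFR)
fit_ms  <- complete(mice(sds_miss, method = "mean", m = 1, maxit = 1, printFlag = FALSE))
fit_pmm <- complete(mice(sds_miss, method = "pmm",  m = 5, printFlag = FALSE))
fit_rfr <- missForest(sds_miss)$ximp

# RMSE and Spearman correlation between fitted and true values at the amputed cells
rmse <- function(fit) sqrt(mean((fit[mask] - sds[mask])^2))
scc  <- function(fit) cor(fit[mask], sds[mask], method = "spearman")
sapply(list(MS = fit_ms, PMM = fit_pmm, RFR = fit_rfr),
       function(f) c(RMSE = rmse(f), SCC = scc(f)))
```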


2021 · pp. 1-9
Author(s): Moritz Marbach

Abstract Imputing missing values is an important preprocessing step in data analysis, but the literature offers little guidance on how to choose between imputation models. This letter suggests adopting the imputation model that generates a density of imputed values most similar to those of the observed values for an incomplete variable after balancing all other covariates. We recommend stable balancing weights as a practical approach to balance covariates whose distribution is expected to differ if the values are not missing completely at random. After balancing, discrepancy statistics can be used to compare the density of imputed and observed values. We illustrate the application of the suggested approach using simulated and real-world survey data from the American National Election Study, comparing popular imputation approaches including random forests, hot-deck, predictive mean matching, and multivariate normal imputation. An R package implementing the suggested approach accompanies this letter.
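The letter recommends stable balancing weights, which are available through dedicated software. The R sketch below substitutes simple inverse-probability (odds) weights from a logistic missingness model as a stand-in, only to illustrate the idea of reweighting observed values before comparing densities. The objects `dat`, `y`, `x1`, `x2`, and `y_imp` are hypothetical.

```r
# `dat` is a hypothetical data frame with an incomplete variable y and fully
# observed covariates x1, x2; `y_imp` holds y after some candidate imputation model
dat$miss <- is.na(dat$y)

# stand-in for stable balancing weights: odds weights from a logistic model of
# missingness, so observed rows mimic the covariate profile of rows with missing y
p <- predict(glm(miss ~ x1 + x2, data = dat, family = binomial), type = "response")
w <- p / (1 - p)

# discrepancy between the weighted observed density and the imputed density,
# here via a simple Kolmogorov-Smirnov statistic on a weighted resample
obs_resample <- sample(dat$y[!dat$miss], 5000, replace = TRUE, prob = w[!dat$miss])
ks.test(obs_resample, y_imp[dat$miss])$statistic   # smaller = densities more alike
```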


2021
Author(s): Trenton J. Davis, Tarek R. Firzli, Emily A. Higgins Keppler, Matt Richardson, Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected during data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types, as opposed to missing completely at random (MCAR). Gibbs sampler imputation and Random Forest gave the best results when imputing MAR and MNAR data, compared against single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation approaches led to an increase in the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
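MetabImpute's exact interface is not assumed here; the R sketch below illustrates only the general within-replicate idea, using a half-minimum imputer as a simple stand-in for the paper's Gibbs sampler and Random Forest methods. The objects `peaks` and `rep_id` are hypothetical.

```r
# `peaks` is a hypothetical samples x features data frame of GCxGC peak areas with NAs;
# `rep_id` labels each row with its biological replicate group
half_min_impute <- function(x) {            # simple stand-in for the package's imputers
  if (all(is.na(x))) return(x)              # leave features with no observed value untouched
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

# within-replicate imputation: impute each feature separately inside each replicate
# group instead of across the whole data set (rows come back grouped by replicate)
peaks_wr <- do.call(rbind, lapply(split(peaks, rep_id), function(grp) {
  as.data.frame(lapply(grp, half_min_impute))
}))
```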


Author(s): A. Iodice D’Enza, A. Markos, F. Palumbo

Abstract Standard multivariate techniques like Principal Component Analysis (PCA) are based on the eigendecomposition of a matrix and therefore require complete data sets. Recent comparative reviews of PCA algorithms for missing data showed the regularised iterative PCA algorithm (RPCA) to be effective. This paper presents two chunk-wise implementations of RPCA suitable for the imputation of “tall” data sets, that is, data sets with many observations. A “chunk” is a subset of the whole set of available observations. In particular, one implementation is suitable for distributed computation, as it imputes each chunk independently. The other implementation is instead suitable for incremental computation, where the imputation of each new chunk is based on all the chunks analysed so far. The proposed procedures were compared to batch RPCA considering different data sets and missing data mechanisms. Experimental results showed that the distributed approach performed similarly to batch RPCA for data with entries missing completely at random. The incremental approach showed appreciable performance when the data are not missing completely at random and the first analysed chunks contain sufficient information on the data structure.
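As a rough illustration of the distributed (independent-chunk) variant, the R sketch below imputes each chunk of rows separately with missMDA's regularised iterative PCA and stacks the completed chunks. It is not the authors' implementation; the matrix `X`, the four chunks, and `ncp = 2` are assumptions.

```r
library(missMDA)   # imputePCA(): regularised iterative PCA imputation

# `X` is a hypothetical tall numeric matrix with missing entries
chunks <- split(seq_len(nrow(X)), cut(seq_len(nrow(X)), breaks = 4, labels = FALSE))

# distributed variant: impute each chunk of rows independently with RPCA,
# then stack the completed chunks back together
X_imp <- do.call(rbind, lapply(chunks, function(idx) {
  imputePCA(X[idx, , drop = FALSE], ncp = 2)$completeObs
}))
```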


2021
Author(s): Abduruhman Fahad Alajmi, Hmoud Al-Olimat, Reham Abu Ghaboush, Nada A. Al Buniaian

An online questionnaire was distributed to the target population (N ≈ 2000), and 226 completed forms were received from respondents. Missing values did not exceed 6% of cases for any variable. Missing data analysis was then conducted with Little’s (1988) missing completely at random test. The result was not significant, χ²(59) = 73.340, p = .099, suggesting that the values were missing entirely by chance. Thus, the missing values in the dataset were estimated with the expectation–maximization algorithm. To examine outliers, the data were evaluated for univariate and multivariate outliers by computing the Mahalanobis distance for each participant. An outlier was defined as a Mahalanobis distance exceeding the critical value (cv = 55.32); 31 cases (13%) were identified as univariate or multivariate outliers (Tabachnick & Fidell, 2013; McLachlan, 1999).
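The same screening steps can be sketched in R, using naniar for Little's MCAR test, Amelia's EM-based imputation as a stand-in for the expectation–maximization step described here, and base R Mahalanobis distances. The data frame `survey` and the p < .001 outlier criterion are illustrative assumptions, not the authors' software or settings.

```r
library(naniar)   # mcar_test(): Little's (1988) MCAR test
library(Amelia)   # amelia(): EM-with-bootstrap imputation, stand-in for plain EM

# `survey` is a hypothetical numeric data frame of questionnaire item scores
mcar_test(survey)                               # non-significant p-value is consistent with MCAR

imp <- amelia(survey, m = 1)$imputations[[1]]   # single imputed data set

# multivariate outliers: Mahalanobis distance against a chi-square critical value
d2 <- mahalanobis(imp, colMeans(imp), cov(imp))
cv <- qchisq(1 - 0.001, df = ncol(imp))         # p < .001 criterion (Tabachnick & Fidell)
outliers <- which(d2 > cv)
```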


2021
Author(s): Luiz Felipe Sousa, Adam Dreyton Ferreira Santos, João Weyl Albuquerque Costa

A common problem in large data sets is missing information, whether due to failures in the capture sensors, loss during transmission, or any other situation that results in data loss. Faced with this situation, researchers often choose to disregard the missing data and remove them from the set; however, this exclusion can produce invalid inferences, especially if the data that remain in the analysis differ from those that were excluded. To address this problem in structural health monitoring (SHM) data sets, this work uses the recurrent neural networks Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) to perform the missing-data imputation task. In a step prior to imputation, the data were artificially amputated at rates of 25%, 50%, and 75%, assuming the Missing Completely at Random (MCAR) mechanism. The imputation techniques were evaluated using the Mean Absolute Percentage Error (MAPE). A damage detection step was then applied: the imputed data sets were submitted to the Mahalanobis Square Distance (MSD) and Kernel Principal Component Analysis (KPCA) algorithms in order to obtain the detected Type I and Type II error rates. From the results obtained, it was observed that LSTM-based imputation achieved better results than GRU at all amputation rates; this better performance was also noticeable in the damage detection step, where the data sets imputed by LSTM achieved better Type I and Type II error detection results.
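A minimal R keras sketch of the general pipeline described here, assuming a hypothetical numeric sensor vector `signal` and a window length of 20: train a small LSTM to predict a reading from its preceding window, then use the prediction to fill an artificially removed value and score it with MAPE. The architecture and window length are illustrative rather than the authors' configuration, and running it requires a configured Keras/TensorFlow backend.

```r
library(keras)   # R interface to Keras

# build sliding windows of length `win` over the hypothetical series `signal`
win <- 20
idx <- seq_len(length(signal) - win)
X <- array(t(sapply(idx, function(i) signal[i:(i + win - 1)])), dim = c(length(idx), win, 1))
y <- signal[idx + win]

model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(win, 1)) %>%   # swap in layer_gru() for the GRU variant
  layer_dense(units = 1)
model %>% compile(optimizer = "adam", loss = "mae")
model %>% fit(X, y, epochs = 10, batch_size = 64, verbose = 0)

# impute an artificially removed reading from its preceding 20 values and score it with MAPE
pred <- predict(model, array(signal[81:100], dim = c(1, win, 1)))
mape <- abs((signal[101] - pred) / signal[101]) * 100
```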


PLoS ONE · 2020 · Vol 15 (12) · pp. e0243487
Author(s): Michael Lenz, Andreas Schulz, Thomas Koeck, Steffen Rapp, Markus Nagler, ...

Targeted proteomics utilizing antibody-based proximity extension assays provides sensitive and highly specific quantification of plasma protein levels. Multivariate analysis of these data is hampered by frequent missing values (random or left-censored), calling for imputation approaches. While appropriate missing-value imputation methods exist, benchmarks of their performance on targeted proteomics data are lacking. Here, we assessed the performance of two methods for imputing values missing completely at random: the previously top-benchmarked ‘missForest’ and the recently published ‘GSimp’ method. Evaluation was accomplished by comparing imputed with remeasured relative concentrations of 91 inflammation-related circulating proteins in 86 samples from a cohort of 645 patients with venous thromboembolism. The median Pearson correlation between imputed and remeasured protein expression values was 69.0% for missForest and 71.6% for GSimp (p = 5.8e-4). Imputation with missForest resulted in a stronger reduction of variance compared to GSimp (median relative variance of 25.3% vs. 68.6%, p = 2.4e-16) and an undesirably larger bias in downstream analyses. Irrespective of the imputation method used, the 91 imputed proteins showed large variations in imputation accuracy, driven by differences in signal-to-noise ratio and information overlap between proteins. In summary, GSimp outperformed missForest, while both methods showed good overall imputation accuracy with large variations between proteins.
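GSimp is distributed as standalone R code whose interface is not assumed here; the sketch below covers only the missForest arm of such a benchmark and the per-protein Pearson correlation between imputed and remeasured values. The objects `npx` and `npx_true` are hypothetical.

```r
library(missForest)

# `npx` is a hypothetical samples x proteins matrix of NPX values with entries
# missing completely at random; `npx_true` holds the remeasured reference values
set.seed(1)
mask    <- is.na(npx)
npx_imp <- missForest(as.data.frame(npx))$ximp

# per-protein Pearson correlation between imputed and remeasured values,
# restricted to the entries that were actually imputed
acc <- sapply(seq_len(ncol(npx)), function(j) {
  m <- mask[, j]
  if (sum(m) < 3) return(NA)                 # too few imputed entries to correlate
  cor(npx_imp[m, j], npx_true[m, j], method = "pearson")
})
median(acc, na.rm = TRUE)
```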

