Exploring the Impact of Missing Data on Residual-Based Dimensionality Analysis for Measurement Models

2020 ◽  
pp. 001316442093963
Author(s):  
Stefanie A. Wind ◽  
Randall E. Schumacker

Researchers frequently use Rasch models to analyze survey responses because these models provide accurate parameter estimates for items and examinees when there are missing data. However, researchers have not fully considered how missing data affect the accuracy of dimensionality assessment in Rasch analyses such as principal components analysis (PCA) of standardized residuals. Because adherence to unidimensionality is a prerequisite for the appropriate interpretation and use of Rasch model results, insight into the impact of missing data on the accuracy of this approach is critical. We used a simulation study to examine the accuracy of standardized residual PCA with various proportions of missing data and multidimensionality. We also explored an adaptation of modified parallel analysis in combination with standardized residual PCA as a source of additional information about dimensionality when missing data are present. Our results suggested that missing data impact the accuracy of PCA on standardized residuals, and that the adaptation of modified parallel analysis provides useful supplementary information about dimensionality when there are missing data.
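To make the diagnostic concrete, here is a minimal base-R sketch of standardized residual PCA in a simulation setting; the sample size, missingness rate, and use of the true generating parameters in place of Rasch estimates are illustrative assumptions rather than the authors' design.

```r
## Illustrative sketch (not the authors' code): PCA of standardized Rasch
## residuals with randomly missing responses. True generating parameters are
## used in place of Rasch estimates to keep the example self-contained.
set.seed(1)
n_persons <- 500
n_items   <- 20
theta <- rnorm(n_persons)                      # person abilities
b     <- seq(-2, 2, length.out = n_items)      # item difficulties

p <- plogis(outer(theta, b, "-"))              # Rasch expected probabilities

x <- matrix(rbinom(length(p), 1, p), n_persons, n_items)       # simulated responses
x[matrix(runif(length(x)) < 0.20, n_persons, n_items)] <- NA   # 20% missing at random

z <- (x - p) / sqrt(p * (1 - p))               # standardized residuals

# PCA of the residual correlation matrix, computed pairwise to tolerate missing
# cells; a large first eigenvalue (rules of thumb often use > 2) is read as
# evidence of a secondary dimension.
R  <- cor(z, use = "pairwise.complete.obs")
ev <- eigen(R, symmetric = TRUE)$values
head(ev, 3)
```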

2021 ◽  
Vol 45 (3) ◽  
pp. 159-177
Author(s):  
Chen-Wei Liu

Missing not at random (MNAR) modeling for non-ignorable missing responses usually assumes that the latent variables follow a bivariate normal distribution. This assumption is rarely verified and is often adopted as a standard in practice. Recent studies of “complete” item responses (i.e., no missing data) have shown that ignoring a nonnormal distribution of a unidimensional latent variable, especially a skewed or bimodal one, can yield biased estimates and misleading conclusions. However, handling a bivariate nonnormal latent variable distribution in the presence of MNAR data has not yet been investigated. This article proposes extending the unidimensional empirical histogram and Davidian curve methods to deal simultaneously with a nonnormal latent variable distribution and MNAR data. A simulation study is carried out to demonstrate the consequences for parameter estimates of ignoring a bivariate nonnormal distribution, followed by an empirical analysis of “don’t know” item responses. The results show that checking the latent variable distribution for bivariate nonnormality should be routine in MNAR modeling in order to minimize the impact of nonnormality on parameter estimates.
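For the unidimensional building blocks the article extends, nonnormal latent densities are available in the R package mirt; the sketch below only illustrates those options on simulated complete data (the bivariate MNAR extension proposed in the article is not shown), and the generating values are placeholders.

```r
## Illustration only: unidimensional IRT with empirical-histogram and
## Davidian-curve latent densities in mirt. The bivariate MNAR extension
## proposed in the article is not implemented here; data are simulated.
library(mirt)
set.seed(1)

a <- matrix(rep(1.2, 10))                          # slopes (placeholder values)
d <- matrix(seq(-1.5, 1.5, length.out = 10))       # intercepts (placeholder values)
resp <- simdata(a, d, N = 1000, itemtype = "dich")

fit_norm <- mirt(resp, 1, itemtype = "2PL")                              # normal density
fit_eh   <- mirt(resp, 1, itemtype = "2PL", dentype = "empiricalhist")   # empirical histogram
fit_dc   <- mirt(resp, 1, itemtype = "2PL", dentype = "Davidian-4")      # Davidian curve

anova(fit_norm, fit_dc)   # compare fit under normal vs. nonnormal latent density
```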


2021 ◽  
Author(s):  
Trenton J. Davis ◽  
Tarek R. Firzli ◽  
Emily A. Higgins Keppler ◽  
Matt Richardson ◽  
Heather D. Bean

Missing data is a significant issue in metabolomics that is often neglected during data pre-processing, particularly when it comes to imputation. This can have serious implications for downstream statistical analyses and lead to misleading or uninterpretable inferences. In this study, we aim to identify the primary types of missingness that affect untargeted metabolomics data and compare strategies for imputation using two real-world comprehensive two-dimensional gas chromatography (GC×GC) data sets. We also present these goals in the context of experimental replication, whereby imputation is conducted in a within-replicate-based fashion (the first description and evaluation of this strategy), and introduce an R package, MetabImpute, to carry out these analyses. Our results indicate that, in these two data sets, missingness was most likely of the missing at random (MAR) and missing not at random (MNAR) types rather than missing completely at random (MCAR). Gibbs sampler imputation and random forest imputation gave the best results when imputing MAR and MNAR data, compared with single-value imputation (zero, minimum, mean, median, and half-minimum) and other more sophisticated approaches (Bayesian principal components analysis and quantile regression imputation for left-censored data). When samples are replicated, within-replicate imputation increased the reproducibility of peak quantification compared to imputation that ignores replication, suggesting that imputing with respect to replication may preserve potentially important features in downstream analyses for biomarker discovery.
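The MetabImpute interface is not described in the abstract, so the following sketch only illustrates the general idea of within-replicate imputation, using a simple half-minimum rule; the function names and toy data are assumptions.

```r
## Sketch of within-replicate imputation (not the MetabImpute API): each
## replicate group is imputed separately, here with a half-minimum rule per
## feature. Function names and the toy data are assumptions.
impute_half_min <- function(x) {
  if (all(is.na(x))) return(x)               # nothing observed to impute from
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

within_replicate_impute <- function(peaks, replicate_id) {
  # peaks: numeric matrix (samples x features); replicate_id: grouping vector
  for (g in unique(replicate_id)) {
    rows <- replicate_id == g
    peaks[rows, ] <- apply(peaks[rows, , drop = FALSE], 2, impute_half_min)
  }
  peaks
}

# Toy example: 6 samples in 2 replicate groups, 3 features, 4 missing cells
set.seed(1)
peaks <- matrix(rlnorm(18), nrow = 6)
peaks[sample(length(peaks), 4)] <- NA
within_replicate_impute(peaks, replicate_id = rep(c("A", "B"), each = 3))
```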


2021 ◽  
Author(s):  
Cori Pegliasco ◽  
Antoine Delepoulle ◽  
Rosemary Morrow ◽  
Yannice Faugère ◽  
Gérald Dibarboure

Abstract. This paper presents the new global Mesoscale Eddy Trajectories Atlases (META3.1exp DT all-satellites, https://doi.org/10.24400/527896/a01-2021.001, Pegliasco et al., 2021a, and META3.1exp DT two-satellites, https://doi.org/10.24400/527896/a01-2021.002, Pegliasco et al., 2021b), composed of eddy identifications and trajectories produced from altimetric maps. The detection method is inherited from the py-eddy-tracker algorithm developed by Mason et al. (2014), optimized to handle large datasets, and thus long time series, efficiently. These products improve on the META2.0 product, produced by SSALTO/DUACS and distributed by AVISO+ (https://aviso.altimetry.fr) with support from CNES, in collaboration with Oregon State University with support from NASA, and based on Chelton et al. (2011). META3.1exp provides supplementary information such as the mesoscale eddy shapes, with the eddy edges and their maximum speed contour, and the eddy speed profiles from the center to the edge. The tracking algorithm is based on overlapping contours, includes virtual observations, and filters out the shortest trajectories. The absolute dynamic topography field is now used for eddy detection, instead of the sea level anomaly maps, to better represent the ocean dynamics in the more energetic areas and close to coasts and islands. To evaluate the impact of the changes from META2.0 to META3.1exp, a comparison methodology has been applied. The similarity coefficient, based on the ratio between the eddies' overlap and their cumulative area, allows an extensive comparison of the different datasets in terms of geographic distribution, statistics of the main physical characteristics, changes in the lifetime of the trajectories, etc. After evaluating the impact of each change separately, we conclude that the major differences between META3.1exp and META2.0 are due to the change in the detection algorithm. META3.1exp contains smaller eddies and trajectories lasting at least 10 days that were not available in the distributed META2.0 product. Nevertheless, 55 % of the structures in META2.0 have a similar counterpart in META3.1exp, ensuring continuity between the two products, and the physical characteristics of the common eddies are close. Geographically, the eddy distribution differs mainly in the strong current regions, where the mean dynamic topography gradients are sharp. The additional information on the eddy contours allows more accurate collocation of mesoscale structures with data from other sources, so META3.1exp is recommended for multi-disciplinary applications.
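One possible reading of the comparison metric (the abstract does not define "cumulative area" precisely; the union of the two eddy areas is assumed here, and reading it as the sum of the two areas would only rescale the denominator): the similarity coefficient between two eddies $E_1$ and $E_2$ with area measure $\mathcal{A}(\cdot)$ can be written as

$$ S(E_1, E_2) \;=\; \frac{\mathcal{A}(E_1 \cap E_2)}{\mathcal{A}(E_1 \cup E_2)}, $$

so that $S = 0$ for non-overlapping eddies and $S = 1$ for identical contours.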


2021 ◽  
Vol 12 ◽  
Author(s):  
Lihan Chen ◽  
Victoria Savalei

In missing data analysis, reporting the missing data rate alone is insufficient for readers to determine the impact of missing data on the efficiency of parameter estimates. A more diagnostic measure, the fraction of missing information (FMI), quantifies how much the standard errors of parameter estimates increase because of the information lost to ignorable missing data. FMI is well known in the multiple imputation literature (Rubin, 1987), but it has only more recently been developed for full information maximum likelihood (FIML) estimation (Savalei and Rhemtulla, 2012). Sample FMI estimates based on this approach have since been made available in the lavaan package (Rosseel, 2012) for the R statistical programming language. However, the properties of FMI estimates at finite sample sizes have not been comprehensively investigated. In this paper, we present a simulation study on the properties of three sample FMI estimates from FIML in two models commonly used in psychology: regression and two-factor analysis. We summarize the performance of these FMI estimates and make recommendations for their application.
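Since sample FMI estimates are exposed by lavaan, a minimal sketch of obtaining them from a FIML fit is given below; the regression model, variable names, and missingness pattern are illustrative and not taken from the study.

```r
## Minimal sketch: sample FMI estimates from a FIML fit in lavaan.
## The regression model and variable names are illustrative.
library(lavaan)
set.seed(1)

dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
dat$y <- 0.5 * dat$x1 + 0.3 * dat$x2 + rnorm(300)
dat$y[runif(300) < 0.25] <- NA        # 25% ignorable missingness on the outcome

fit <- sem("y ~ x1 + x2", data = dat, missing = "fiml", fixed.x = FALSE)

# Fraction of missing information for each free parameter
lavInspect(fit, "fmi")
```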


2021 ◽  
pp. 001316442110220
Author(s):  
David Goretzko

Determining the number of factors in exploratory factor analysis is arguably the most crucial decision a researcher faces when conducting the analysis. While several simulation studies compare various so-called factor retention criteria under different data conditions, little is known about the impact of missing data on this process. Hence, in this study, we evaluated the accuracy of different factor retention criteria in determining the number of factors when data are missing: the Factor Forest, parallel analysis based on a principal component analysis, parallel analysis based on the common factor model, and the comparison data approach. Each criterion was combined with different missing data methods, namely the expectation-maximization-based Amelia algorithm, predictive mean matching, and random forest imputation within the multiple imputation by chained equations (MICE) framework, as well as pairwise deletion. Data were simulated for different sample sizes, numbers of factors, numbers of manifest variables (indicators), between-factor correlations, missing data mechanisms, and proportions of missing values. In the majority of conditions and for all factor retention criteria except the comparison data approach, the missing data mechanism had little impact on accuracy, and pairwise deletion performed comparably to the more sophisticated imputation methods. In some conditions, however, especially small-sample cases and when comparison data were used to determine the number of factors, random forest imputation was preferable to the other missing data methods. Accordingly, depending on the data characteristics and the selected factor retention criterion, choosing an appropriate missing data method is crucial to obtaining a valid estimate of the number of factors to extract.
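As an illustration of one cell of this design, the sketch below combines random forest imputation in mice with parallel analysis based on the common factor model via psych::fa.parallel; the simulated two-factor data and all settings are placeholders, and the Factor Forest and comparison data approaches are not shown.

```r
## Sketch of one design cell: random forest imputation (mice) followed by
## parallel analysis (common factor version) on each completed data set.
## Simulated two-factor data; all settings are placeholders.
library(mice)    # method = "rf" needs a random forest backend (randomForest/ranger)
library(psych)
set.seed(1)

# Simulate 12 indicators loading on 2 correlated factors, then 20% MCAR holes
Phi <- matrix(c(1, .3, .3, 1), 2)
L   <- cbind(c(rep(.7, 6), rep(0, 6)), c(rep(0, 6), rep(.7, 6)))
X   <- MASS::mvrnorm(400, rep(0, 2), Phi) %*% t(L) +
       matrix(rnorm(400 * 12, sd = sqrt(1 - .49)), 400, 12)
X[matrix(runif(length(X)) < 0.20, nrow(X), ncol(X))] <- NA
X <- as.data.frame(X)

# Random forest imputation within the MICE framework
imp <- mice(X, m = 5, method = "rf", printFlag = FALSE, seed = 1)

# Parallel analysis on each completed data set; tabulate the suggested factor counts
n_factors <- sapply(1:5, function(i)
  fa.parallel(complete(imp, i), fa = "fa", plot = FALSE)$nfact)
table(n_factors)
```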


Methodology ◽  
2015 ◽  
Vol 11 (3) ◽  
pp. 89-99 ◽  
Author(s):  
Leslie Rutkowski ◽  
Yan Zhou

Abstract. Given a consistent interest in comparing achievement across sub-populations in international assessments such as TIMSS, PIRLS, and PISA, it is critical that sub-population achievement be estimated reliably and with sufficient precision. As such, we systematically examine the limitations of the estimation methods currently used by these programs. Using a simulation study along with empirical results from the 2007 cycle of TIMSS, we show that a combination of missing and misclassified data in the conditioning model induces biases in sub-population achievement estimates, the magnitude and degree of which can be readily explained by data quality. Importantly, estimated biases in sub-population achievement are limited to the conditioning variable with poor-quality data, while other sub-population achievement estimates are unaffected. The findings are generally in line with theory on missing and error-prone covariates. The current research adds to a small body of literature that has noted some of the limitations of sub-population estimation.


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, the result can be a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the software now available, applying these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. This article is Part 1. It first introduces Rubin's classical definition of missing data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, it presents a selection of visualization tools available in different R packages for describing and exploring missing data structures.
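For orientation, a few typical examples of such tools are listed below; the specific packages and functions are common choices and not necessarily the ones covered in the article.

```r
## Common R tools for describing and visualizing missing data structures;
## these are illustrative choices, not necessarily the article's selection.
library(mice)     # md.pattern(): tabulates missing data patterns
library(VIM)      # aggr(), marginplot(): aggregate and bivariate missingness plots
library(naniar)   # vis_miss(), gg_miss_var(): ggplot2-based overviews

data(airquality)                                 # built-in data set with missing values

md.pattern(airquality)                           # pattern matrix of observed/missing cells
aggr(airquality, numbers = TRUE)                 # proportion missing per variable and pattern
marginplot(airquality[, c("Ozone", "Solar.R")])  # missingness shown in scatterplot margins
vis_miss(airquality)                             # heatmap of missing cells
gg_miss_var(airquality)                          # missing counts per variable
```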


Author(s):  
Eman Al-erqi ◽  
Mohd Lizam Mohd Diah ◽  
Najmaddin Abo Mosali ◽  
...  

This study addresses the impact of service quality on international students' satisfaction and, in turn, their loyalty to Universiti Tun Hussein Onn Malaysia (UTHM). The aim of the study is to model the relationship between service quality factors and loyalty to the university from the international students' perspective. The study adopted a quantitative approach in which data were collected through a questionnaire survey and analysed statistically. A total of 246 responses were received and found to be valid. The model was developed and analysed using the AMOS SEM software. The confirmatory factor analysis (CFA) function of the software was used to assess the measurement models, and all models achieved goodness of fit. The path analysis function was then used to assess the structural model, which showed that service quality factors have a significant effect on students' satisfaction and thereby affect their loyalty to the university. The outcomes of this study should help the university improve the services it provides, especially to international students.
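The analysis was run in AMOS; a rough lavaan analogue of the reported two-step workflow (CFA for the measurement models, then the structural paths) is sketched below, with construct names, indicators, and data-generating values all placeholders.

```r
## Rough lavaan analogue of the reported AMOS workflow (CFA, then structural
## paths). Constructs, indicators, and generating values are placeholders.
library(lavaan)
set.seed(1)

# Placeholder population model used only to generate example data
pop <- '
  ServiceQuality =~ 0.8*sq1 + 0.8*sq2 + 0.8*sq3 + 0.8*sq4
  Satisfaction   =~ 0.8*sa1 + 0.8*sa2 + 0.8*sa3
  Loyalty        =~ 0.8*lo1 + 0.8*lo2 + 0.8*lo3
  Satisfaction ~ 0.6*ServiceQuality
  Loyalty      ~ 0.5*Satisfaction + 0.3*ServiceQuality
'
survey_data <- simulateData(pop, sample.nobs = 246)

# Measurement model plus structural paths
model <- '
  ServiceQuality =~ sq1 + sq2 + sq3 + sq4
  Satisfaction   =~ sa1 + sa2 + sa3
  Loyalty        =~ lo1 + lo2 + lo3
  Satisfaction ~ ServiceQuality
  Loyalty      ~ Satisfaction + ServiceQuality
'
fit <- sem(model, data = survey_data)
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))   # goodness-of-fit indices
summary(fit, standardized = TRUE)                    # structural path estimates
```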


Author(s):  
Robert F Engle ◽  
Martin Klint Hansen ◽  
Ahmet K Karagozoglu ◽  
Asger Lunde

Abstract. Motivated by the recent availability of extensive electronic news databases and the advent of new empirical methods, there has been renewed interest in investigating the impact of financial news on market outcomes for individual stocks. We develop the information processing hypothesis of return volatility to investigate the relation between firm-specific news and volatility. We propose a novel dynamic econometric specification and test it using time series regressions employing a machine learning model selection procedure. Our empirical results are based on a comprehensive dataset comprising more than 3 million news items for a sample of 28 large U.S. companies. Our proposed econometric specification for firm-specific return volatility is a simple mixture model with two components: public information and private processing of public information. The public information processing component is defined by the contemporaneous relation between public information and volatility, while the private processing of public information component is specified as a general autoregressive process corresponding to the sequential price discovery mechanism of investors as additional information, previously not publicly available, is generated and incorporated into prices. Our results show that changes in return volatility are related to public information arrival and that including indicators of public information arrival explains on average 26% (9–65%) of changes in firm-specific return volatility.
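One possible reading of this two-component specification (the notation, linear functional form, and lag structure are assumptions, not the authors' exact model): with $\sigma_t^2$ the firm-specific return volatility on day $t$ and $N_t$ a contemporaneous measure of public news arrival,

$$ \sigma_t^2 \;=\; \underbrace{\alpha + \beta N_t}_{\text{public information}} \;+\; \underbrace{\sum_{j=1}^{p} \phi_j \, \sigma_{t-j}^2}_{\text{private processing of public information}} \;+\; \varepsilon_t. $$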


Mathematics ◽  
2021 ◽  
Vol 9 (6) ◽  
pp. 692
Author(s):  
Clara Calvo ◽  
Carlos Ivorra ◽  
Vicente Liern ◽  
Blanca Pérez-Gladish

Modern portfolio theory deals with the problem of selecting a portfolio of financial assets such that the expected return is maximized for a given level of risk. The forecast of the expected individual assets' returns and risk is usually based on their historical returns. In this work, we consider a situation in which the investor has non-historical additional information that is used to forecast the expected returns. This implies that there is no longer an obvious statistical risk measure, and it poses the problem of selecting an adequate set of diversification constraints to mitigate the risk of the selected portfolio without losing the value of the non-statistical information owned by the investor. To address this problem, we introduce an indicator, the historical reduction index, which measures the expected reduction in the expected return caused by a given set of diversification constraints. We show that it can be used to grade the impact of each possible set of diversification constraints. Hence, the investor can choose from this gradation the set that best fits his or her subjective risk-aversion level.
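A possible reading of the indicator (its exact definition is not given in the abstract): writing $E[R^*]$ for the optimal expected return without diversification constraints and $E[R^*_C]$ for the optimum under constraint set $C$, a natural relative form of the historical reduction index is

$$ \mathrm{HRI}(C) \;=\; \frac{E[R^*] - E[R^*_C]}{E[R^*]}, $$

so that candidate constraint sets can be graded by the share of expected return they are expected to sacrifice.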

