Assessing Model Fit in Latent Class Analysis When Asymptotics Do Not Hold

Methodology ◽  
2015 ◽  
Vol 11 (2) ◽  
pp. 65-79 ◽  
Author(s):  
Geert H. van Kollenburg ◽  
Joris Mulder ◽  
Jeroen K. Vermunt

The application of latent class (LC) analysis involves evaluating the LC model using goodness-of-fit statistics. To assess the misfit of a specified model, say with the Pearson chi-squared statistic, a p-value can be obtained using an asymptotic reference distribution. However, asymptotic p-values are not valid when the sample size is not large and/or the analyzed contingency table is sparse. Another problem is that for various other conceivable global and local fit measures, asymptotic distributions are not readily available. An alternative way to obtain the p-value for the statistic of interest is by constructing its empirical reference distribution using resampling techniques such as the parametric bootstrap or the posterior predictive check (PPC). In the current paper, we show how to apply the parametric bootstrap and two versions of the PPC to obtain empirical p-values for a number of commonly used global and local fit statistics within the context of LC analysis. The main difference between the PPC using test statistics and the parametric bootstrap is that the former takes parameter uncertainty into account. The PPC using discrepancies has the advantage that it is computationally much less intensive than the other two resampling methods. In a Monte Carlo study we evaluated the Type I error rates and power of these resampling methods when used for global and local goodness-of-fit testing in LC analysis. Results show that both the bootstrap and the PPC using test statistics are generally good alternatives to asymptotic p-values and can also be used when (asymptotic) distributions are not known. Nominal Type I error rates were not met when the sample size was small and the contingency table had many cells. Overall, the PPC using test statistics was somewhat more conservative than the parametric bootstrap. We also replicated previous research suggesting that the Pearson χ2 statistic should in many cases be preferred over the likelihood-ratio G2 statistic. Power to reject a model with one LC fewer than in the population was very high, unless the sample size was small. When the contingency tables were very sparse, the total bivariate residual (TBVR) statistic, which is based on bivariate relationships, still had very high power, signifying its usefulness in assessing model fit.
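To make the resampling logic concrete, the following is a minimal sketch of the parametric-bootstrap p-value computation described above. The `fit_model` interface is hypothetical, standing in for an LC-model fit that returns fitted cell probabilities; only the Pearson chi-squared helper is spelled out.

```python
import numpy as np

def pearson_x2(table, probs):
    # Pearson chi-squared: sum over cells of (observed - expected)^2 / expected
    expected = table.sum() * probs
    return ((table - expected) ** 2 / expected).sum()

def bootstrap_p_value(observed, fit_model, n_boot=500, seed=1):
    # fit_model(table) -> fitted cell probabilities summing to 1
    # (hypothetical interface standing in for an LC-model fit)
    rng = np.random.default_rng(seed)
    n = int(observed.sum())
    probs = fit_model(observed)
    t_obs = pearson_x2(observed, probs)
    exceed = 0
    for _ in range(n_boot):
        # simulate a replicate table from the *fitted* model; the PPC using
        # test statistics would instead first draw parameters from their
        # posterior, which is how it takes parameter uncertainty into account
        rep = rng.multinomial(n, probs.ravel()).reshape(observed.shape)
        exceed += pearson_x2(rep, fit_model(rep)) >= t_obs
    return (exceed + 1) / (n_boot + 1)  # empirical p-value with +1 correction
```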

2020 ◽  
Vol 29 (9) ◽  
pp. 2569-2582
Author(s):  
Miguel A García-Pérez ◽  
Vicente Núñez-Antón

Controversy over the validity of significance tests in the analysis of contingency tables is motivated by the disagreement between asymptotic and exact p values and its dependence on the magnitude of expected frequencies. Variants of Pearson’s X2 statistic and their asymptotic distributions were proposed to overcome these difficulties, but several approaches also exist to conduct exact tests. This paper shows that discrepant asymptotic and exact results may or may not occur regardless of whether expected frequencies are large or small: eventual inaccuracy of asymptotic p values is instead caused by idiosyncrasies of the discrete distribution of X2. More importantly, discrepancies are also artificially created by the hypergeometric sampling model used to perform exact tests. Exact computations under the alternative full-multinomial or product-multinomial models require eliminating nuisance parameters, and we propose a novel method that integrates them out. The resultant exact distributions are very accurately approximated by the asymptotic distribution, which eliminates concerns about the accuracy of the latter. We also argue that the two-stage approach that tests for significance of residuals conditional on a significant X2 test is inadvisable, and that an alternative single-stage test preserves Type I error rates and further eliminates concerns about asymptotic accuracy.
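A rough way to see the asymptotic/exact disagreement is to compare the chi-squared p value with a simulated one under the full-multinomial model. The sketch below plugs in the independence-estimated cell probabilities as nuisance parameters rather than integrating them out as the paper proposes, so it only illustrates the comparison, not the paper's method.

```python
import numpy as np
from scipy.stats import chi2

def x2_stat(table):
    # Pearson X2 for a two-way table, expected counts from the margins
    expected = np.outer(table.sum(1), table.sum(0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def asymptotic_vs_simulated_p(table, n_sim=20000, seed=1):
    rng = np.random.default_rng(seed)
    n = int(table.sum())
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    t_obs = x2_stat(table)
    # full-multinomial sampling with independence probabilities *plugged in*;
    # assumes n is large enough that simulated margins stay positive
    p0 = (np.outer(table.sum(1), table.sum(0)) / n ** 2).ravel()
    sims = rng.multinomial(n, p0, size=n_sim)
    t_sim = np.array([x2_stat(s.reshape(table.shape)) for s in sims])
    return chi2.sf(t_obs, df), float(np.mean(t_sim >= t_obs))
```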


2020 ◽  
Author(s):  
Keith Lohse ◽  
Kristin Sainani ◽  
J. Andrew Taylor ◽  
Michael Lloyd Butson ◽  
Emma Knight ◽  
...  

Magnitude-based inference (MBI) is a controversial statistical method that has been used in hundreds of papers in sports science despite criticism from statisticians. To better understand how this method has been applied in practice, we systematically reviewed 232 papers that used MBI. We extracted data on study design, sample size, and choice of MBI settings and parameters. Median sample size was 10 per group (interquartile range, IQR: 8 – 15) for multi-group studies and 14 (IQR: 10 – 24) for single-group studies; few studies reported a priori sample size calculations (15%). Authors predominantly applied MBI’s default settings and chose “mechanistic/non-clinical” rather than “clinical” MBI even when testing clinical interventions (only 14 studies out of 232 used clinical MBI). Using these data, we can estimate the Type I error rates for the typical MBI study. Authors frequently made dichotomous claims about effects based on the MBI criterion of a “likely” effect and sometimes based on the MBI criterion of a “possible” effect. When the sample size is n=8 to 15 per group, these inferences have Type I error rates of 12%-22% and 22%-45%, respectively. High Type I error rates were compounded by multiple testing: authors reported results from a median of 30 outcome-related tests, and few studies specified a primary outcome (14%). We conclude that MBI has promoted small studies, promulgated a “black box” approach to statistics, and led to numerous papers where the conclusions are not supported by the data. Amidst debates over the role of p-values and significance testing in science, MBI also provides an important natural experiment: we find no evidence that moving researchers away from p-values or null hypothesis significance testing makes them less prone to dichotomization or over-interpretation of findings.
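Type I error rates of this kind can be estimated by simulating null data and counting directional claims. The sketch below assumes MBI's documented defaults (a smallest worthwhile change of 0.2 standardized units and a "likely" threshold of 75%) and uses a deliberately simplified claim rule, so the resulting rate is only a rough stand-in for the paper's calculations.

```python
import numpy as np
from scipy import stats

def likely_claim_rate(n=10, swc=0.2, threshold=0.75, n_sim=20000, seed=1):
    """Share of null simulations producing a 'likely' directional claim.

    swc: smallest worthwhile change in SD units (MBI default 0.2);
    threshold: the 'likely' probability cutoff (0.75). The claim rule
    is a simplified stand-in for mechanistic MBI.
    """
    rng = np.random.default_rng(seed)
    claims = 0
    for _ in range(n_sim):
        a = rng.standard_normal(n)  # true effect is zero by construction
        b = rng.standard_normal(n)
        d = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        df = 2 * n - 2
        p_pos = stats.t.sf((swc - d) / se, df)    # "chance of substantial +"
        p_neg = stats.t.cdf((-swc - d) / se, df)  # "chance of substantial -"
        claims += (p_pos >= threshold) or (p_neg >= threshold)
    return claims / n_sim
```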


Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived both under the null hypothesis and under alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and extensions of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and a discussion.
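The shape of such a statistic can be sketched directly: estimate the probability generating function (PGF) nonparametrically and measure its squared distance from a model-implied PGF over the unit square. The `null_pgf` interface below is hypothetical (in the paper this estimator is semiparametric), and the integral is approximated on a grid; critical values would come from the parametric bootstrap, as in the abstract.

```python
import numpy as np

def empirical_pgf(x, u):
    """Nonparametric PGF estimate: (1/n) * sum_t u1^X1t * u2^X2t.

    x: (n, 2) array of bivariate counts; u: (m, 2) points in [0, 1]^2.
    """
    return np.mean(u[:, None, 0] ** x[None, :, 0]
                   * u[:, None, 1] ** x[None, :, 1], axis=1)

def l2_pgf_statistic(x, null_pgf, m=32):
    """n times a Riemann approximation of the integrated squared difference
    between the empirical PGF and a model-implied PGF over [0, 1]^2.

    null_pgf(u) -> PGF values under the fitted null (hypothetical interface).
    """
    grid = np.linspace(0.0, 1.0, m)
    g1, g2 = np.meshgrid(grid, grid)
    u = np.column_stack([g1.ravel(), g2.ravel()])
    diff = empirical_pgf(x, u) - null_pgf(u)
    return len(x) * np.mean(diff ** 2)
```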


2019 ◽  
Vol 44 (3) ◽  
pp. 167-181 ◽  
Author(s):  
Wenchao Ma

Limited-information fit measures appear promising for assessing the goodness-of-fit of dichotomous response cognitive diagnosis models (CDMs), but their performance has not been examined for polytomous response CDMs. This study investigates the performance of the Mord statistic and the standardized root mean square residual (SRMSR) for an ordinal response CDM, the sequential generalized deterministic inputs, noisy “and” gate model. Simulation studies showed that the Mord statistic had well-calibrated Type I error rates, but the correct detection rates were influenced by various factors such as item quality, sample size, and the number of response categories. In addition, the SRMSR was also influenced by many factors, and the common practice of comparing the SRMSR against a prespecified cut-off (e.g., .05) may not be appropriate. A real data set was also analyzed to illustrate the use of the Mord statistic and the SRMSR in practice.
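As context for the SRMSR, a common definition is the root mean square difference between observed and model-implied inter-item correlations; the sketch below implements that common form, which may differ in detail from the exact variant studied in the paper, and assumes the fitted CDM supplies the model-implied correlation matrix.

```python
import numpy as np

def srmsr(responses, model_corr):
    """SRMSR over item pairs (common definition).

    responses: (n_persons, n_items) item scores;
    model_corr: model-implied inter-item correlation matrix, assumed to
    be supplied by the fitted CDM (hypothetical input here).
    """
    r_obs = np.corrcoef(responses, rowvar=False)
    iu = np.triu_indices_from(r_obs, k=1)  # each item pair counted once
    return float(np.sqrt(np.mean((r_obs[iu] - model_corr[iu]) ** 2)))
```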


1992 ◽  
Vol 71 (1) ◽  
pp. 3-14 ◽  
Author(s):  
John E. Overall ◽  
Robert S. Atlas

A statistical model for combining p values from multiple tests of significance is used to define rejection and acceptance regions for two-stage and three-stage sampling plans. Type I error rates, power, frequencies of early termination decisions, and expected sample sizes are compared. Both the two-stage and three-stage procedures provide appropriate protection against Type I errors. The two-stage sampling plan with its single interim analysis entails minimal loss in power and provides substantial reduction in expected sample size as compared with a conventional single end-of-study test of significance for which power is in the adequate range. The three-stage sampling plan with its two interim analyses introduces somewhat greater reduction in power, but it compensates with greater reduction in expected sample size. Either interim-analysis strategy is more efficient than a single end-of-study analysis in terms of power per unit of sample size.
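The abstract does not spell out the combination rule, so as an illustration only, here is Fisher's method, the classic way to pool p values from multiple tests; the paper's model for defining stage-wise rejection and acceptance regions may differ.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p_i) ~ chi-squared with 2k df under H0."""
    p = np.asarray(p_values, dtype=float)
    stat = -2.0 * np.log(p).sum()
    return chi2.sf(stat, df=2 * len(p))

# Two-stage flavor: combine the interim and final p values, e.g.
# fisher_combined_p([0.04, 0.03]) is about 0.009, so evidence that is
# unconvincing at either look alone can still reject overall.
```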


2018 ◽  
Vol 28 (7) ◽  
pp. 2179-2195 ◽  
Author(s):  
Chieh Chiang ◽  
Chin-Fu Hsiao

Multiregional clinical trials have been accepted in recent years as a useful means of accelerating the development of new drugs and abridging their approval time. The statistical properties of multiregional clinical trials are being widely discussed. In practice, the variance of a continuous response may differ from region to region, which turns the assessment of the efficacy response into a Behrens–Fisher problem: there is no exact test or interval estimator for a mean difference with unequal variances. As a solution, this study applies interval estimations of the efficacy response based on Howe’s, Cochran–Cox’s, and Satterthwaite’s approximations, which have been shown to have well-controlled Type I error rates. However, traditional sample size determination cannot be applied to these interval estimators, so a sample size determination that achieves a desired power based on them is presented. Moreover, the consistency criteria suggested by the Japanese Ministry of Health, Labour and Welfare guidance were also applied, using the proposed interval estimation, to decide whether the overall results from the multiregional clinical trial can be accepted. A real example is used to illustrate the proposed method. The results of simulation studies indicate that the proposed method can correctly determine the required sample size and evaluate the assurance probability of the consistency criteria.
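Of the three approximations named, Satterthwaite's is the most familiar; a minimal sketch of the corresponding Welch-Satterthwaite interval for a mean difference with unequal variances is below (standard textbook form, not the paper's full procedure).

```python
import numpy as np
from scipy.stats import t

def satterthwaite_ci(x, y, alpha=0.05):
    """Welch-Satterthwaite CI for mean(x) - mean(y) with unequal variances."""
    nx, ny = len(x), len(y)
    vx = np.var(x, ddof=1) / nx
    vy = np.var(y, ddof=1) / ny
    # Satterthwaite's approximate degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    half = t.ppf(1 - alpha / 2, df) * np.sqrt(vx + vy)
    diff = np.mean(x) - np.mean(y)
    return diff - half, diff + half
```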


2016 ◽  
Vol 2016 ◽  
pp. 1-8 ◽  
Author(s):  
Elahe Allahyari ◽  
Peyman Jafari ◽  
Zahra Bagheri

Objective. The present study uses simulated data to determine the optimal number of response categories for achieving adequate power in the ordinal logistic regression (OLR) model for differential item functioning (DIF) analysis in psychometric research. Methods. A hypothetical ten-item quality of life scale with three, four, and five response categories was simulated. The power and Type I error rates of the OLR model for detecting uniform DIF were investigated under different combinations of ability distribution (θ), sample size, sample size ratio, and the magnitude of uniform DIF across reference and focal groups. Results. When θ was distributed identically in the reference and focal groups, increasing the number of response categories from 3 to 5 increased the power of the OLR model for detecting uniform DIF by approximately 8%. The power of OLR was less than 0.36 when the ability distributions in the reference and focal groups were highly skewed to the left and right, respectively. Conclusions. The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR is five. However, the impact of the number of response categories on detecting DIF was lower than might be expected.
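The OLR approach to uniform DIF amounts to a likelihood-ratio test of a group term in a proportional-odds model that already controls for the matching variable. A minimal sketch, assuming statsmodels' OrderedModel and plain numpy arrays for the inputs:

```python
import numpy as np
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

def uniform_dif_lr_test(item, total, group):
    """Likelihood-ratio test for uniform DIF on one ordinal item.

    item: ordinal responses; total: matching variable (e.g. total score);
    group: 0 = reference, 1 = focal. Compares proportional-odds models
    with and without the group term; 1 df chi-squared under no DIF.
    """
    reduced = OrderedModel(item, total[:, None],
                           distr="logit").fit(method="bfgs", disp=False)
    full = OrderedModel(item, np.column_stack([total, group]),
                        distr="logit").fit(method="bfgs", disp=False)
    lr = 2.0 * (full.llf - reduced.llf)
    return lr, chi2.sf(lr, df=1)
```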


2020 ◽  
Vol 57 (2) ◽  
pp. 237-251
Author(s):  
Achilleas Anastasiou ◽  
Alex Karagrigoriou ◽  
Anastasios Katsileros

Summary. The normal distribution is considered one of the most important distributions, with numerous applications in various fields, including the agricultural sciences. The purpose of this study is to evaluate the most popular normality tests, comparing their performance in terms of size (Type I error) and power against a large spectrum of alternative distributions, with simulations for various sample sizes and significance levels, as well as with empirical data from agricultural experiments. The simulation results show that the power of all normality tests is low for small sample sizes but increases with the sample size. The results also show that the Shapiro–Wilk test is powerful over a wide range of alternative distributions and sample sizes, especially for asymmetric distributions. Moreover, the D’Agostino–Pearson Omnibus test is powerful against symmetric alternative distributions for small sample sizes, while the same is true of the Kurtosis test for moderate and large sample sizes.
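Size and power comparisons of this kind are straightforward to reproduce by simulation; the sketch below uses the scipy implementations of the three tests named above and reports rejection rates (size when the sampler is normal, power otherwise).

```python
import numpy as np
from scipy import stats

def rejection_rates(sampler, n=30, alpha=0.05, n_sim=5000, seed=1):
    """Rejection rate of three normality tests for data drawn by `sampler`."""
    rng = np.random.default_rng(seed)
    tests = {"Shapiro-Wilk": stats.shapiro,
             "D'Agostino-Pearson": stats.normaltest,
             "Kurtosis": stats.kurtosistest}
    hits = dict.fromkeys(tests, 0)
    for _ in range(n_sim):
        x = sampler(rng, n)
        for name, test in tests.items():
            hits[name] += test(x).pvalue < alpha
    return {name: count / n_sim for name, count in hits.items()}

# size:  rejection_rates(lambda rng, n: rng.standard_normal(n))
# power vs a skewed alternative:
#        rejection_rates(lambda rng, n: rng.exponential(size=n))
```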


2019 ◽  
Vol 2019 ◽  
pp. 1-8 ◽  
Author(s):  
Can Ateş ◽  
Özlem Kaymaz ◽  
H. Emre Kale ◽  
Mustafa Agah Tekindal

In this study, we investigate how Wilks’ lambda, Pillai’s trace, Hotelling’s trace, and Roy’s largest root test statistics are affected when the normality and homogeneous-variance assumptions of the MANOVA method are violated; in other words, the robustness of the tests is examined. For this purpose, a simulation study is conducted under different scenarios. For different numbers of variables and different sample sizes, considering group variances that are homogeneous (σ1² = σ2² = ⋯ = σg²) and heterogeneous (increasing, σ1² < σ2² < ⋯ < σg²), random numbers are generated from Gamma(4-4-4; 0.5), Gamma(4-9-36; 0.5), Student’s t(2), and Normal(0; 1) distributions. Furthermore, balanced and unbalanced numbers of observations in the groups are also taken into account. After 10000 repetitions, Type I error rates are calculated for each test at α = 0.05. For the Gamma distribution, Pillai’s trace gives the most robust results under both homogeneous and heterogeneous variances with 2 variables; with 3 variables, Roy’s largest root is most robust in balanced samples and Pillai’s trace in unbalanced samples. For Student’s t distribution, Pillai’s trace gives the most robust results under homogeneous variance and Wilks’ lambda under heterogeneous variance. For the normal distribution under homogeneous variance, Roy’s largest root gives relatively more robust results for 2 variables and Wilks’ lambda for 3 variables; under heterogeneous variance, Roy’s largest root gives robust results for both 2 and 3 variables. Overall, the MANOVA test statistics are affected by violations of the homogeneity of covariance matrices and normality assumptions, particularly with unbalanced numbers of observations.
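All four statistics are functions of the eigenvalues of E⁻¹H, where H and E are the between-group and within-group SSCP matrices; a minimal sketch of just these statistics (without the F approximations or p values) is below.

```python
import numpy as np

def manova_statistics(groups):
    """Wilks' lambda, Pillai's trace, Hotelling's trace, Roy's largest root.

    groups: list of (n_i, p) arrays, one per group. Computed from the
    eigenvalues of E^-1 H, with H and E the between- and within-group
    sums-of-squares-and-cross-products matrices.
    """
    all_x = np.vstack(groups)
    grand = all_x.mean(axis=0)
    E = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in groups)
    H = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand)
            for g in groups)
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    return {"Wilks lambda": float(np.prod(1.0 / (1.0 + eig))),
            "Pillai trace": float(np.sum(eig / (1.0 + eig))),
            "Hotelling trace": float(np.sum(eig)),
            "Roy largest root": float(np.max(eig))}
```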

