Is Differential Noneffortful Responding Associated With Type I Error in Measurement Invariance Testing?

2021
pp. 001316442199042
Author(s):  
Joseph A. Rios

Low test-taking effort is a common validity threat when examinees perceive an assessment context to have minimal personal value. Prior research has shown that in such contexts, subgroups may differ in their effort, which raises two concerns when making subgroup mean comparisons. First, it is unclear how differential effort could influence evaluations of scale property equivalence. Second, even when full scalar invariance is attained, the degree to which differential effort can bias subgroup mean comparisons is unknown. To address these issues, a simulation study was conducted to examine the influence of differential noneffortful responding (NER) on evaluations of measurement invariance and latent mean comparisons. Results showed that as differential rates of NER grew, increased Type I errors in measurement invariance testing were observed only at the metric invariance level, while no negative effects were apparent for configural or scalar invariance. When full scalar invariance was correctly attained, differential NER led to bias in mean score comparisons as large as 0.18 standard deviations at a differential NER rate of 7%. These findings suggest that test users should evaluate and document potential differential NER before both conducting measurement quality analyses and reporting disaggregated subgroup mean performance.
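The mechanism behind this bias is easy to demonstrate. Below is a minimal sketch, not the article's simulation design: two subgroups have identical true ability, but a small share of one subgroup responds noneffortfully (random responses near chance), which by itself produces a spurious observed mean difference. All parameters are illustrative assumptions.

```python
# Minimal sketch (not the article's simulation design) of how differential
# noneffortful responding (NER) biases an observed subgroup mean comparison.
# Item count, response probabilities, and the 7% NER rate are assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_per_group, n_items = 1000, 30
p_effortful = 0.65       # assumed probability of answering correctly when trying
p_guessing = 0.25        # assumed chance level when responding noneffortfully
differential_ner = 0.07  # 7% of group B responds noneffortfully; 0% of group A

def simulate_total_scores(n: int, ner_rate: float) -> np.ndarray:
    """Total scores when a ner_rate fraction of examinees guess on every item."""
    n_ner = round(n * ner_rate)
    effortful = rng.binomial(n_items, p_effortful, size=n - n_ner)
    noneffortful = rng.binomial(n_items, p_guessing, size=n_ner)
    return np.concatenate([effortful, noneffortful])

group_a = simulate_total_scores(n_per_group, 0.0)
group_b = simulate_total_scores(n_per_group, differential_ner)

pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(f"Spurious standardized mean difference: d = {d:.2f}")  # nonzero despite equal true ability
```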


2011
Vol 72 (3)
pp. 469-492
Author(s):
Eun Sook Kim,
Myeongsun Yoon,
Taehun Lee

Multiple-indicators multiple-causes (MIMIC) modeling is often used to test a latent group mean difference while assuming the equivalence of factor loadings and intercepts over groups. However, this study demonstrated that MIMIC was insensitive to the presence of factor loading noninvariance, which implies that factor loading invariance should be tested through other measurement invariance testing techniques. MIMIC modeling is also used for measurement invariance testing by allowing a direct path from a grouping covariate to each observed variable. This simulation study with both continuous and categorical variables investigated the performance of MIMIC in detecting noninvariant variables under various study conditions and showed that the likelihood ratio test of MIMIC with Oort adjustment not only controlled Type I error rates below the nominal level but also maintained high power across study conditions.
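For readers unfamiliar with the test being evaluated, the sketch below illustrates the chi-square difference (likelihood ratio) logic of the MIMIC approach to invariance testing: a model with a direct path from the grouping covariate to one indicator is compared against a model with that path fixed to zero. The model strings use lavaan-style syntax, and the chi-square values are made-up placeholders rather than results from the study; the article additionally applies Oort's adjustment to the critical value, which is not reproduced here.

```python
# Hedged sketch of the MIMIC likelihood-ratio test for a single indicator.
# The model strings are lavaan-style illustrations shown only to document the
# specification; the fit statistics below are invented placeholders.
from scipy import stats

constrained_model = """
factor =~ x1 + x2 + x3 + x4 + x5 + x6
factor ~ group            # latent mean difference captured by the covariate
"""
# Freeing a direct path from the covariate to x3 tests x3 for noninvariance.
free_model = constrained_model + "x3 ~ group\n"

chi2_constrained, df_constrained = 61.3, 9   # placeholder fit of the constrained model
chi2_free, df_free = 54.8, 8                 # placeholder fit with the direct path freed

delta_chi2 = chi2_constrained - chi2_free
delta_df = df_constrained - df_free
p_value = stats.chi2.sf(delta_chi2, delta_df)
print(f"LRT for the direct path to x3: chi2({delta_df}) = {delta_chi2:.1f}, p = {p_value:.3f}")
```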


2016
Vol 32 (4)
pp. 265-272
Author(s):
Mohsen Joshanloo,
Ali Bakhshi

Abstract. This study investigated the factor structure and measurement invariance of Mroczek and Kolarz's scales of positive and negative affect in Iran (N = 2,391) and the USA (N = 2,154), and across gender groups. The two-factor model of affect was supported across the groups. Measurement invariance testing confirmed full metric and partial scalar invariance of the scales across cultural groups, and full metric and full scalar invariance across gender groups. Latent mean analysis revealed that Iranians scored lower on positive affect and higher on negative affect than Americans. The analyses also showed that American men scored significantly lower than American women on negative affect. The significance and implications of the results are discussed.


1988
Vol 13 (3)
pp. 215-226
Author(s):
H. J. Keselman,
Joanne C. Keselman

Two Tukey multiple comparison procedures, as well as a Bonferroni and a multivariate approach, were compared for their Type I error rates and any-pairs power when multisample sphericity was not satisfied and the design was unbalanced. Pairwise comparisons of unweighted and weighted repeated measures means were computed. Results indicated that heterogeneous covariance matrices in combination with unequal group sizes produced substantially inflated Type I error rates for all MCPs involving comparisons of unweighted means. For tests of weighted means, both the Bonferroni and a multivariate critical value limited the number of Type I errors; however, the Bonferroni procedure provided a more powerful test, particularly when the number of repeated measures treatment levels was large.
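As a point of reference for the Bonferroni approach discussed above, here is a minimal sketch of Bonferroni-protected pairwise comparisons of repeated measures means. It is a generic illustration with invented data, not the authors' procedure for weighted means under multisample nonsphericity.

```python
# Generic sketch of Bonferroni-protected pairwise comparisons of repeated
# measures means; the data and effect sizes are invented for illustration.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_levels = 30, 4
# Repeated measures with assumed level means 0.0, 0.2, 0.2, and 0.5
data = rng.normal(loc=[0.0, 0.2, 0.2, 0.5], scale=1.0, size=(n_subjects, n_levels))

pairs = list(itertools.combinations(range(n_levels), 2))
alpha_per_test = 0.05 / len(pairs)  # Bonferroni: split the familywise alpha across comparisons

for i, j in pairs:
    t_stat, p_val = stats.ttest_rel(data[:, i], data[:, j])
    verdict = "significant" if p_val < alpha_per_test else "ns"
    print(f"level {i + 1} vs {j + 1}: t = {t_stat:.2f}, p = {p_val:.4f} ({verdict})")
```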


Author(s):  
C. Y. Fu,
J. R. Tsay

Because the land surface changes both naturally and through human activity, DEMs must be updated continually so that applications can use the most current elevation data. However, the cost of wide-area DEM production is high. DEMs that cover the same area but differ in quality, grid size, generation time, or production method are called multi-source DEMs, and fusing them offers a low-cost solution for DEM updating. The DEM coverage first has to be classified by slope and visibility, because the precision of DEM grid points differs across areas with different slopes and visibilities. Next, a difference DEM (dDEM) is computed by subtracting one DEM from the other. Under the assumption that a dDEM containing only random error follows a normal distribution, a Student test is applied for blunder detection, producing three kinds of rejected grid points. The first kind consists of blunders, which are eliminated. The second kind lies in change areas, where the most recent data are taken as the fusion result. The third kind, grid points rejected through Type I error, are actually correct and are retained for fusion. The experiment shows that applying terrain classification to the DEMs yields better blunder detection, and that a proper choice of significance level (α) detects real blunders without producing too many Type I errors. Weighted averaging is chosen as the DEM fusion algorithm, with weights defined from the a priori precisions estimated by our national DEM production guideline. Fisher's test confirms that these a priori precisions correspond to the RMSEs of the blunder detection result.
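A rough sketch of the core of this workflow follows: difference the two DEMs, flag cells whose difference exceeds what the assumed a priori precisions allow, and fuse the remaining cells by precision-weighted averaging. It is a simplification under stated assumptions (no terrain classification, a normal-based threshold standing in for the Student test, made-up precisions and grid), not the authors' implementation.

```python
# Simplified sketch of dDEM-based blunder detection and weighted-average DEM
# fusion. Grid size, a priori precisions, thresholds, and the injected blunders
# are all illustrative assumptions; terrain classification is omitted.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
shape = (100, 100)
true_terrain = rng.normal(500.0, 50.0, size=shape)       # synthetic "true" elevations (m)

sigma_old, sigma_new = 1.5, 0.8                           # assumed a priori precisions (m)
dem_old = true_terrain + rng.normal(0.0, sigma_old, size=shape)
dem_new = true_terrain + rng.normal(0.0, sigma_new, size=shape)
dem_old[10:15, 10:15] += 25.0                             # inject a patch of blunders

# Difference DEM; if it contains only random error it should follow a normal
# distribution with this standard deviation.
ddem = dem_new - dem_old
sigma_ddem = np.hypot(sigma_old, sigma_new)
alpha = 0.05
threshold = stats.norm.ppf(1.0 - alpha / 2.0) * sigma_ddem
rejected = np.abs(ddem) > threshold                       # blunders and/or change areas

# Precision-weighted averaging for accepted cells; rejected cells keep the
# newer DEM, treating them as blunders in the old DEM or as real change.
w_old, w_new = 1.0 / sigma_old**2, 1.0 / sigma_new**2
fused = (w_old * dem_old + w_new * dem_new) / (w_old + w_new)
fused[rejected] = dem_new[rejected]

rmse = np.sqrt(np.mean((fused - true_terrain) ** 2))
print(f"rejected cells: {int(rejected.sum())}, fusion RMSE: {rmse:.2f} m")
```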


Methodology
2016
Vol 12 (2)
pp. 44-51
Author(s):
José Manuel Caperos,
Ricardo Olmos,
Antonio Pardo

Abstract. Correlation analysis is one of the most widely used methods for testing hypotheses in the social and health sciences; however, its use is not error free. We explored the frequency of inconsistencies between reported p-values and the associated test statistics in 186 papers published in four Spanish journals of psychology (1,950 correlation tests); we also collected information about the use of one- versus two-tailed tests in the presence of directional hypotheses, and about the use of adjustments to control Type I errors arising from simultaneous inference. Of the reported correlation tests, 83.8% are incomplete and 92.5% include an inexact p-value. Gross inconsistencies, which are liable to alter the statistical conclusions, appear in 4% of the reviewed tests, and 26.9% of the inconsistencies found were large enough to bias the results of a meta-analysis. The use of one-tailed tests and of adjustments to control the Type I error rate was negligible. We therefore urge authors, reviewers, and editorial boards to pay particular attention to this issue in order to prevent inconsistencies in statistical reports.
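The consistency check at the heart of such reviews can be automated: given a reported r and sample size, the implied p-value is recomputed through the t transformation of r and compared with the reported p-value. The sketch below uses invented example entries and an arbitrary tolerance, not the authors' screening rules.

```python
# Sketch of recomputing the p-value implied by a reported correlation and
# comparing it with the reported p-value. Example entries and the tolerance
# are invented for illustration.
import numpy as np
from scipy import stats

def p_from_r(r: float, n: int) -> float:
    """Two-tailed p-value for H0: rho = 0, via t = r * sqrt((n - 2) / (1 - r^2))."""
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(abs(t), df=n - 2)

# (reported r, reported n, reported p) -- made-up examples
reported = [(0.21, 120, 0.02), (0.35, 45, 0.04)]
for r, n, p_reported in reported:
    p_recomputed = p_from_r(r, n)
    consistent = abs(p_recomputed - p_reported) < 0.005   # arbitrary tolerance
    print(f"r = {r}, n = {n}: recomputed p = {p_recomputed:.3f}, reported p = {p_reported} "
          f"-> {'consistent' if consistent else 'inconsistent'}")
```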


2020
Vol 43 (3)
pp. 605-616
Author(s):
Marc J. Lanovaz,
Stéphanie Turgeon

Abstract. Design quality guidelines typically recommend that multiple baseline designs include at least three demonstrations of effect. Despite its widespread adoption, this recommendation does not appear to be grounded in empirical evidence. The main purpose of our study was to address this issue by assessing Type I error rate and power in multiple baseline designs. First, we generated 10,000 multiple baseline graphs, applied the dual-criteria method to each tier, and computed the Type I error rate and power for different numbers of tiers showing a clear change. Second, two raters categorized the tiers of 300 multiple baseline graphs to replicate our analyses using visual inspection. When multiple baseline designs had at least three tiers and two or more of these tiers showed a clear change, the Type I error rate remained adequate (< .05) while power also reached acceptable levels (> .80). In contrast, requiring all tiers to show a clear change resulted in overly stringent conclusions (i.e., unacceptably low power). Therefore, our results suggest that researchers and practitioners should carefully consider limitations in power when requiring all tiers of a multiple baseline design to show a clear change in their analyses.
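To make the procedure concrete, here is a hedged sketch of the dual-criteria method applied to a single tier (after Fisher, Kelley, & Lomas, 2003): two criterion lines, the baseline mean and the baseline OLS trend, are projected into the treatment phase, and the number of treatment points exceeding both lines is compared with a binomial-based criterion. The data and the way the required count is derived here are illustrative assumptions, not the authors' code or exact decision rules.

```python
# Hedged sketch of the dual-criteria (DC) method for one tier of a multiple
# baseline graph. The data are invented; the binomial derivation of the
# required count is an assumption standing in for the published criterion tables.
import numpy as np
from scipy import stats

baseline = np.array([3.0, 4.0, 3.5, 4.5, 4.0])
treatment = np.array([5.0, 6.5, 6.0, 7.0, 7.5, 8.0])

# Criterion lines fitted to baseline only, projected across the treatment phase.
t_base = np.arange(len(baseline))
slope, intercept, *_ = stats.linregress(t_base, baseline)
t_treat = np.arange(len(baseline), len(baseline) + len(treatment))
mean_line = np.full_like(treatment, baseline.mean())
trend_line = intercept + slope * t_treat

# Count treatment points above BOTH lines (assuming an expected increase).
n_above_both = int(np.sum((treatment > mean_line) & (treatment > trend_line)))

# Smallest count k with P(X >= k | n, p = .5) < .05; the default n + 1 means
# the criterion is unattainable when the phase is too short.
n = len(treatment)
k_required = next((k for k in range(n + 1) if stats.binom.sf(k - 1, n, 0.5) < 0.05), n + 1)

clear_change = n_above_both >= k_required
print(f"{n_above_both}/{n} treatment points above both lines; "
      f"{k_required} required -> {'clear change' if clear_change else 'no clear change'}")
```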


1994
Vol 19 (2)
pp. 119-126
Author(s):
Ru San Chen,
William P. Dunlap

Lecoutre (1991) pointed out an error in the Huynh and Feldt (1976) formula for ɛ̃, which is used to adjust the degrees of freedom of an approximate test in repeated measures designs with two or more independent groups. The present simulation study confirms that Lecoutre's corrected ɛ̃ yields less biased estimation of the population ɛ and reduces Type I error rates compared with Huynh and Feldt's (1976) ɛ̃. The gain in Type I error accuracy for group-by-treatment interactions may become substantial when sample sizes are close to the number of treatment levels.
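For concreteness, here is a small sketch of the two formulas as commonly presented (verify against Huynh & Feldt, 1976, and Lecoutre, 1991, before relying on it): the original Huynh-Feldt ɛ̃ and Lecoutre's correction, which replaces N with N − g + 1 in the numerator, both computed from the Greenhouse-Geisser ɛ̂.

```python
# Sketch of the Huynh-Feldt epsilon-tilde and Lecoutre's (1991) correction,
# computed from the Greenhouse-Geisser epsilon-hat. Formulas as commonly
# presented; check the original sources before relying on them.
import numpy as np

def gg_epsilon(cov: np.ndarray) -> float:
    """Greenhouse-Geisser epsilon-hat from the k x k covariance matrix of the measures."""
    k = cov.shape[0]
    # Double-center the covariance matrix (project out the grand-mean structure).
    centered = cov - cov.mean(axis=0, keepdims=True) - cov.mean(axis=1, keepdims=True) + cov.mean()
    return np.trace(centered) ** 2 / ((k - 1) * np.sum(centered ** 2))

def hf_epsilon(eps_gg: float, n_total: int, n_groups: int, k: int) -> float:
    """Huynh & Feldt (1976) epsilon-tilde as originally published (N in the numerator)."""
    num = n_total * (k - 1) * eps_gg - 2.0
    den = (k - 1) * (n_total - n_groups - (k - 1) * eps_gg)
    return min(num / den, 1.0)

def hf_epsilon_lecoutre(eps_gg: float, n_total: int, n_groups: int, k: int) -> float:
    """Lecoutre's correction: N replaced by (N - g + 1) in the numerator."""
    num = (n_total - n_groups + 1) * (k - 1) * eps_gg - 2.0
    den = (k - 1) * (n_total - n_groups - (k - 1) * eps_gg)
    return min(num / den, 1.0)

# Illustrative data: 24 subjects, 3 groups, 4 repeated measures. For a real
# multi-group design the pooled within-group covariance matrix would be used.
rng = np.random.default_rng(7)
k, n_total, n_groups = 4, 24, 3
scores = rng.multivariate_normal(np.zeros(k), np.eye(k) + 0.3, size=n_total)
eps_hat = gg_epsilon(np.cov(scores, rowvar=False))
print(f"GG = {eps_hat:.3f}, HF = {hf_epsilon(eps_hat, n_total, n_groups, k):.3f}, "
      f"Lecoutre = {hf_epsilon_lecoutre(eps_hat, n_total, n_groups, k):.3f}")
```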

