Extended Multivariate Generalizability Theory With Complex Design Structures

2021 ◽  
pp. 001316442110497
Author(s):  
Robert L. Brennan ◽  
Stella Y. Kim ◽  
Won-Chan Lee

This article extends multivariate generalizability theory (MGT) to tests with different random-effects designs for each level of a fixed facet. There are numerous situations in which the design of a test and the resulting data structure are not definable by a single design. One example is mixed-format tests that are composed of multiple-choice and free-response items, with the latter involving variability attributable to both items and raters. In this case, two distinct designs are needed to fully characterize the design and capture potential sources of error associated with each item format. Another example involves tests containing both testlets and one or more stand-alone sets of items. Testlet effects need to be taken into account for the testlet-based items, but not the stand-alone sets of items. This article presents an extension of MGT that faithfully models such complex test designs, along with two real-data examples. Among other things, these examples illustrate that estimates of error variance, error–tolerance ratios, and reliability-like coefficients can be biased if there is a mismatch between the user-specified universe of generalization and the complex nature of the test.
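
For reference, the composite quantities the abstract refers to can be sketched in standard multivariate G-theory notation; the two-level composite, the weights $w_1$ and $w_2$, and the assumption of uncorrelated relative errors across formats are illustrative choices, not details taken from the article:

\[
\sigma^2_\tau(C) = w_1^2\,\sigma^2_\tau(v_1) + w_2^2\,\sigma^2_\tau(v_2) + 2\,w_1 w_2\,\sigma_\tau(v_1,v_2),
\qquad
\sigma^2_\delta(C) = w_1^2\,\sigma^2_\delta(v_1) + w_2^2\,\sigma^2_\delta(v_2),
\]
\[
\mathbf{E}\rho^2(C) = \frac{\sigma^2_\tau(C)}{\sigma^2_\tau(C) + \sigma^2_\delta(C)},
\qquad
E/T = \frac{\sigma_\delta(C)}{\sigma_\tau(C)}.
\]

Here $v_1$ and $v_2$ index the two levels of the fixed facet (e.g., multiple-choice and free-response). If the free-response level involves raters as well as items, its relative error variance $\sigma^2_\delta(v_2)$ contains rater-related components that the multiple-choice level does not, which is the kind of design mismatch the article addresses.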

2018 ◽  
Vol 42 (8) ◽  
pp. 595-612 ◽  
Author(s):  
Zhehan Jiang ◽  
Mark Raymond

Conventional methods for evaluating the utility of subscores rely on reliability and correlation coefficients. However, correlations can overlook a notable source of variability: variation in subtest means/difficulties. Brennan introduced a reliability index for score profiles based on multivariate generalizability theory, designated as [Formula: see text], which is sensitive to variation in subtest difficulty. However, there has been little, if any, research evaluating the properties of this index. A series of simulation experiments and analyses of real data were conducted to investigate [Formula: see text] under various conditions of subtest reliability, subtest correlations, and variability in subtest means. Three pilot studies evaluated [Formula: see text] in the context of a single group of examinees. Results of the pilots indicated that [Formula: see text] indices were typically low; across the 108 experimental conditions, [Formula: see text] ranged from .23 to .86, with an overall mean of .63. The findings were consistent with previous research, indicating that subscores often do not have interpretive value. Importantly, there were many conditions for which the correlation-based method known as proportion reduction in mean-square error (PRMSE; Haberman, 2006) indicated that subscores were worth reporting, but for which values of [Formula: see text] fell into the .50s, .60s, and .70s. The main study investigated [Formula: see text] within the context of score profiles for examinee subgroups. Again, not only were [Formula: see text] indices generally low, but [Formula: see text] was also found to be sensitive to subgroup differences when PRMSE is not. Analyses of real data and subsequent discussion address how [Formula: see text] can supplement PRMSE for characterizing the quality of subscores.
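
For readers unfamiliar with Haberman's criterion mentioned above, the comparison can be written out as follows (the notation is illustrative): the observed subscore $S$ is judged worth reporting only if it predicts the true subscore $S_\tau$ better than the observed total score $X$ does,

\[
\mathrm{PRMSE}_S = \rho^2(S, S_\tau) \quad\text{(the subscore reliability)},
\qquad
\mathrm{PRMSE}_X = \rho^2(X, S_\tau),
\]

with added value claimed when $\mathrm{PRMSE}_S > \mathrm{PRMSE}_X$. Both quantities are functions of correlations and reliabilities alone, which is why a profile-based index that is sensitive to differences among subtest means can reach a different conclusion.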


2019 ◽  
Vol 80 (1) ◽  
pp. 67-90
Author(s):  
Mark R. Raymond ◽  
Zhehan Jiang

Conventional methods for evaluating the utility of subscores rely on traditional indices of reliability and on correlations among subscores. One limitation of correlational methods is that they do not explicitly consider variation in subtest means. An exception is an index of score profile reliability designated as [Formula: see text], which quantifies the ratio of true score profile variance to observed score profile variance. [Formula: see text] has been shown to be more sensitive than correlational methods to group differences in score profile utility. However, it is a group average, representing the expected value over a population of examinees. Just as score reliability varies across individuals and subgroups, one can expect that the reliability of score profiles will vary across examinees. This article proposes two conditional indices of score profile utility grounded in multivariate generalizability theory. The first is based on the ratio of observed profile variance to the profile variance that can be attributed to random error. The second quantifies the proportion of observed variability in a score profile that can be attributed to true score profile variance. The article describes the indices, illustrates their use with two empirical examples, and evaluates their properties with simulated data. The results suggest that the proposed estimators of profile error variance are consistent with the known error in simulated score profiles and that they provide information beyond that provided by traditional measures of subscore utility. The simulation study suggests that artificially large values of the indices could occur for about 5% to 8% of examinees. The article concludes by suggesting possible applications of the indices and discusses avenues for further research.
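
In illustrative notation (not necessarily the authors' symbols), the two conditional indices described above can be written for examinee $p$ in terms of within-person profile variances:

\[
I_{1,p} = \frac{v_p(\mathrm{obs})}{v_p(\mathrm{err})},
\qquad
I_{2,p} = \frac{v_p(\mathrm{obs}) - v_p(\mathrm{err})}{v_p(\mathrm{obs})},
\]

where $v_p(\mathrm{obs})$ is the variance of examinee $p$'s observed subscore deviations around their own mean and $v_p(\mathrm{err})$ is the profile variance expected from measurement error alone. A profile is flagged as interpretable when $I_{1,p}$ is large (equivalently, when $I_{2,p}$ approaches 1), that is, when its peaks and valleys exceed what error alone would produce.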


2009 ◽  
Vol 31 (1) ◽  
pp. 81
Author(s):  
Takeaki Kumazawa

Classical test theory (CTT) has been widely used to estimate the reliability of measurements. Generalizability theory (G theory), an extension of CTT, is a powerful statistical procedure, particularly useful for performance testing, because it enables estimation of the percentage of variance attributable to persons and to multiple sources of error. This study focuses on a generalizability study (G study) conducted to investigate such variance components for a paper-and-pencil multiple-choice vocabulary test used as a diagnostic pretest. Further, a decision study (D study) was conducted to compute the generalizability coefficient (G coefficient) for absolute decisions. The results of the G and D studies indicated that 46% of the total variance was due to the item effect, and the G coefficient for absolute decisions was low.
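
As a sketch of the quantities behind these G and D studies, assume the usual single-facet persons × items ($p \times i$) design, which the abstract implies but does not spell out. The G study partitions total variance into $\sigma^2_p$, $\sigma^2_i$, and $\sigma^2_{pi,e}$, and the D study projects the absolute-decision (dependability) coefficient, often written $\Phi$, for a test of $n'_i$ items:

\[
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_i + \sigma^2_{pi,e}}{n'_i}}.
\]

A large item component, here 46% of the total variance, inflates the absolute error term in the denominator, which is consistent with the low coefficient reported.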


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ali Khodi

The present study investigated factors that affect EFL writing scores using generalizability theory (G-theory). For this purpose, one hundred and twenty students completed one independent and one integrated writing task. Their performances were then scored by six raters: one self-rating, three peer ratings, and two instructor ratings. The main purpose of the study was to determine the relative and absolute contributions of different facets, such as student, rater, task, method of scoring, and background of education, to the validity of writing assessment scores. The results indicated three major sources of variance: (a) the student-by-task-by-method-of-scoring interaction (nested in background of education) (STM:B), contributing 31.8% of the total variance; (b) the student-by-rater-by-task-by-method-of-scoring interaction (nested in background of education) (SRTM:B), contributing 26.5%; and (c) the student-by-rater-by-method-of-scoring interaction (nested in background of education) (SRM:B), contributing 17.6%. With regard to the G coefficients in the G-study (relative G coefficient ≥ 0.86), the assessment results were found to be highly valid and reliable. The main sources of error variance were the student-by-rater interaction (nested in background of education) (SR:B) and the rater-by-background-of-education interaction, contributing 99.2% and 0.8% of the error variance, respectively. Additionally, ten separate G-studies were conducted to investigate the contribution of different facets with rater, task, and method of scoring as the differentiation facet. These studies suggested that peer rating, the analytical scoring method, and integrated writing tasks were the most reliable and generalizable designs for the writing assessments. Finally, five decision studies (D-studies) were conducted at the optimization level, indicating that at least four raters (G coefficient = 0.80) are necessary for a valid and reliable assessment. Based on these results, to achieve the greatest gain in generalizability, teachers should have their students take two writing assessments and have their performance rated with at least two scoring methods by at least four raters.
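
The rater D-studies summarized above can be sketched generically in Python. The sketch below uses the standard relative G coefficient for a simple persons × raters design rather than the article's full nested design, and the variance components (and the function name `relative_g`) are hypothetical placeholders, not the study's estimates:

```python
# Generic D-study sketch: project the relative G coefficient for a simple
# persons x raters (p x r) design as the number of raters increases.
# The variance components below are hypothetical, not taken from the study.

def relative_g(var_p: float, var_pr: float, n_raters: int) -> float:
    """Relative G coefficient: universe-score variance divided by itself
    plus relative error variance (p x r interaction averaged over raters)."""
    return var_p / (var_p + var_pr / n_raters)

var_person = 0.50        # hypothetical universe-score (person) variance
var_person_rater = 0.50  # hypothetical p x r interaction (plus residual) variance

for n in range(1, 7):
    print(f"{n} rater(s): G = {relative_g(var_person, var_person_rater, n):.2f}")
# With these placeholder components, G reaches 0.80 at four raters,
# mirroring the kind of conclusion a D-study supports.
```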


2017 ◽  
Author(s):  
Yu Li ◽  
Renmin Han ◽  
Chongwei Bi ◽  
Mo Li ◽  
Sheng Wang ◽  
...  

Motivation: Oxford Nanopore sequencing is a sequencing technology that has developed rapidly in recent years. To keep pace with the explosion of downstream data-analysis tools, a versatile Nanopore sequencing simulator is needed to complement experimental data and to benchmark newly developed tools. However, all currently available simulators are based on simple statistics of the produced reads and therefore have difficulty capturing the complex nature of the Nanopore sequencing procedure, whose core step is the generation of raw electrical current signals. Results: Here we propose a deep-learning-based simulator, DeepSimulator, that mimics the entire pipeline of Nanopore sequencing. Starting from a given reference genome or assembled contigs, we simulate the electrical current signals with a context-dependent deep learning model, followed by a base-calling procedure that yields simulated reads. This workflow mimics the sequencing procedure more naturally. Thorough experiments across four species show that the signals generated by our context-dependent model are more similar to experimentally obtained signals than those generated by the official context-independent pore model. For the simulated reads, we provide a parameter interface so that users can obtain reads with accuracies ranging from 83% to 97%. Reads generated with the default parameters have almost the same properties as real data. Two case studies demonstrate how DeepSimulator can benefit the development of tools for de novo assembly and for low-coverage SNP detection. Availability: The software can be accessed freely at https://github.com/lykaust15/deep_simulator.
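
The signal-generation step described above can be illustrated with a toy sketch. This is not DeepSimulator's code or API; the k-mer-to-current mapping below is a deterministic placeholder standing in for the learned context-dependent model, and all parameter values are arbitrary:

```python
# Toy illustration of context-dependent signal simulation (not DeepSimulator):
# slide a k-mer window over the reference, map each k-mer to a mean current
# level, let it dwell for several samples, and add Gaussian noise.
import hashlib
import numpy as np

def toy_signal(reference: str, k: int = 6, samples_per_kmer: int = 8,
               noise_sd: float = 1.5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)

    def kmer_level(kmer: str) -> float:
        # Deterministic stand-in for a learned pore model: roughly 80-120 pA.
        h = int(hashlib.md5(kmer.encode()).hexdigest(), 16)
        return 80.0 + (h % 400) / 10.0

    chunks = []
    for i in range(len(reference) - k + 1):
        mean = kmer_level(reference[i:i + k])
        chunks.append(rng.normal(mean, noise_sd, size=samples_per_kmer))
    return np.concatenate(chunks)

signal = toy_signal("ACGTTGCAACGT" * 20)
print(signal.shape)  # one noisy current chunk per k-mer window of the reference
```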


2021 ◽  
Vol 12 (1) ◽  
pp. 18
Author(s):  
Jennifer S Byrd ◽  
Michael J Peeters

Objective: There is a paucity of validation evidence for assessing clinical case presentations by Doctor of Pharmacy (PharmD) students. Within Kane's Framework for Validation, evidence for inferences of scoring and generalization should be generated first. Thus, our objectives were to characterize and improve scoring, as well as build initial generalization evidence, in order to provide validation evidence for performance-based assessment of clinical case presentations. Design: Third-year PharmD students worked up patient cases from a local hospital. Students orally presented and defended their therapeutic care plan to pharmacist preceptors (evaluators) and fellow students. Evaluators scored each presentation using an 11-item instrument with a 6-point rating scale. In addition, evaluators scored a global item with a 4-point rating scale. Rasch measurement was used for the scoring analysis, while generalizability theory was used for the generalization analysis. Findings: Thirty students each presented five cases that were evaluated by 15 preceptors using the 11-item instrument. Under Rasch measurement, the 11-item instrument's 6-point rating scale did not function adequately; it worked only after being collapsed to a 4-point rating scale. The revised 11-item instrument also showed redundancy. Alternatively, the global item performed reasonably on its own. Using multivariate generalizability theory, the G coefficient (reliability) for the series of five case presentations was 0.76 with the 11-item instrument and 0.78 with the global item. Reliability depended largely on the number of case presentations and, to a lesser extent, on the number of evaluators per case presentation. Conclusions: Our pilot results confirm that scoring should be kept simple, in both rating scale and instrument. More specifically, the longer 11-item instrument measured the construct but contained redundancy, whereas the single global item provided adequate measurement across multiple case presentations. Further, acceptable reliability can be achieved by balancing the number of case presentations against the number of evaluators.
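
The rating-scale diagnosis reported above can be made concrete with the Andrich rating scale model, assuming the analysis used this common Rasch parameterization (the abstract does not name the exact model). For person ability $\theta_n$, item difficulty $\delta_i$, and thresholds $\tau_1,\dots,\tau_K$,

\[
P(X_{ni}=k) = \frac{\exp\!\Big(\sum_{j=1}^{k} (\theta_n - \delta_i - \tau_j)\Big)}
{\sum_{m=0}^{K} \exp\!\Big(\sum_{j=1}^{m} (\theta_n - \delta_i - \tau_j)\Big)},
\qquad k = 0, 1, \dots, K,
\]

with the empty sum for $m = 0$ taken as zero. Disordered or poorly separated threshold estimates $\hat{\tau}_j$ indicate categories that raters do not use as intended, which is the usual evidence for collapsing a 6-point scale to 4 points, as reported here.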


1967 ◽  
Vol 21 (3) ◽  
pp. 1005-1013 ◽  
Author(s):  
Kenneth I. Howard ◽  
James A. Hill

In a Chain P-factor analysis, the elimination of between-person variance reduces the contribution of "true score" variance to the true-score to total-score variance ratio based on the reduced scores. Factors that emerge in such an analysis may unduly reflect the influence of error variance and demand caution in their interpretation. An expanded criterion of meaningfulness was proposed that contrasted an obtained solution with a randomly generated solution under the null hypothesis that independent judges could do no better than chance in distinguishing the real factors from the random ones. A Chain P-analysis of real data gathered from 45 female patients, tested after each of 10 successive psychotherapy sessions, was contrasted with a parallel analysis of Monte Carlo data. Four judges significantly discriminated the real factors from the random factors in a paired-comparison task. The results strengthened the credibility of the Chain P-analysis and established the usefulness of an expanded criterion of meaningfulness.


1993 ◽  
Vol 18 (2) ◽  
pp. 197-206 ◽  
Author(s):  
George A. Marcoulides

Generalizability theory provides a framework for examining the dependability of behavioral measurements. When designing generalizability studies, two important statistical issues are generally considered: power and measurement error. Control over power and error of measurement can be obtained by manipulation of sample size and/or test reliability. In generalizability theory, the mean error variance is an estimate that takes into account both these statistical issues. When limited resources are available, determining an optimal measurement design is not a simple task. This article presents a methodology for minimizing mean error variance in generalizability studies when resource constraints are imposed.
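
As a concrete illustration of the kind of resource-constrained problem described above (a brute-force sketch, not the article's procedure), the snippet below searches for the numbers of items and raters that minimize absolute error variance in a two-facet persons × items × raters design, subject to a cost ceiling. All variance components, unit costs, and the budget are hypothetical:

```python
# Brute-force search for a D-study design that minimizes absolute error
# variance under a cost constraint. All numbers are hypothetical placeholders.
from itertools import product

# Hypothetical G-study variance components for a p x i x r design
var = {"i": 0.20, "r": 0.05, "pi": 0.30, "pr": 0.10, "ir": 0.02, "pir,e": 0.25}
cost_item, cost_rater, budget = 2.0, 10.0, 100.0

def abs_error_variance(n_i: int, n_r: int) -> float:
    """Absolute error variance for the p x I x R random-effects D-study."""
    return (var["i"] / n_i + var["r"] / n_r + var["pi"] / n_i
            + var["pr"] / n_r + var["ir"] / (n_i * n_r)
            + var["pir,e"] / (n_i * n_r))

# Enumerate feasible designs within budget and keep the one with least error.
best = min(
    ((n_i, n_r) for n_i, n_r in product(range(1, 51), range(1, 11))
     if n_i * cost_item + n_r * cost_rater <= budget),
    key=lambda d: abs_error_variance(*d),
)
print("best (n_items, n_raters):", best,
      "abs error variance:", round(abs_error_variance(*best), 4))
```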

