Extended Multivariate Generalizability Theory With Complex Design Structures

2021 ◽  
pp. 001316442110497
Author(s):  
Robert L. Brennan ◽  
Stella Y. Kim ◽  
Won-Chan Lee

This article extends multivariate generalizability theory (MGT) to tests with different random-effects designs for each level of a fixed facet. There are numerous situations in which the design of a test and the resulting data structure are not definable by a single design. One example is mixed-format tests that are composed of multiple-choice and free-response items, with the latter involving variability attributable to both items and raters. In this case, two distinct designs are needed to fully characterize the design and capture potential sources of error associated with each item format. Another example involves tests containing both testlets and one or more stand-alone sets of items. Testlet effects need to be taken into account for the testlet-based items, but not the stand-alone sets of items. This article presents an extension of MGT that faithfully models such complex test designs, along with two real-data examples. Among other things, these examples illustrate that estimates of error variance, error–tolerance ratios, and reliability-like coefficients can be biased if there is a mismatch between the user-specified universe of generalization and the complex nature of the test.
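
For reference, the composite quantities the abstract refers to can be sketched in standard multivariate G-theory notation; the two-level composite, the weights $w_1$ and $w_2$, and the assumption of uncorrelated relative errors across formats are illustrative choices, not details taken from the article:

\[
\sigma^2_\tau(C) = w_1^2\,\sigma^2_\tau(v_1) + w_2^2\,\sigma^2_\tau(v_2) + 2\,w_1 w_2\,\sigma_\tau(v_1,v_2),
\qquad
\sigma^2_\delta(C) = w_1^2\,\sigma^2_\delta(v_1) + w_2^2\,\sigma^2_\delta(v_2),
\]
\[
\mathbf{E}\rho^2(C) = \frac{\sigma^2_\tau(C)}{\sigma^2_\tau(C) + \sigma^2_\delta(C)},
\qquad
E/T = \frac{\sigma_\delta(C)}{\sigma_\tau(C)}.
\]

Here $v_1$ and $v_2$ index the two levels of the fixed facet (e.g., multiple-choice and free-response). If the free-response level involves raters as well as items, its relative error variance $\sigma^2_\delta(v_2)$ contains rater-related components that the multiple-choice level does not, which is the kind of design mismatch the article addresses.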

2018 ◽  
Vol 42 (8) ◽  
pp. 595-612 ◽  
Author(s):  
Zhehan Jiang ◽  
Mark Raymond

Conventional methods for evaluating the utility of subscores rely on reliability and correlation coefficients. However, correlations can overlook a notable source of variability: variation in subtest means/difficulties. Brennan introduced a reliability index for score profiles based on multivariate generalizability theory, designated as [Formula: see text], which is sensitive to variation in subtest difficulty. However, there has been little, if any, research evaluating the properties of this index. A series of simulation experiments and analyses of real data were conducted to investigate [Formula: see text] under various conditions of subtest reliability, subtest correlations, and variability in subtest means. Three pilot studies evaluated [Formula: see text] in the context of a single group of examinees. Results of the pilots indicated that [Formula: see text] indices were typically low; across the 108 experimental conditions, [Formula: see text] ranged from .23 to .86, with an overall mean of .63. The findings were consistent with previous research, indicating that subscores often do not have interpretive value. Importantly, there were many conditions for which the correlation-based method known as proportion reduction in mean-square error (PRMSE; Haberman, 2006) indicated that subscores were worth reporting, but for which values of [Formula: see text] fell into the .50s, .60s, and .70s. The main study investigated [Formula: see text] within the context of score profiles for examinee subgroups. Again, not only were [Formula: see text] indices generally low, but [Formula: see text] was also found to be sensitive to subgroup differences when PRMSE is not. Analyses of real data and subsequent discussion address how [Formula: see text] can supplement PRMSE for characterizing the quality of subscores.
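
For readers unfamiliar with Haberman's criterion mentioned above, the comparison can be written out as follows (the notation is illustrative): the observed subscore $S$ is judged worth reporting only if it predicts the true subscore $S_\tau$ better than the observed total score $X$ does,

\[
\mathrm{PRMSE}_S = \rho^2(S, S_\tau) \quad\text{(the subscore reliability)},
\qquad
\mathrm{PRMSE}_X = \rho^2(X, S_\tau),
\]

with added value claimed when $\mathrm{PRMSE}_S > \mathrm{PRMSE}_X$. Both quantities are functions of correlations and reliabilities alone, which is why a profile-based index that is sensitive to differences among subtest means can reach a different conclusion.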


2019 ◽  
Vol 80 (1) ◽  
pp. 67-90
Author(s):  
Mark R. Raymond ◽  
Zhehan Jiang

Conventional methods for evaluating the utility of subscores rely on traditional indices of reliability and on correlations among subscores. One limitation of correlational methods is that they do not explicitly consider variation in subtest means. An exception is an index of score profile reliability designated as [Formula: see text], which quantifies the ratio of true score profile variance to observed score profile variance. [Formula: see text] has been shown to be more sensitive than correlational methods to group differences in score profile utility. However, it is a group average, representing the expected value over a population of examinees. Just as score reliability varies across individuals and subgroups, one can expect that the reliability of score profiles will vary across examinees. This article proposes two conditional indices of score profile utility grounded in multivariate generalizability theory. The first is based on the ratio of observed profile variance to the profile variance that can be attributed to random error. The second quantifies the proportion of observed variability in a score profile that can be attributed to true score profile variance. The article describes the indices, illustrates their use with two empirical examples, and evaluates their properties with simulated data. The results suggest that the proposed estimators of profile error variance are consistent with the known error in simulated score profiles and that they provide information beyond that provided by traditional measures of subscore utility. The simulation study suggests that artificially large values of the indices could occur for about 5% to 8% of examinees. The article concludes by suggesting possible applications of the indices and discusses avenues for further research.
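
In illustrative notation (not necessarily the authors' symbols), the two conditional indices described above can be written for examinee $p$ in terms of within-person profile variances:

\[
I_{1,p} = \frac{v_p(\mathrm{obs})}{v_p(\mathrm{err})},
\qquad
I_{2,p} = \frac{v_p(\mathrm{obs}) - v_p(\mathrm{err})}{v_p(\mathrm{obs})},
\]

where $v_p(\mathrm{obs})$ is the variance of examinee $p$'s observed subscore deviations around their own mean and $v_p(\mathrm{err})$ is the profile variance expected from measurement error alone. A profile is flagged as interpretable when $I_{1,p}$ is large (equivalently, when $I_{2,p}$ approaches 1), that is, when its peaks and valleys exceed what error alone would produce.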


2009 ◽  
Vol 31 (1) ◽  
pp. 81
Author(s):  
Takeaki Kumazawa

Classical test theory (CTT) has been widely used to estimate the reliability of measurements. Generalizability theory (G theory), an extension of CTT, is a powerful statistical procedure, particularly useful for performance testing, because it enables estimation of the percentage of variance attributable to persons and to multiple sources of error. This study focuses on a generalizability study (G study) conducted to investigate such variance components for a paper-and-pencil multiple-choice vocabulary test used as a diagnostic pretest. Further, a decision study (D study) was conducted to compute the generalizability coefficient (G coefficient) for absolute decisions. The results of the G and D studies indicated that 46% of the total variance was due to the item effect, and the G coefficient for absolute decisions was low.
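
As a sketch of the quantities behind these G and D studies, assume the usual single-facet persons × items ($p \times i$) design, which the abstract implies but does not spell out. The G study partitions total variance into $\sigma^2_p$, $\sigma^2_i$, and $\sigma^2_{pi,e}$, and the D study projects the absolute-decision (dependability) coefficient, often written $\Phi$, for a test of $n'_i$ items:

\[
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_i + \sigma^2_{pi,e}}{n'_i}}.
\]

A large item component, here 46% of the total variance, inflates the absolute error term in the denominator, which is consistent with the low coefficient reported.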


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ali Khodi

The present study investigated factors that affect EFL writing scores using generalizability theory (G-theory). For this purpose, one hundred and twenty students completed one independent and one integrated writing task. Their performances were then scored by six raters: one self-rating, three peer ratings, and two instructor ratings. The main purpose of the study was to determine the relative and absolute contributions of different facets, such as student, rater, task, method of scoring, and background of education, to the validity of writing assessment scores. The results indicated three major sources of variance: (a) the student-by-task-by-method-of-scoring interaction (nested in background of education) (STM:B), contributing 31.8% of the total variance; (b) the student-by-rater-by-task-by-method-of-scoring interaction (nested in background of education) (SRTM:B), contributing 26.5%; and (c) the student-by-rater-by-method-of-scoring interaction (nested in background of education) (SRM:B), contributing 17.6%. With regard to the G coefficients in the G-study (relative G coefficient ≥ 0.86), the assessment results were found to be highly valid and reliable. The main sources of error variance were the student-by-rater interaction (nested in background of education) (SR:B) and the rater-by-background-of-education interaction, contributing 99.2% and 0.8% of the error variance, respectively. Additionally, ten separate G-studies were conducted to investigate the contribution of different facets with rater, task, and method of scoring as the differentiation facet. These studies suggested that peer rating, the analytical scoring method, and integrated writing tasks were the most reliable and generalizable designs for the writing assessments. Finally, five decision studies (D-studies) were conducted at the optimization level, indicating that at least four raters (G coefficient = 0.80) are necessary for a valid and reliable assessment. Based on these results, to achieve the greatest gain in generalizability, teachers should have their students take two writing assessments and have their performance rated with at least two scoring methods by at least four raters.
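
The rater D-studies summarized above can be sketched generically in Python. The sketch below uses the standard relative G coefficient for a simple persons × raters design rather than the article's full nested design, and the variance components (and the function name `relative_g`) are hypothetical placeholders, not the study's estimates:

```python
# Generic D-study sketch: project the relative G coefficient for a simple
# persons x raters (p x r) design as the number of raters increases.
# The variance components below are hypothetical, not taken from the study.

def relative_g(var_p: float, var_pr: float, n_raters: int) -> float:
    """Relative G coefficient: universe-score variance divided by itself
    plus relative error variance (p x r interaction averaged over raters)."""
    return var_p / (var_p + var_pr / n_raters)

var_person = 0.50        # hypothetical universe-score (person) variance
var_person_rater = 0.50  # hypothetical p x r interaction (plus residual) variance

for n in range(1, 7):
    print(f"{n} rater(s): G = {relative_g(var_person, var_person_rater, n):.2f}")
# With these placeholder components, G reaches 0.80 at four raters,
# mirroring the kind of conclusion a D-study supports.
```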


2017 ◽  
Author(s):  
Yu Li ◽  
Renmin Han ◽  
Chongwei Bi ◽  
Mo Li ◽  
Sheng Wang ◽  
...  

Motivation: Oxford Nanopore sequencing is a sequencing technology that has developed rapidly in recent years. To keep pace with the explosion of downstream data-analysis tools, a versatile Nanopore sequencing simulator is needed to complement experimental data and to benchmark newly developed tools. However, all currently available simulators are based on simple statistics of the produced reads and therefore have difficulty capturing the complex nature of the Nanopore sequencing procedure, whose core step is the generation of raw electrical current signals. Results: Here we propose a deep-learning-based simulator, DeepSimulator, that mimics the entire pipeline of Nanopore sequencing. Starting from a given reference genome or assembled contigs, we simulate the electrical current signals with a context-dependent deep learning model, followed by a base-calling procedure that yields simulated reads. This workflow mimics the sequencing procedure more naturally. Thorough experiments across four species show that the signals generated by our context-dependent model are more similar to experimentally obtained signals than those generated by the official context-independent pore model. For the simulated reads, we provide a parameter interface so that users can obtain reads with accuracies ranging from 83% to 97%. Reads generated with the default parameters have almost the same properties as real data. Two case studies demonstrate how DeepSimulator can benefit the development of tools for de novo assembly and for low-coverage SNP detection. Availability: The software can be accessed freely at https://github.com/lykaust15/deep_simulator.
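
The signal-generation step described above can be illustrated with a toy sketch. This is not DeepSimulator's code or API; the k-mer-to-current mapping below is a deterministic placeholder standing in for the learned context-dependent model, and all parameter values are arbitrary:

```python
# Toy illustration of context-dependent signal simulation (not DeepSimulator):
# slide a k-mer window over the reference, map each k-mer to a mean current
# level, let it dwell for several samples, and add Gaussian noise.
import hashlib
import numpy as np

def toy_signal(reference: str, k: int = 6, samples_per_kmer: int = 8,
               noise_sd: float = 1.5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)

    def kmer_level(kmer: str) -> float:
        # Deterministic stand-in for a learned pore model: roughly 80-120 pA.
        h = int(hashlib.md5(kmer.encode()).hexdigest(), 16)
        return 80.0 + (h % 400) / 10.0

    chunks = []
    for i in range(len(reference) - k + 1):
        mean = kmer_level(reference[i:i + k])
        chunks.append(rng.normal(mean, noise_sd, size=samples_per_kmer))
    return np.concatenate(chunks)

signal = toy_signal("ACGTTGCAACGT" * 20)
print(signal.shape)  # one noisy current chunk per k-mer window of the reference
```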


2021 ◽  
Vol 12 (1) ◽  
pp. 18
Author(s):  
Jennifer S Byrd ◽  
Michael J Peeters

Objective: There is a paucity of validation evidence for assessing clinical case presentations by Doctor of Pharmacy (PharmD) students. Within Kane's Framework for Validation, evidence for inferences of scoring and generalization should be generated first. Thus, our objectives were to characterize and improve scoring, as well as build initial generalization evidence, in order to provide validation evidence for performance-based assessment of clinical case presentations. Design: Third-year PharmD students worked up patient cases from a local hospital. Students orally presented and defended their therapeutic care plan to pharmacist preceptors (evaluators) and fellow students. Evaluators scored each presentation using an 11-item instrument with a 6-point rating scale. In addition, evaluators scored a global item with a 4-point rating scale. Rasch measurement was used for the scoring analysis, while generalizability theory was used for the generalization analysis. Findings: Thirty students each presented five cases that were evaluated by 15 preceptors using the 11-item instrument. Under Rasch measurement, the 11-item instrument's 6-point rating scale did not function adequately; it worked only after being collapsed to a 4-point rating scale. The revised 11-item instrument also showed redundancy. Alternatively, the global item performed reasonably on its own. Using multivariate generalizability theory, the G coefficient (reliability) for the series of five case presentations was 0.76 with the 11-item instrument and 0.78 with the global item. Reliability depended largely on the number of case presentations and, to a lesser extent, on the number of evaluators per case presentation. Conclusions: Our pilot results confirm that scoring should be kept simple, in both rating scale and instrument. More specifically, the longer 11-item instrument measured the construct but contained redundancy, whereas the single global item provided adequate measurement across multiple case presentations. Further, acceptable reliability can be achieved by balancing the number of case presentations against the number of evaluators.
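
The rating-scale diagnosis reported above can be made concrete with the Andrich rating scale model, assuming the analysis used this common Rasch parameterization (the abstract does not name the exact model). For person ability $\theta_n$, item difficulty $\delta_i$, and thresholds $\tau_1,\dots,\tau_K$,

\[
P(X_{ni}=k) = \frac{\exp\!\Big(\sum_{j=1}^{k} (\theta_n - \delta_i - \tau_j)\Big)}
{\sum_{m=0}^{K} \exp\!\Big(\sum_{j=1}^{m} (\theta_n - \delta_i - \tau_j)\Big)},
\qquad k = 0, 1, \dots, K,
\]

with the empty sum for $m = 0$ taken as zero. Disordered or poorly separated threshold estimates $\hat{\tau}_j$ indicate categories that raters do not use as intended, which is the usual evidence for collapsing a 6-point scale to 4 points, as reported here.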


1967 ◽  
Vol 21 (3) ◽  
pp. 1005-1013 ◽  
Author(s):  
Kenneth I. Howard ◽  
James A. Hill

In a Chain P-factor analysis, the elimination of between-person variance reduces the contribution of "true score" variance to the true-score to total-score variance ratio based on the reduced scores. Factors that emerge in such an analysis may unduly reflect the influence of error variance and demand caution in their interpretation. An expanded criterion of meaningfulness was proposed that contrasted an obtained solution with a randomly generated solution under the null hypothesis that independent judges could do no better than chance in distinguishing the real factors from the random ones. A Chain P-analysis of real data gathered from 45 female patients, tested after each of 10 successive psychotherapy sessions, was contrasted with a parallel analysis of Monte Carlo data. Four judges significantly discriminated the real factors from the random factors in a paired-comparison task. The results strengthened the credibility of the Chain P-analysis and established the usefulness of an expanded criterion of meaningfulness.


1993 ◽  
Vol 18 (2) ◽  
pp. 197-206 ◽  
Author(s):  
George A. Marcoulides

Generalizability theory provides a framework for examining the dependability of behavioral measurements. When designing generalizability studies, two important statistical issues are generally considered: power and measurement error. Control over power and error of measurement can be obtained by manipulation of sample size and/or test reliability. In generalizability theory, the mean error variance is an estimate that takes into account both these statistical issues. When limited resources are available, determining an optimal measurement design is not a simple task. This article presents a methodology for minimizing mean error variance in generalizability studies when resource constraints are imposed.
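
As a concrete illustration of the kind of resource-constrained problem described above (a brute-force sketch, not the article's procedure), the snippet below searches for the numbers of items and raters that minimize absolute error variance in a two-facet persons × items × raters design, subject to a cost ceiling. All variance components, unit costs, and the budget are hypothetical:

```python
# Brute-force search for a D-study design that minimizes absolute error
# variance under a cost constraint. All numbers are hypothetical placeholders.
from itertools import product

# Hypothetical G-study variance components for a p x i x r design
var = {"i": 0.20, "r": 0.05, "pi": 0.30, "pr": 0.10, "ir": 0.02, "pir,e": 0.25}
cost_item, cost_rater, budget = 2.0, 10.0, 100.0

def abs_error_variance(n_i: int, n_r: int) -> float:
    """Absolute error variance for the p x I x R random-effects D-study."""
    return (var["i"] / n_i + var["r"] / n_r + var["pi"] / n_i
            + var["pr"] / n_r + var["ir"] / (n_i * n_r)
            + var["pir,e"] / (n_i * n_r))

# Enumerate feasible designs within budget and keep the one with least error.
best = min(
    ((n_i, n_r) for n_i, n_r in product(range(1, 51), range(1, 11))
     if n_i * cost_item + n_r * cost_rater <= budget),
    key=lambda d: abs_error_variance(*d),
)
print("best (n_items, n_raters):", best,
      "abs error variance:", round(abs_error_variance(*best), 4))
```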

