Item-Score Reliability in Empirical-Data Sets and Its Relationship With Other Item Indices

2017 ◽  
Vol 78 (6) ◽  
pp. 998-1020 ◽  
Author(s):  
Eva A. O. Zijlmans ◽  
Jesper Tijmstra ◽  
L. Andries van der Ark ◽  
Klaas Sijtsma

Reliability is usually estimated for a total score, but it can also be estimated for item scores. Item-score reliability can be useful for assessing the repeatability of an individual item score in a group. Three methods to estimate item-score reliability are discussed, known as method MS, method [Formula: see text], and method CA. The item-score reliability methods are compared with four well-known and widely accepted item indices: the item-rest correlation, the item-factor loading, the item scalability, and the item discrimination. Realistic values for item-score reliability in empirical-data sets are monitored to obtain an impression of the values to be expected in other empirical-data sets. The relations between the three item-score reliability methods and the four item indices are investigated. Tentatively, a minimum value for the item-score reliability methods to be used in item analysis is recommended.
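Of the four item indices named above, the item-rest correlation is the simplest to compute: each item is correlated with the total score after that item is removed. A minimal sketch (the toy data matrix is invented for illustration, not taken from the article):

```python
import numpy as np

def item_rest_correlations(scores):
    """Correlation of each item score with its rest score
    (total score minus the item itself)."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    total = scores.sum(axis=1)
    out = []
    for i in range(n_items):
        rest = total - scores[:, i]  # total score without item i
        out.append(np.corrcoef(scores[:, i], rest)[0, 1])
    return np.array(out)

# toy binary item-score matrix: 6 persons x 3 items
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 0, 0],
              [1, 1, 1],
              [0, 1, 0]])
print(item_rest_correlations(X))
```

Using the rest score rather than the full total avoids the spurious inflation that arises when an item is correlated with a total that includes itself.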

2018 ◽  
Vol 42 (7) ◽  
pp. 553-570 ◽  
Author(s):  
Eva A. O. Zijlmans ◽  
L. Andries van der Ark ◽  
Jesper Tijmstra ◽  
Klaas Sijtsma

Reliability is usually estimated for a test score, but it can also be estimated for item scores. Item-score reliability can be useful to assess the item’s contribution to the test score’s reliability, for identifying unreliable scores in aberrant item-score patterns in person-fit analysis, and for selecting the most reliable item from a test to use as a single-item measure. Four methods were discussed for estimating item-score reliability: the Molenaar–Sijtsma method (method MS), Guttman’s method [Formula: see text], the latent class reliability coefficient (method LCRC), and the correction for attenuation (method CA). A simulation study was used to compare the methods with respect to median bias, variability (interquartile range [IQR]), and percentage of outliers. The simulation study consisted of six conditions: standard, polytomous items, unequal [Formula: see text] parameters, two-dimensional data, long test, and small sample size. Methods MS and CA were the most accurate. Method LCRC showed almost unbiased results, but large variability. Method [Formula: see text] consistently underestimated item-score reliability, but showed a smaller IQR than the other methods.
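The outcome measures used in the simulation study (median bias, IQR, and percentage of outliers) can be sketched as follows; the function and the outlier rule (Tukey's 1.5 × IQR fences) are our own illustrative assumptions, not necessarily the authors' exact definitions:

```python
import numpy as np

def summarize_estimates(estimates, true_value):
    """Median bias, IQR, and percentage of outliers (Tukey's
    1.5 * IQR fences) for replicate reliability estimates."""
    est = np.asarray(estimates, dtype=float)
    q1, med, q3 = np.percentile(est, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    pct_outliers = 100.0 * np.mean((est < lo) | (est > hi))
    return med - true_value, iqr, pct_outliers

# hypothetical replicate estimates of an item-score reliability of 0.7
bias, iqr, pct = summarize_estimates([0.5, 0.6, 0.7, 0.8, 0.9], 0.7)
print(bias, iqr, pct)
```

Median bias and IQR are preferred here over mean and standard deviation because reliability estimates are bounded and their sampling distributions can be skewed.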


2021 ◽  
Vol 7 ◽  
Author(s):  
Noema Gajdoš Kmecová ◽  
Barbara Pet'ková ◽  
Jana Kottferová ◽  
Rachel Sarah Wannell ◽  
Daniel Simon Mills

Using a popular method of behaviour evaluation which rates the intensity of behaviour in different contexts, we demonstrate how pooling item scores relating to a given construct can reveal different potential risk factors for the dependent variable depending on how the total score is constructed. We highlight how similar simple total scores can be constructed through very different combinations of constituent items. We argue for the importance of examining individual item-score distributions, and the results from different intensity thresholds, before deciding on the preferred method for calculating a meaningful dependent variable. We consider it a fallacy to simply pool individual item scores that conflate context with intensity into an average score and to assume this represents a biologically meaningful measure of trait intensity. Specifically, using four items that describe intercat aggression and eleven that describe playfulness in cats in the Fe-BARQ, we found that sex and neuter status, social play, and fearfulness were consistently significant predictors of intercat aggression scores, and that age, age when obtained, social play, and fearfulness were significant predictors of playfulness scores. However, the significance of other factors, such as scratching, varied with the threshold used to calculate the total score. We argue that some of these inconsistent variables may be biologically and clinically important and should not be considered random error. Instead, they need to be evaluated in the context of other available evidence.


Methodology ◽  
2018 ◽  
Vol 14 (4) ◽  
pp. 156-164 ◽  
Author(s):  
Keith A. Markus

Abstract. Bollen and colleagues have advocated the use of formative scales despite the fact that formative scales lack an adequate underlying theory to guide development or validation, such as that which underlies reflective scales. Three conceptual obstacles impede the development of such theory: the redefinition of measurement restricted to the context of model fitting, the inscrutable notion of conceptual unity, and a systematic conflation of item scores with attributes. Setting aside these obstacles opens the door to progress in developing the needed theory to support formative scale use. A broader perspective facilitates consideration of standard scale development concerns as applied to formative scales, including scale development, item analysis, reliability, and item bias. While formative scales require a different pattern of emphasis, all five of the traditional sources of validity evidence apply to formative scales. Responsible use of formative scales requires greater attention to developing the requisite underlying theory.


2012 ◽  
Vol 9 (10) ◽  
pp. 13439-13496 ◽  
Author(s):  
M. J. Smith ◽  
M. C. Vanderwel ◽  
V. Lyutsarev ◽  
S. Emmott ◽  
D. W. Purves

Abstract. The feedback between climate and the terrestrial carbon cycle will be a key determinant of the dynamics of the Earth System over the coming decades and centuries. However, Earth System Model projections of the terrestrial carbon balance vary widely over these timescales. This is largely due to differences in their carbon cycle models. A major goal in biogeosciences is therefore to improve understanding of the terrestrial carbon cycle to enable better constrained projections. Essential to achieving this goal will be assessing the empirical support for alternative models of component processes, identifying key uncertainties and inconsistencies, and ultimately identifying the models that are most consistent with empirical evidence. To begin meeting these requirements, we data-constrained all parameters of all component processes within a global terrestrial carbon model. Our goals were to assess the climate dependencies obtained for different component processes when all parameters have been inferred from empirical data, assess whether these were consistent with current knowledge and understanding, assess the importance of different data sets and the model structure for inferring those dependencies, assess the predictive accuracy of the model, and identify a methodology by which alternative component models could be compared within the same framework in future. Although formulated as differential equations describing carbon fluxes through plant and soil pools, the model was fitted assuming the carbon pools were in states of dynamic equilibrium (input rates equal output rates). Thus, the parameterised model is of the equilibrium terrestrial carbon cycle. All but 2 of the 12 component processes of the model were inferred to have strong climate dependencies, although it was not possible to data-constrain all parameters, indicating some potentially redundant details.
Similar climate dependencies were obtained for most processes whether inferred individually from their corresponding data sets or using the full terrestrial carbon model and all available data sets, indicating a strong overall consistency in the information provided by different data sets under the assumed model formulation. A notable exception was plant mortality, in which qualitatively different climate dependencies were inferred depending on the model formulation and data sets used, highlighting this component as the major structural uncertainty in the model. All but two component processes predicted empirical data better than a null model in which no climate dependency was assumed. Equilibrium plant carbon was predicted especially well (explaining around 70% of the variation in the withheld evaluation data). We discuss the advantages of our approach in relation to advancing our understanding of the carbon cycle and enabling Earth System Models to make better constrained projections.


2021 ◽  
Author(s):  
Phalad Tipsrirach ◽  
Witoon Thacha ◽  
Prayuth Chusorn

This research aimed at creating a structural model of the indicators of educational leadership for primary school principals in Thailand, a theoretical model tested for coherence against empirical data collected from a sample of 580 participants, selected from the 30,719 primary school principals across the country. To create this theoretical structural model, the suitability of the indicators was first assessed so that suitable indicators could be selected for the model; the model was then tested for coherence with the empirical data and the factor loadings were examined. The results of the research were as follows. Firstly, all indicators applied in the research were selected and placed into the theoretical structural model, because their average and distribution coefficient values met the set criteria. Secondly, the theoretical model is coherent with the empirical data, as the values of the relative chi-square, Root Mean Square Error of Approximation, Goodness-of-Fit Index, Adjusted Goodness-of-Fit Index, Comparative Fit Index, and Normed Fit Index met the set criteria. Finally, the factor loadings of the key elements, sub-elements, and indicators met the set criteria. This showed that the theoretical model from this research can be beneficial for the research population, with construct validity.
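The fit indices named in the abstract are computed from the model and baseline chi-square statistics. A sketch using the standard formulas, with hypothetical fit statistics (the numbers below are invented for illustration, not taken from the study):

```python
import math

def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation:
    sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max((chi2 - df) / (df * (n - 1)), 0.0))

def cfi(chi2, df, chi2_base, df_base):
    """Comparative Fit Index relative to the baseline
    (independence) model."""
    d_model = max(chi2 - df, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 - d_model / d_base if d_base > 0 else 1.0

# hypothetical statistics for a model fitted to n = 580 cases
print(round(rmsea(120.0, 60, 580), 3))          # low values indicate close fit
print(round(cfi(120.0, 60, 2400.0, 78), 3))     # values near 1 indicate good fit
```

Conventional cut-offs (e.g. RMSEA below about 0.05–0.08, CFI above about 0.95) are typically what "as set in the criteria" refers to in model-fit reporting.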


2011 ◽  
Vol 199 (4) ◽  
pp. 275-280 ◽  
Author(s):  
Takefumi Suzuki ◽  
Gary Remington ◽  
Tamara Arenovich ◽  
Hiroyuki Uchida ◽  
Ofer Agid ◽  
...  

Background: Improvements are greatest in the earlier weeks of antipsychotic treatment of patients with non-resistant schizophrenia. Aims: To address the early time-line for improvement with antipsychotics in treatment-resistant schizophrenia. Method: Randomised double-blind trials of antipsychotic medication in adult patients with treatment-resistant schizophrenia were investigated (last search June 2010). A series of meta-regression analyses were carried out to examine the effect of time on the average item scores in the Positive and Negative Syndrome Scale (PANSS) or Brief Psychiatric Rating Scale (BPRS) at three or more distinct time points within the first 6 weeks of treatment. Results: Study duration varied from 4 weeks to 1 year and the definitions of treatment resistance as well as of treatment response were not necessarily consistent across 19 identified studies, resulting in highly variable rates of response (0–76%). The mean standardised baseline item score in the PANSS or BPRS was 3.4 (s.e. = 0.06) in the five studies included in the meta-regression analysis, with the average baseline Clinical Global Impression – Severity score being 5.2 (marked illness). For the pooled population treated with a range of antipsychotics (n = 1019), significant reductions in the mean item scores occurred during the first 4 weeks; improvements observed in later weeks were smaller and non-significant. In contrast, weekly improvement with clozapine was significant throughout (n = 356). Conclusions: Our findings provide preliminary evidence that the majority of improvement with antipsychotics may occur relatively early. More consistent improvements with clozapine may be associated with a gradual titration. To further elucidate response patterns, future studies are needed to provide data over regular intervals during earlier stages of treatment.


Author(s):  
Lawrence Leemis

This chapter switches from the traditional analysis of Benford's law using data sets to a search for probability distributions that obey Benford's law. It begins by briefly discussing the origins of Benford's law through the independent efforts of Simon Newcomb (1835–1909) and Frank Benford, Jr. (1883–1948), both of whom made their discoveries through empirical data. Although Benford's law applies to a wide variety of data sets, none of the popular parametric distributions, such as the exponential and normal distributions, agree exactly with Benford's law. The chapter thus highlights the failures of several of these well-known probability distributions in conforming to Benford's law, considers what types of probability distributions might produce data that obey Benford's law, and looks at some of the geometry associated with these probability distributions.
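Benford's law itself is simple to state in code, and the comparison below with an exponential sample illustrates the chapter's point that common parametric distributions agree only approximately with the law (the sampling setup and helper names are ours, for illustration):

```python
import math
import random

def benford_pmf():
    """Benford's law: P(first digit = d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while 0 < x < 1:
        x *= 10
    return int(x)

# Leading-digit frequencies of an exponential(1) sample: close to,
# but not exactly, Benford's law.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(100_000)]
counts = {d: 0 for d in range(1, 10)}
for x in sample:
    if x > 0:  # guard against a (vanishingly unlikely) exact zero
        counts[first_digit(x)] += 1
n = sum(counts.values())
for d, p in benford_pmf().items():
    print(d, round(counts[d] / n, 4), round(p, 4))
```

Distributions that are exactly Benford are those whose logarithm (base 10) is uniform modulo 1; the exponential and normal distributions only approximate this, which is the mismatch the chapter examines.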


CNS Spectrums ◽  
2018 ◽  
Vol 23 (1) ◽  
pp. 80-81
Author(s):  
Steven R. Pliszka ◽  
Valerie K. Arnold ◽  
Andrea Marraffino ◽  
Norberto J. DeSousa ◽  
Bev Incledon ◽  
...  

Abstract. Objective: In a phase 3 trial of children with ADHD, DR/ER-MPH (formerly HLD200), a delayed-release and extended-release methylphenidate, improved ADHD symptoms and reduced at-home early morning and late afternoon/evening functional impairments versus placebo, as measured by the validated Parent Rating of Evening and Morning Behaviors-Revised, Morning (PREMB-R AM) and Evening (PREMB-R PM) subscales. This post hoc analysis evaluated the effect of DR/ER-MPH versus placebo on individual PREMB-R AM/PM item scores. Method: Data were analyzed from a pivotal, randomized, double-blind, multicenter, placebo-controlled, parallel-group, phase 3 trial of DR/ER-MPH in children (6-12 years) with ADHD (NCT02520388). Using the 3-item PREMB-R AM and 8-item PREMB-R PM, both key secondary endpoints, investigators evaluated early morning and late afternoon/evening functional impairment by scoring each item on a severity scale from 0 (none) to 3 (a lot). For post hoc analyses, treatment comparisons between DR/ER-MPH and placebo at endpoint were determined by using least squares mean changes from baseline on individual PREMB-R AM/PM item scores derived from an analysis of covariance (ANCOVA) model with treatment as the main effect, and study center and baseline score as covariates. Results: Of 163 children enrolled across 22 sites, 161 were included in the intent-to-treat population (DR/ER-MPH, n=81; placebo, n=80) and 138 completed the study. The mean DR/ER-MPH dose achieved after 3 weeks of treatment was 68.1 mg. Following 3 weeks of treatment, DR/ER-MPH significantly reduced mean individual item scores from baseline versus placebo on all PREMB-R AM items (all P≤0.002; “getting out of bed”, “getting ready”, and “arguing or struggling in the morning”).
Additionally, DR/ER-MPH significantly reduced mean individual item scores from baseline on 5 out of 8 PREMB-R PM items (P<0.01 in 2 items [“sitting through dinner” and “playing quietly”] and P<0.05 in 3 items [“inattentive/distractible”, “transitioning between activities”, and “settling down/getting ready for bed”]). There was a trend towards a reduction on 2 other items of the PREMB-R PM (P<0.09). Distributions of the ratings for each item will be presented. No serious TEAEs were reported; TEAEs were consistent with methylphenidate. Conclusions: Post hoc analyses revealed that DR/ER-MPH significantly reduced all PREMB-R AM item scores, including “getting out of bed”, and many PREMB-R PM items, including “getting ready for bed”, in children with ADHD. These findings are worth further exploration. Funding Acknowledgements: Ironshore Pharmaceuticals & Development, Inc.


1989 ◽  
Vol 65 (1) ◽  
pp. 155-160 ◽  
Author(s):  
Raymond Hubbard ◽  
Stuart J. Allen

Given nuances in the computer programs, unwary researchers performing a common factor analysis on the same set of data can be expected to arrive at very different conclusions regarding the number and nature of extracted factors if they use the BMDP, as opposed to the SPSSx (or SAS), statistical software package. This is illustrated using six well-known empirical data sets from the psychology literature.
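One way such package-dependent conclusions can arise is through defaults in how the correlation matrix's diagonal is treated before factors are extracted. A sketch of the effect, counting factors that survive the eigenvalue-greater-than-one rule with unities versus squared multiple correlations (SMCs) on the diagonal; the toy correlation matrix is invented and the mechanism shown is illustrative, not the specific BMDP/SPSSx difference the article documents:

```python
import numpy as np

def retained_factors(R, use_smc=False):
    """Number of eigenvalues > 1 of a correlation matrix, optionally
    after replacing the diagonal with squared multiple correlations,
    as some common factor programs do by default."""
    R = np.asarray(R, dtype=float).copy()
    if use_smc:
        inv = np.linalg.inv(R)
        smc = 1.0 - 1.0 / np.diag(inv)  # SMC of each variable on the rest
        np.fill_diagonal(R, smc)
    eig = np.linalg.eigvalsh(R)
    return int((eig > 1.0).sum())

# toy correlation matrix with two correlated pairs of variables
R = np.array([[1.0, 0.6, 0.1, 0.1],
              [0.6, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.6],
              [0.1, 0.1, 0.6, 1.0]])
print(retained_factors(R))                # unities on the diagonal
print(retained_factors(R, use_smc=True))  # SMCs on the diagonal
```

The same data yield different factor counts under the two conventions, which is the kind of divergence unwary users of different packages can encounter.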


1995 ◽  
Vol 81 (2) ◽  
pp. 371-379
Author(s):  
Daniel E. Boone

WAIS–R aging patterns were examined for a group of 200 psychiatric inpatients. Inpatients were grouped into six age categories: less than 24, 24–28, 29–32, 33–38, 39–43, and greater than 43 years. Means for the Verbal and Performance sums of scaled scores, subtest scaled scores, raw score totals, and individual item scores were examined for each age category. The classical aging pattern was observed, wherein more crystallized cognitive abilities remained stable across the age groups while more fluid abilities dropped sharply with increasing age. Results supported the decline-in-fluid-cognitive-abilities hypothesis for WAIS–R aging patterns advocated by Horn in 1985 and Kaufman in 1990.

