Reframing rankings in educational assessments

Abstract. In large-scale educational assessments such as the Third International Mathematics and Science Study (TIMSS) or the Programme for International Student Assessment (PISA), sizeable numbers of test administrators (TAs) are needed to conduct the assessment sessions in the participating schools. TA training sessions are run and administration manuals are compiled with the aim of ensuring standardized, comparable assessment situations in all student groups. To date, however, there has been no empirical investigation of the effectiveness of these standardizing efforts. In the present article, we probe for systematic TA effects on mathematics achievement and sample attrition in a student achievement study. Multilevel analyses for cross-classified data using Markov Chain Monte Carlo (MCMC) procedures were performed to separate the variance that can be attributed to differences between schools from the variance associated with TAs. After controlling for school effects, only a very small, nonsignificant proportion of the variance in mathematics scores and response behavior was attributable to the TAs (< 1%). We discuss practical implications of these findings for the deployment of TAs in educational assessments.

The impact of students’ test-taking effort on growth estimates in low-stakes educational assessments

Educational Research and Evaluation ◽

10.1080/13803611.2021.1977152 ◽

2021 ◽

pp. 1-19

Author(s):

Seyma Nur Yildirim-Erbasli ◽

Okan Bulut

Keyword(s):

Test Taking ◽

Educational Assessments

A Longitudinal Analysis of Doctoral Graduate Supply in the Educational Measurement Field

10.35542/osf.io/yzdbc ◽

2020 ◽

Author(s):

Jennifer Randall ◽

Joseph Rios ◽

Hyun Joo Jung

Keyword(s):

Educational Measurement ◽

Educational Institutions ◽

Supply Side ◽

Annual Growth Rate ◽

Doctoral Graduates ◽

Doctoral Graduate ◽

Black Graduates ◽

National Science ◽

Educational Assessments ◽

International Graduates

For nearly three decades, researchers have been concerned that the educational measurement field is not producing enough graduate-level specialists to meet the growing demand driven by the increased use of educational assessments in the U.S. This study examined the supply-side aspect of the proposed labor shortage by relying on data from the National Science Foundation’s Survey of Earned Doctorates collected between 1997 and 2016. Over the 20 years examined, measurement programs produced 3,124 doctoral graduates, and across this time span, the annual production of graduates nearly doubled. This supply expansion can largely be attributed to the increase in the number of international graduates, which outpaced the annual growth rate of domestic PhD recipients by 156%. Moreover, 85% of graduates self-identified as either White or Asian. Fewer than 10 Hispanic and no more than 20 Black graduates were produced in any of the years examined. Of the 76% of graduates who reported having a job offer or having accepted a position upon graduation, most entered the academy, despite the overall average starting salary there ($59,484) being considerably lower than the starting salary for their counterparts entering industry ($84,918), government ($69,970), or other educational institutions ($81,428).

Prediction of school outcome after preterm birth: a cohort study

Archives of Disease in Childhood ◽

10.1136/archdischild-2018-315441 ◽

2018 ◽

Vol 104 (4) ◽

pp. 348-353 ◽

Cited By ~ 2

Author(s):

David Odd ◽

David Evans ◽

Alan M Emond

Keyword(s):

Preterm Infants ◽

Linear Regression Models ◽

Term Infants ◽

Educational Trajectories ◽

Parents And Children ◽

Early Schooling ◽

Educational Assessments ◽

Educational Journal ◽

Gestational Group ◽

School Outcome

Objective: To identify if the educational trajectories of preterm infants differ from those of their term peers. Design: This work is based on the Avon Longitudinal Study of Parents and Children (ALSPAC). Educational measures were categorised into deciles to allow comparison of measures across time periods. Gestational age was categorised as preterm (23–36 weeks) or term (37–42 weeks). Multilevel mixed-effects linear regression models were derived to examine the trajectories of decile scores across the study period. Gestational group was added as an interaction term to assess if the trajectory between educational measures varied between preterm and term infants. Adjustment for possible confounders was performed. Subjects: The final dataset contained information on 12 586 infants born alive between 23 and 42 weeks of gestation. Main outcome measures: UK mandatory educational assessment (SATs) scores throughout the educational journey (including final GCSE results at 16 years of age). Results: Preterm infants had on average lower Key Stage (KS) scores than term children (−0.46 (−0.84 to −0.07)). However, on average, they gained on their term peers in each progressive measure (0.10 (0.01 to 0.19)), suggesting ‘catch up’ during the first few years at school. Preterm infants appeared to exhibit the increase in decile scores mostly between KS1 and KS2 (p=0.005) and little between KS2 and KS3 (p=0.182) or KS3 and KS4 (p=0.149). Conclusions: This work further emphasises the importance of early schooling and environment in these infants and suggests that support, long after the premature birth, may have additional benefits.

Validity and Accountability: Test Validation for 21st-Century Educational Assessments

Meeting the Challenges to Measurement in an Era of Accountability ◽

10.4324/9780203781302-13 ◽

2016 ◽

pp. 159-177

Keyword(s):

21st Century ◽

Test Validation ◽

Educational Assessments

Validity, Purpose and the Recycling of Results from Educational Assessments

Assessment and Learning ◽

10.4135/9781446250808.n16 ◽

2012 ◽

pp. 264-276 ◽

Cited By ~ 2

Author(s):

Paul E. Newton

Keyword(s):

Educational Assessments

Approaches and educational assessments of children’s speech, language and communication development in Swedish preschools

Early Child Development and Care ◽

10.1080/03004430.2019.1697693 ◽

2019 ◽

pp. 1-16 ◽

Cited By ~ 1

Author(s):

Ann Nordberg ◽

Katharina Jacobsson

Keyword(s):

Communication Development ◽

Language And Communication ◽

Educational Assessments ◽

Children's Speech

Accumulating and visualising tacit knowledge of teachers on educational assessments

Computers & Education ◽

10.1016/j.compedu.2011.06.018 ◽

2011 ◽

Vol 57 (4) ◽

pp. 2212-2223 ◽

Cited By ~ 7

Author(s):

Tzone-I. Wang ◽

Chien-Yuan Su ◽

Tung-Cheng Hsieh

Keyword(s):

Tacit Knowledge ◽

Educational Assessments

Improving the ability of qualitative assessments to discriminate student achievement levels

Journal of International Education in Business ◽

10.1108/jieb-12-2013-0048 ◽

2015 ◽

Vol 8 (1) ◽

pp. 49-58 ◽

Cited By ~ 1

Author(s):

Jeffrey Chi Hoe Mok ◽

Anita Ann Lee Toh

Keyword(s):

Student Achievement ◽

Business Communication ◽

Assessment Instrument ◽

Control Group ◽

Content Type ◽

Communication Course ◽

Educational Assessments ◽

Criterion Referenced ◽

Achievement Levels

Purpose – This paper aims to investigate the use of blind marking to increase the ability of criterion-referenced marking to discriminate students’ varied levels of knowledge and skill mastery in a business communication skills course. Design/methodology/approach – The business communication course in this study involved more than 10 teachers and 350 students each semester. Data were collected from four semesters of assignment grades to compare the distribution of grades in semesters that used blind marking and in the one that did not (the control group). The standard deviations of marks for each assignment were calculated and compared. Findings – Findings show that blind marking contributed to a wider spread of marks. The study concludes that blind marking, when implemented together with criterion-referenced marking rubrics, can improve the ability of qualitative assessments to discriminate student achievement levels. Originality/value – Research on the use of criterion-referenced marking rubrics has revealed that assessing with marking rubrics resulted in a wider range of marks awarded because assessors felt that the rubrics helped them make more objective judgments of students’ work (Kuisma, 1999). By the same token, it could be argued that because blind marking allows more objective judgment of students’ work (by reducing rater bias), it stands to reason that marks might be awarded across a wider range of the marking scale. However, the current literature on blind marking and grade/mark dispersion has yet to reveal a study on whether blind marking is able to increase the spread of marks and therefore indicate that an assessment instrument is effective in discriminating a range of student achievement levels. This paper should add to the current research on higher-quality educational assessments.
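The comparison of mark spread described in the abstract can be illustrated with a minimal sketch. The marks below are invented for illustration and are not data from the study; the group labels and the helper name `spread_by_group` are likewise assumptions, and the study's actual procedure may differ.

```python
from statistics import stdev

# Hypothetical assignment marks, invented for illustration only.
marks_by_semester = {
    "control (not blind)": [68, 70, 71, 72, 69, 70, 73, 71],
    "blind marking": [55, 62, 70, 78, 84, 66, 73, 90],
}


def spread_by_group(groups):
    """Return the sample standard deviation of the marks in each group."""
    return {name: stdev(marks) for name, marks in groups.items()}


spreads = spread_by_group(marks_by_semester)
for name, sd in spreads.items():
    print(f"{name}: SD = {sd:.2f}")
```

A larger standard deviation in the blind-marking group would correspond to the "wider spread of marks" the abstract reports; whether such a difference is meaningful would still need a formal test.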

Model Diagnostics for Bayesian Networks

Journal of Educational and Behavioral Statistics ◽

10.3102/10769986031001001 ◽

2006 ◽

Vol 31 (1) ◽

pp. 1-33 ◽

Cited By ~ 22

Author(s):

Sandip Sinharay

Keyword(s):

Model Checking ◽

Bayesian Networks ◽

Real Data ◽

Diagnostic Tools ◽

Model Diagnostics ◽

Posterior Predictive Model Checking ◽

Item Fit ◽

Item Functioning ◽

Data Application ◽

Educational Assessments

Bayesian networks are frequently used in educational assessments, primarily for learning about students’ knowledge and skills. Little work exists on assessing the fit of Bayesian networks. This article employs the posterior predictive model checking method, a popular Bayesian model checking tool, to assess the fit of simple Bayesian networks. A number of aspects of model fit that are of usual interest to practitioners are assessed using various diagnostic tools. This article suggests a direct data display for assessing overall fit, several diagnostics for assessing item fit, a graphical approach to examine whether the model can explain the association among the items, and a version of the Mantel–Haenszel statistic for assessing differential item functioning. Limited simulation studies and a real data application demonstrate the effectiveness of the suggested model diagnostics.
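For readers unfamiliar with the Mantel–Haenszel approach to differential item functioning, the sketch below computes the classical Mantel–Haenszel common odds ratio across matched ability strata. This is the textbook estimator, not the Bayesian-network version proposed in the article; the function name and the counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Classical Mantel-Haenszel common odds ratio.

    Each stratum is a tuple (a, b, c, d) of counts for one matched ability level:
    a = reference group correct, b = reference group incorrect,
    c = focal group correct,     d = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical counts for one item in two ability strata (illustration only).
example_strata = [(20, 10, 10, 20), (30, 10, 15, 20)]
print(mantel_haenszel_odds_ratio(example_strata))
```

A value near 1 suggests the item behaves similarly for both groups once ability is matched; values far from 1 flag possible differential item functioning.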
