Are There Test Administrator Effects in Large-Scale Educational Assessments?

Methodology ◽  
2007 ◽  
Vol 3 (4) ◽  
pp. 149-159 ◽  
Author(s):  
Oliver Lüdtke ◽  
Alexander Robitzsch ◽  
Ulrich Trautwein ◽  
Frauke Kreuter ◽  
Jan Marten Ihme

Abstract. In large-scale educational assessments such as the Third International Mathematics and Science Study (TIMSS) or the Program for International Student Assessment (PISA), sizeable numbers of test administrators (TAs) are needed to conduct the assessment sessions in the participating schools. TA training sessions are run and administration manuals are compiled with the aim of ensuring standardized, comparable assessment situations in all student groups. To date, however, there has been no empirical investigation of the effectiveness of these standardizing efforts. In the present article, we probe for systematic TA effects on mathematics achievement and sample attrition in a student achievement study. Multilevel analyses for cross-classified data using Markov Chain Monte Carlo (MCMC) procedures were performed to separate the variance that can be attributed to differences between schools from the variance associated with TAs. After controlling for school effects, only a very small, nonsignificant proportion of the variance in mathematics scores and response behavior was attributable to the TAs (< 1%). We discuss practical implications of these findings for the deployment of TAs in educational assessments.

2021 ◽  
Vol 33 (1) ◽  
pp. 139-167
Author(s):  
Andrés Strello ◽  
Rolf Strietholt ◽  
Isa Steinmann ◽  
Charlotte Siepmann

Abstract. Research to date on the effects of between-school tracking on inequalities in achievement and on performance has been inconclusive. A possible explanation is that different studies used different data, focused on different domains, and employed different measures of inequality. To address this issue, we used all accumulated data collected in the past 20 years in 75 countries and regions by the three largest international assessments: PISA (Programme for International Student Assessment), PIRLS (Progress in International Reading Literacy Study), and TIMSS (Trends in International Mathematics and Science Study). Following the seminal paper by Hanushek and Wößmann (2006), we combined data from a total of 21 cycles of primary and secondary school assessments to estimate difference-in-differences models for different outcome measures. We synthesized the effects using a meta-analytical approach and found strong evidence that tracking increased social achievement gaps, that it had smaller but still significant effects on dispersion inequalities, and that it had rather weak effects on educational inadequacies. In contrast, we did not find evidence that tracking increased performance levels. Besides these substantive findings, our study illustrated that the effect estimates varied considerably across the datasets used because the low number of countries as the units of analysis was a natural limitation. This finding casts doubt on the reproducibility of findings based on single international datasets and suggests that researchers should use different data sources to replicate analyses.


2018 ◽  
Vol 26 (2) ◽  
pp. 213-226 ◽  
Author(s):  
Jörg Blasius

Purpose: Evidence from past surveys suggests that some interviewees simplify their responses even in very well-organized and highly respected surveys. This paper aims to demonstrate that some interviewers, too, simplify their task by at least partly fabricating their data, and that, in some survey research institutes, employees simplify their task by fabricating entire interviews via copy and paste.
Design/methodology/approach: Using data from the principal questionnaires in the Programme for International Student Assessment (PISA) 2012 and the Programme for the International Assessment of Adult Competencies (PIAAC) data, the author applies statistical methods to search for fraudulent methods used by interviewers and employees at survey research organizations.
Findings: The author provides empirical evidence for potential fraud performed by interviewers and employees of survey research organizations in several countries that participated in PISA 2012 and PIAAC.
Practical implications: The proposed methods can be used as early as the initial phase of fieldwork to flag potentially problematic interviewer behavior such as copying responses.
Originality/value: The proposed methodology may help to improve data quality in survey research by detecting fabricated data.


2019 ◽  
Vol 44 (6) ◽  
pp. 752-781
Author(s):  
Michael O. Martin ◽  
Ina V.S. Mullis

International large-scale assessments of student achievement that have come to prominence over the past 25 years, such as the International Association for the Evaluation of Educational Achievement’s Trends in International Mathematics and Science Study (TIMSS) and Progress in International Reading Literacy Study and the Organization for Economic Cooperation and Development’s Program for International Student Assessment, owe a great deal in methodological terms to pioneering work by the National Assessment of Educational Progress (NAEP). Using TIMSS as an example, this article describes how a number of core techniques, such as matrix sampling, student population sampling, item response theory scaling with population modeling, and resampling methods for variance estimation, have been adapted and implemented in an international context and are fundamental to the international assessment effort. In addition to the methodological contributions of NAEP, this article illustrates how the large-scale international assessments go beyond measuring student achievement by representing important aspects of community, home, school, and classroom contexts in ways that can be used to address issues of importance to researchers and policymakers.


2019 ◽  
Vol 20 (1) ◽  
pp. 45-65
Author(s):  
Sam P.E. Hopp

The Programme for International Student Assessment (PISA) scores are a leading international measure of achievement. This study reviews German 2015 PISA data and imputes scores on income and time in nation to provide comparisons between native, immigrant and refugee students. This quantitative study uses cultural capital to explain the association of independent variables to PISA scores for students, revealing an unexpected negative linear relationship between those variables. The results and significance of this study may assist those involved in policy for refugee populations and inform the strategies of test protocols and measures in a new global student paradigm.


2020 ◽  
pp. 249-263
Author(s):  
Luisa Araújo ◽  
Patrícia Costa ◽  
Nuno Crato

Abstract. This chapter provides a short description of what the Programme for International Student Assessment (PISA) measures and how it measures it. First, it details the concepts associated with the measurement of student performance and the concepts associated with capturing student and school characteristics and explains how they compare with some other International Large-Scale Assessments (ILSA). Second, it provides information on the assessment of reading, the main domain in PISA 2018. Third, it provides information on the technical aspects of the measurements in PISA. Lastly, it offers specific examples of PISA 2018 cognitive items, corresponding domains (mathematics, science, and reading), and related performance levels.


2021 ◽  
Author(s):  
Alexander Robitzsch ◽  
Oliver Lüdtke

International large-scale assessments (LSAs) such as the Programme for International Student Assessment (PISA) provide important information about the distribution of student proficiencies across a wide range of countries. The repeated assessments of these content domains offer policymakers important information for evaluating educational reforms and receive considerable attention from the media. Furthermore, the analytical strategies employed in LSAs often define methodological standards for applied researchers in the field. Hence, it is vital to critically reflect on the conceptual foundations of analytical choices in LSA studies. This article discusses methodological challenges in selecting and specifying the scaling model used to obtain proficiency estimates from the individual student responses in LSA studies. We distinguish design-based inference from model-based inference. It is argued that for the official reporting of LSA results, design-based inference should be preferred because it allows for a clear definition of the target of inference (e.g., country mean achievement) and is less sensitive to specific modeling assumptions. More specifically, we discuss five analytical choices in the specification of the scaling model: (1) specification of the functional form of item response functions, (2) the treatment of local dependencies and multidimensionality, (3) the consideration of test-taking behavior for estimating student ability, and the role of country differential item functioning (DIF) for (4) cross-country comparisons and (5) trend estimation. This article's primary goal is to stimulate discussion about recently implemented changes and suggested refinements of the scaling models in LSA studies.


Methodology ◽  
2021 ◽  
Vol 17 (1) ◽  
pp. 22-38
Author(s):  
Jason C. Immekus

Within large-scale international studies, the utility of survey scores to yield meaningful comparative data hinges on the degree to which their item parameters demonstrate measurement invariance (MI) across compared groups (e.g., culture). To date, methodological challenges have restricted the ability to test the measurement invariance of item parameters of these instruments in the presence of many groups (e.g., countries). This study compares multigroup confirmatory factor analysis (MGCFA) and the alignment method to investigate the MI of the schoolwork-related anxiety survey across gender groups within the 35 Organisation for Economic Co-operation and Development (OECD) countries (gender × country) of the Programme for International Student Assessment 2015 study. Subsequently, the predictive validity of MGCFA and alignment-based factor scores for subsequent mathematics achievement is examined. Considerations related to invariance testing of noncognitive instruments with many groups are discussed.


2017 ◽  
Vol 28 (68) ◽  
pp. 344 ◽  
Author(s):  
Maria de Lourdes Haywanon Santos Araújo ◽  
Robinson Moreira Tenório

Brazilian results in PISA and their (mis)uses

The objective of this study was to analyze how the results of the Program for International Student Assessment (PISA) have been used in the Brazilian educational context. The literature review made it possible to identify assessment as a fundamental factor in improving the quality of education, to draw up an overview of research on PISA in Brazil, and to prompt discussion of the need to use the results of large-scale assessments. Based on documentary analysis and semi-structured interviews, it was possible not only to present a study on the use of the PISA results in the country but also to establish categories of use, such as Misuse or Non-Use, showing the possibilities and difficulties of such use and the role of administrators in this process.

Keywords: PISA; Use of Results; Educational Assessment; Public Policies.


2020 ◽  
Vol 26 (1) ◽  
pp. 20-32 ◽  
Author(s):  
Charlene Tan

This article examines a Confucian conception of competence and its corresponding response to the competencies agenda that underpins international large-scale assessments such as the Programme for International Student Assessment (PISA). It is argued that standardised transnational assessments are underpinned by a technical rationality that emphasises proficiency in discrete skills for their instrumental worth at the expense of moral cultivation and personal mastery. Challenging the competencies agenda, this paper draws upon a relational model of competence proposed by Jones and Moore (1995) that views competence as essentially communal, situated within social practices, and manifested through tacit achievement. A Confucian notion of competence is advocated where skills are premised on the virtue of ren (humanity) and demonstrated through appropriate judgement in everyday settings. A Confucian perspective offers an alternative to the behaviourist and generic notions of performance in global assessments by highlighting the social, cultural and ethical dimensions of competence.


2018 ◽  
Vol 26 (2) ◽  
pp. 196-212 ◽  
Author(s):  
Kentaro Yamamoto ◽  
Mary Louise Lennon

Purpose: Fabricated data jeopardize the reliability of large-scale population surveys and reduce the comparability of such efforts by destroying the linkage between data and measurement constructs. Such data result in the loss of comparability across participating countries and, in the case of cyclical surveys, between past and present surveys. This paper aims to describe how data fabrication can be understood in the context of the complex processes involved in the collection, handling, submission and analysis of large-scale assessment data. The actors involved in those processes, and their possible motivations for data fabrication, are also elaborated.
Design/methodology/approach: Computer-based assessments produce new types of information that enable us to detect the possibility of data fabrication, and therefore the need for further investigation and analysis. The paper presents three examples that illustrate how data fabrication was identified and documented in the Programme for the International Assessment of Adult Competencies (PIAAC) and the Programme for International Student Assessment (PISA) and discusses the resulting remediation efforts.
Findings: For two countries that participated in the first round of PIAAC, the data showed a subset of interviewers who handled many more cases than others. In Case 1, the average proficiency for respondents in those interviewers’ caseloads was much higher than expected and included many duplicate response patterns. In Case 2, anomalous response patterns were identified. Case 3 presents findings based on data analyses for one PISA country, where results for human-coded responses were shown to be highly inflated compared to past results.
Originality/value: This paper shows how new sources of data, such as timing information collected in computer-based assessments, can be combined with other traditional sources to detect fabrication.

