Clozing the gap: How far do cloze items measure?

2019 ◽  
Vol 37 (2) ◽  
pp. 235-253
Author(s):  
Jonathan Trace

Originally designed to measure reading and passage comprehension in L1 readers, cloze tests continue to be used for L2 assessment purposes. However, disputes remain about whether cloze items can measure beyond local comprehension information, and about whether they are purely a test of reading or whether performance can be generalized to broader claims about proficiency. The current study addresses both of these issues by drawing on a large pool of cloze items (k = 449) taken from 15 cloze passages administered to 675 L1 and 2246 L2 examinees. In conjunction with test scores, a large-scale L1 experiment was conducted using Amazon’s Mechanical Turk to determine the minimum context required to answer each item. Using Rasch analysis, item function was compared across both groups, with results indicating that cloze items can draw on information at both the sentence and the passage level. This further suggests that cloze tests generally measure reading in both L1 and L2 examinees. These findings have important implications for the continued use of cloze tests, particularly in classroom and high-stakes contexts where they are commonly found.

2018 ◽  
Author(s):  
Mark Sheskin ◽  
Frank Keil

Over the past decade, the internet has become an important platform for many types of psychology research, especially research with adult participants on Amazon’s Mechanical Turk. More recently, developmental researchers have begun to explore how online studies might be conducted with infants and children. Here, we introduce a new platform for online developmental research that includes live interaction with a researcher, and use it to replicate classic results in the literature. We end by discussing future research, including the potential for large-scale cross-cultural and longitudinal research.


2021 ◽  
Vol 12 ◽  
Author(s):  
Don Yao ◽  
Matthew P. Wallace

It is not uncommon for immigration-seekers to be actively involved in taking various language tests for immigration purposes. Given the large-scale and high-stakes nature of those language tests, the validity issues associated with them (e.g., appropriate score-based interpretations and decisions) are of great importance, as test scores may play a gate-keeping role in immigration. Though interest in investigating the validity of language tests for immigration purposes is becoming prevalent, there has yet to be a systematic review of the research foci and results of this body of research. To address this need, the current paper critically reviewed 11 validation studies on language assessment for immigration over the last two decades to identify what has been focused on and what has been overlooked in the empirical research, and to discuss current research interests and future research trends. The Assessment Use Argument (AUA) framework of Bachman and Palmer (2010), comprising four inferences (i.e., assessment records, interpretations, decisions, and consequences), was adopted to collect and examine evidence of test validity. Results showed that the consequences inference received the most investigation, focusing on immigration-seekers’ and policymakers’ perceptions of test consequences, while the decisions inference was the least probed, stressing immigration-seekers’ attitudes towards the impartiality of decision-making. It is recommended that further studies explore the perceptions of more kinds of stakeholders (e.g., test developers) and further investigate the fairness of decision-making based on test scores. Additionally, the current AUA framework includes only the positive and negative consequences that an assessment may engender but does not take compounded consequences into account. It is suggested that further research could enrich the framework.
The paper sheds some light on the field of language assessment for immigration and brings about theoretical, practical, and political implications for different kinds of stakeholders (e.g., researchers, test developers, and policymakers).


2019 ◽  
pp. 75-112
Author(s):  
James N. Stanford

This is the first of the two chapters (Chapters 4 and 5) that present the results of the online data collection project using Amazon’s Mechanical Turk system. These projects provide a broad-scale “bird’s eye” view of New England dialect features across large distances. This chapter examines the results from 626 speakers who audio-recorded themselves reading 12 sentences two times each. The recordings were analyzed acoustically and then modeled statistically and graphically. The results are presented in the form of maps and statistical analyses, with the goal of providing a large-scale geographic overview of modern-day patterns of New England dialect features.


2020 ◽  
Vol 23 (2) ◽  
pp. 1-19
Author(s):  
Michelle Chen ◽  
Jennifer J. Flasko

Seeking evidence to support content validity is essential to test validation. This is especially the case in contexts where test scores are interpreted in relation to external proficiency standards and where new test content is constantly being produced to meet test administration and security demands. In this paper, we describe a modified scale-anchoring approach to assessing the alignment between the Canadian English Language Proficiency Index Program (CELPIP) test and the Canadian Language Benchmarks (CLB), the proficiency framework to which the test scores are linked. We discuss how proficiency frameworks such as the CLB can be used to support the content validation of large-scale standardized tests through an evaluation of the alignment between the test content and the performance standards. By sharing both the positive implications and challenges of working with the CLB in high-stakes language test validation, we hope to help raise the profile of this national language framework among scholars and practitioners.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0246526
Author(s):  
John Duffy ◽  
Ted Loch-Temzelides

We study a sequence of “double-slit” experiments designed to perform repeated measurements of an attribute in a large pool of subjects using Amazon’s Mechanical Turk. Our findings contrast with the prescriptions of decision theory in novel and interesting ways. The response to an identical sequel measurement of the same attribute can be at significant variance with the initial measurement. Furthermore, the response to the sequel measurement depends on whether the initial measurement has taken place. In the absence of the initial measurement, the sequel measurement reveals additional variability, leading to a multimodal frequency distribution which is largely absent if the first measurement has taken place.


2017 ◽  
Vol 30 (1) ◽  
pp. 111-122 ◽  
Author(s):  
Steve Buchheit ◽  
Marcus M. Doxey ◽  
Troy Pollard ◽  
Shane R. Stinson

Multiple social science researchers claim that online data collection, mainly via Amazon's Mechanical Turk (MTurk), has revolutionized the behavioral sciences (Gureckis et al. 2016; Litman, Robinson, and Abberbock 2017). While MTurk-based research has grown exponentially in recent years (Chandler and Shapiro 2016), reasonable concerns have been raised about online research participants' ability to proxy for traditional research participants (Chandler, Mueller, and Paolacci 2014). This paper reviews recent MTurk research and provides further guidance for recruiting samples of MTurk participants from populations of interest to behavioral accounting researchers. First, we provide guidance on the logistics of using MTurk and discuss the potential benefits offered by TurkPrime, a third-party service provider. Second, we discuss ways to overcome challenges related to targeted participant recruiting in an online environment. Finally, we offer suggestions for disclosures that authors may provide about their efforts to attract participants and analyze responses.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Dmitri Rozgonjuk ◽  
Karin Täht ◽  
Kristjan Vassil

Background: The excessive use of Internet-based technologies has received considerable attention over the past years. Despite this, there is relatively little research on how general Internet usage patterns at and outside of school, as well as on weekends, may be associated with mathematics achievement. Moreover, only a handful of studies have applied a longitudinal or repeated-measures approach to this research question. The aim of the current study was to fill that gap. Specifically, we investigated the potential associations of Internet use at and outside of school, as well as on weekends, with mathematics test performance in both high- and low-stakes testing conditions over a period of 3 years in a representative sample of Estonian teenagers.

Methods: PISA 2015 survey data in conjunction with national educational registry data were used for the current study. Specifically, Internet use at and outside of school as well as on weekends was queried during the PISA 2015 survey. In addition, the data set included PISA mathematics test results from 4113 Estonian 9th-grade students. Furthermore, 3758 of these students also had a 9th-grade national mathematics exam score from a couple of months after the PISA survey. Finally, of these students, 12th-grade national mathematics exam scores were available for 1612 and 1174 students for “wide” (comprehensive) and “narrow” (less comprehensive) mathematics exams, respectively.

Results: The rather low-stakes PISA mathematics test scores correlated well with the high-stakes national mathematics exam scores obtained in the 9th grade (completed a couple of months after the PISA survey) and the 12th grade (completed approximately 3 years after the PISA survey), with correlation values ranging from r = .438 to .557. Furthermore, the socioeconomic status index was positively correlated with all mathematics scores (r = .162 to .305). Controlling for age and gender, the results also showed that students who reported the longest Internet use tended to have, on average, the lowest mathematics scores in all tests across the 3 years. Although effect sizes were generally small, they appeared more pronounced for Internet use at school.

Conclusions: Based on these results, significantly longer time spent on the Internet at and outside of school, as well as on weekends, may be associated with poorer mathematics performance. These results are somewhat in line with research outlining the potentially negative associations between longer time spent on digital technology and daily life outcomes.
