Test validation in interpreter certification performance testing

Interpreting, 2016, Vol. 18 (2), pp. 225-252
Author(s): Chao Han, Helen Slatyer

Over the past decade, interpreter certification performance testing has gained momentum. Certification tests often involve high stakes, since they can play an important role in regulating access to professional practice and serve to provide a measure of professional competence for end users. The decision to award certification is based on inferences from candidates’ test scores about their knowledge, skills and abilities, as well as their interpreting performance in a given target domain. To justify the appropriateness of score-based inferences and actions, test developers need to provide evidence that the test is valid and reliable through a process of test validation. However, there is little evidence that test qualities are systematically evaluated in interpreter certification testing. In an attempt to address this problem, this paper proposes a theoretical argument-based validation framework for interpreter certification performance tests so as to guide testers in carrying out systematic validation research. Before presenting the framework, validity theory is reviewed, and an examination of the argument-based approach to validation is provided. A validity argument for interpreter tests is then proposed, with hypothesized validity evidence. Examples of evidence are drawn from relevant empirical work, where available. Gaps in the available evidence are highlighted and suggestions for research are made.

Interpreting, 2015, Vol. 17 (2), pp. 255-283
Author(s): Chao Han

Rater-mediated performance assessment (RMPA) is a critical component of interpreter certification testing systems worldwide. Given the acknowledged rater variability in RMPA and the high-stakes nature of certification testing, it is crucial to ensure rater reliability in interpreter certification performance testing (ICPT). However, a review of current ICPT practice indicates that rigorous research on rater reliability is lacking. Against this background, the present study reports on use of multifaceted Rasch measurement (MFRM) to identify the degree of severity/leniency in different raters’ assessments of simultaneous interpretations (SIs) by 32 interpreters in an experimental setting. Nine raters specifically trained for the purpose were asked to evaluate four English-to-Chinese SIs by each of the interpreters, using three 8-point rating scales (information content, fluency, expression). The source texts differed in speed and in the speaker’s accent (native vs non-native). Rater-generated scores were then subjected to MFRM analysis, using the FACETS program. The following general trends emerged: 1) homogeneity statistics showed that not all raters were equally severe overall; and 2) bias analyses showed that a relatively large proportion of the raters had significantly biased interactions with the interpreters and the assessment criteria. Implications for practical rating arrangements in ICPT, and for rater training, are discussed.
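
For readers unfamiliar with the model behind the FACETS analysis, the sketch below illustrates the structure of a rating-scale many-facet Rasch model: the log-odds of a rater awarding category k rather than k-1 depend on interpreter ability, criterion difficulty, rater severity and a category threshold. It is a minimal illustration only; the function name and parameter values are invented and are not estimates from the study.

```python
import math

def mfrm_category_probs(ability, criterion_difficulty, rater_severity, thresholds):
    """Category probabilities for one interpreter-criterion-rater combination under a
    rating-scale many-facet Rasch model:
        log P(k)/P(k-1) = ability - criterion_difficulty - rater_severity - thresholds[k-1]
    """
    logits = [0.0]          # cumulative logit for the lowest category
    cum = 0.0
    for tau in thresholds:
        cum += ability - criterion_difficulty - rater_severity - tau
        logits.append(cum)
    denom = sum(math.exp(v) for v in logits)
    return [math.exp(v) / denom for v in logits]

# Hypothetical parameters (in logits): an able interpreter, a neutral criterion,
# a moderately severe rater, and seven step thresholds for an 8-category scale.
probs = mfrm_category_probs(
    ability=1.2,
    criterion_difficulty=0.0,
    rater_severity=0.5,
    thresholds=[-2.5, -1.5, -0.5, 0.0, 0.5, 1.5, 2.5],
)
expected_score = sum(k * p for k, p in enumerate(probs))
print("category probabilities:", [round(p, 3) for p in probs])
print("expected score (0-7 scale):", round(expected_score, 2))
```

In this formulation, a more severe rater (larger severity value) shifts probability mass toward lower categories for the same interpreter, which is exactly the kind of between-rater difference the MFRM analysis quantifies.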


2021, Vol. 23 (1), pp. 69-85
Author(s): Hemank Lamba, Kit T. Rodolfa, Rayid Ghani

Applications of machine learning (ML) to high-stakes policy settings such as education, criminal justice, healthcare, and social service delivery have grown rapidly in recent years, sparking important conversations about how to ensure fair outcomes from these systems. The machine learning research community has responded to this challenge with a wide array of proposed fairness-enhancing strategies for ML models, but despite the large number of methods that have been developed, little empirical work exists evaluating these methods in real-world settings. Here, we seek to fill this research gap by investigating the performance of several methods that operate at different points in the ML pipeline across four real-world public policy and social good problems. Across these problems, we find wide variability and inconsistency in the ability of many of these methods to improve model fairness. However, post-processing by choosing group-specific score thresholds consistently removes disparities, with important implications for both the ML research community and practitioners deploying machine learning to inform consequential policy decisions.
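
The post-processing strategy that consistently removed disparities can be sketched as follows: instead of one global decision threshold, each group receives its own threshold chosen so that a selected fairness metric (recall, i.e. the true positive rate, in this toy example) is roughly equalized across groups. The function names, synthetic data and target value below are hypothetical and are not taken from the paper.

```python
import numpy as np

def group_thresholds_for_equal_recall(scores, labels, groups, target_recall=0.6):
    """Pick one score threshold per group so each group's recall is ~target_recall.

    scores: model scores in [0, 1]; labels: true 0/1 outcomes; groups: group id per row.
    Returns {group_id: threshold}.
    """
    thresholds = {}
    for g in np.unique(groups):
        pos_scores = scores[(groups == g) & (labels == 1)]
        if len(pos_scores) == 0:
            thresholds[g] = 0.5  # fallback when a group has no observed positives
            continue
        # Threshold at the (1 - target_recall) quantile of that group's positive scores,
        # so roughly target_recall of its positives score at or above the threshold.
        thresholds[g] = float(np.quantile(pos_scores, 1.0 - target_recall))
    return thresholds

def predict_with_group_thresholds(scores, groups, thresholds):
    return np.array([scores[i] >= thresholds[g] for i, g in enumerate(groups)], dtype=int)

# Tiny synthetic example: two groups with different score distributions.
rng = np.random.default_rng(0)
groups = np.repeat(["A", "B"], 500)
labels = rng.binomial(1, 0.3, size=1000)
scores = np.clip(rng.normal(0.4 + 0.3 * labels + 0.1 * (groups == "A"), 0.15), 0, 1)

th = group_thresholds_for_equal_recall(scores, labels, groups)
preds = predict_with_group_thresholds(scores, groups, th)
for g in ["A", "B"]:
    pos = (groups == g) & (labels == 1)
    print(g, "recall:", round(preds[pos].mean(), 2), "threshold:", round(th[g], 2))
```

The design choice is deliberately simple: because the thresholds are fitted directly to the decision of interest, the approach does not depend on how the underlying model was trained, which is one reason such post-processing can behave more consistently than in-processing interventions.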


Author(s): NORITA AHMAD, PHILLIP A. LAPLANTE

This work involves the novel application of a traditional technique for multivariate decision making to a complex problem in software selection. In particular, we formalize the problem statement "How does one select an appropriate commercial real-time operating system for a specific application?" and review two previous solutions to this decision-making problem. Along the way, we consider various aspects of the theoretical argument, broaden its application, and address the deficiencies of the two other solution strategies. We then present an improved solution to the problem using the well-developed Analytical Hierarchy Process (AHP), which is not traditionally used by software engineers, and demonstrate the method using the same criteria as the previous approaches. By explicitly representing preferences, providing tools that allow users to set and inspect their judgments, and giving users a systematic evaluation procedure, the contribution helps the decision maker identify an appropriate real-time operating system without the need for intensive performance testing.
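
As a rough illustration of the AHP machinery referred to above, the sketch below derives criterion weights from the principal eigenvector of a pairwise-comparison matrix and computes Saaty's consistency ratio. The example criteria and judgment values are hypothetical and are not those used in the paper.

```python
import numpy as np

def ahp_weights(pairwise):
    """Priority weights and consistency ratio from an AHP pairwise-comparison matrix."""
    A = np.asarray(pairwise, dtype=float)
    n = int(A.shape[0])
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)                       # principal eigenvalue
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                                   # normalized priority weights
    lam_max = eigvals[k].real
    ci = (lam_max - n) / (n - 1)                      # consistency index
    ri = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32}[n]  # random index
    cr = ci / ri if ri > 0 else 0.0                   # consistency ratio (< 0.10 is acceptable)
    return w, cr

# Hypothetical pairwise judgments over three RTOS selection criteria,
# e.g. (interrupt latency, memory footprint, vendor support).
A = [[1,   3,   5],
     [1/3, 1,   2],
     [1/5, 1/2, 1]]
weights, cr = ahp_weights(A)
print("weights:", np.round(weights, 3), "consistency ratio:", round(cr, 3))
```

In a full AHP exercise the same procedure is repeated for each criterion to score the candidate operating systems, and the criterion weights are then used to aggregate those scores into an overall ranking.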


2019, Vol. 112 (6), pp. 236-244
Author(s): Katherine Woolf, Michael Page, Rowena Viney

The Annual Review of Competence Progression is used to determine whether trainee doctors in the United Kingdom are safe and competent to progress to the next training stage. In this article we provide evidence to inform recommendations to enhance the validity of the summative and formative elements of the Annual Review of Competence Progression. The work was commissioned as part of a Health Education England review. We systematically searched the peer-reviewed and grey literature, synthesised the findings with information from national, local and specialty-specific Annual Review of Competence Progression guidance, and critically evaluated the findings in the context of the literature on assessing competence in medical education. National guidance lacked detail, resulting in variability across locations and specialties and threatening validity and reliability. Trainees and trainers were concerned that the Annual Review of Competence Progression only reliably identifies the most poorly performing trainees. Feedback is not routinely provided, which can leave those with performance difficulties unsupported and leave high performers demotivated. Variability in the provision and quality of feedback can negatively affect learning. The Annual Review of Competence Progression functions as a high-stakes assessment, likely to have a significant impact on patient care. It should therefore be subject to the same rigorous evaluation as other high-stakes assessments; there should be consistency in procedures across locations, specialties and grades; and all trainees should receive high-quality feedback.


2016, Vol. 35 (3), pp. 291-301
Author(s): James L. Floman, Carolin Hagelskamp, Marc A. Brackett, Susan E. Rivers

Classroom observations increasingly inform high-stakes decisions and research in education, including the allocation of school funding and the evaluation of school-based interventions. However, trends in rater scoring tendencies over time may undermine the reliability of classroom observations. Accordingly, the present investigations, grounded in social psychology research on emotion and judgment, propose that state emotion may constitute a source of psychological bias in raters’ classroom observations. In two studies, employing independent sets of raters and approximately 5,000 videotaped fifth- and sixth-grade classroom interactions, within-rater state positive emotion was associated with favorable ratings of classroom quality using the Classroom Assessment Scoring System (CLASS). Despite various protections enacted to secure reliable and valid observations in the face of rater trends—including professional training, certification testing, and routine calibration meetings—emotional bias still emerged. Study limitations and implications for classroom observation methodology are considered.
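
One simple way to express a within-rater association of the kind reported above is to regress ratings on emotion scores that have been centered within each rater, which removes stable between-rater differences in both mood and scoring tendency. The sketch below is a minimal illustration under that assumption; it is not the study's actual analysis, and the data are synthetic.

```python
import numpy as np

def within_rater_association(rater_ids, emotion, ratings):
    """Slope of ratings on rater-mean-centered emotion (a within-rater association)."""
    rater_ids = np.asarray(rater_ids)
    emotion = np.asarray(emotion, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    centered = emotion.copy()
    for r in np.unique(rater_ids):
        m = rater_ids == r
        centered[m] -= emotion[m].mean()      # remove each rater's own average mood
    # Simple least-squares slope of ratings on centered emotion.
    X = np.column_stack([np.ones_like(centered), centered])
    beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return beta[1]

# Tiny synthetic example: three raters, each scoring several videotaped segments.
rater_ids = ["r1"] * 4 + ["r2"] * 4 + ["r3"] * 4
emotion = [2, 3, 4, 5, 1, 2, 2, 3, 3, 4, 5, 5]                        # state positive emotion
ratings = [4.0, 4.2, 4.6, 4.9, 3.0, 3.1, 3.2, 3.5, 5.0, 5.3, 5.6, 5.7]  # CLASS-style scores
print("within-rater slope:", round(within_rater_association(rater_ids, emotion, ratings), 3))
```

A positive slope here would mean that a rater tends to award higher classroom-quality scores on occasions when their own positive emotion is above their personal average, which is the pattern described in the abstract.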


2020, Vol. 2
Author(s): Noel Castree

There's little doubt that a variety of carbon dioxide removal (CDR) techniques will be employed worldwide in the decades and centuries to come. Together, these techniques will alter the character and functioning of the biosphere, hydrosphere, cryosphere, pedosphere, and atmosphere. More locally, they will have immediate impacts on people and place, within diverse nation-state contexts. However, for the moment CDR exists more in the realm of discourse than reality. Its future roll-out in many and varied forms will depend on a series of discussions in the governmental, commercial, and civic spheres. Metaphor will be quite central to these formative discussions. Metaphors serve to structure perceptions of unfamiliar phenomena by transferring meaning from a recognized "source" domain to a new "target" domain. They can be employed in more or less felicitous, more or less noticeable, more or less defensible ways. Metaphors help to govern future action by framing present-day understandings of a world to come. To govern metaphor itself may seem as foolhardy as attempting to sieve water or converse with rocks. Yet by rehearsing some old lessons about metaphor we stand some chance of responsibly steering its employment in unfolding debates about CDR techniques and their practical governance globally. This Perspective identifies some key elements of metaphor's use that will require attention in the different contexts where CDR techniques presently get (and will in future be) discussed meaningfully. Various experts involved in CDR development and deployment have an important, though not controlling, role to play in how it gets metaphorized. This matters in our age of populism, rhetoric, misinformation, and disinformation, where the willful (mis)use of certain metaphors threatens to depoliticize, polarize, or simplify future debates about CDR. What is needed is "post-normal" discourse, in which high-stakes decisions made in the context of epistemic uncertainty are informed by clear reasoning among disparate parties whose values diverge.


2019, Vol. 35 (1)
Author(s): Dinh Minh Thu

Validity, along with reliability, has long played a fundamental role in research on language testing and assessment (Bachman & Palmer, 1996). This paper analyses basic theories and empirical research on language test validity in order to clarify the notion and classification of language test validity, the working frameworks for validation, and the trends in empirical research. Four key findings emerge from the analysis. Firstly, language test validity refers to an evaluative judgment of language test quality, made on the basis of evidence about the integrated components of test content, criterion and consequences, through interpretation of the meaning and utility of test scores. Secondly, construct validity is the dominant term in modern validity classification, and its division into a priori and a posteriori validity can give researchers a clearer choice of validation approach. Thirdly, test validation can be grounded in the frameworks of Messick (1989), Bachman (1996) and Weir (2005). Finally, almost all of the empirical research on test validity reviewed here concerns international and national high-stakes proficiency tests. These results point to gaps that future test validation research should address.


Author(s): A.M. POLIAKOV, P.K. SOPIN, V.B. LAZAREV, A.I. RYZHKOV, M.A. KOLESOVA, ...

Over the past several years, the first real prototypes of robotic prostheses have been developed in Russia, but they still cannot compete with the best foreign designs and are practically not in demand in the domestic market for assistive devices. One of the obstacles is the absence in Russia of test stands for certification testing of robotic prostheses. In addition, the use of such stands at the design stage would significantly simplify the synthesis of high-quality control systems and would not require the participation of disabled volunteers for testing and adjustment. However, there are currently no analogues of such test benches in Russia. This paper describes a conceptual design of a mechatronic stand-simulator capable of reproducing biosimilar movements of a human lower limb in various modes of physical activity and under variable environmental parameters, intended for parameter adjustment and for laboratory and preclinical testing of the structures and control systems of active lower-limb prostheses and their modules. The stand can easily be upgraded for certification testing of robotic lower-limb prostheses.


Author(s): Bernadette Imuetinyan Iyawe

A growing area of research in Human-Computer Interaction (HCI) and Human-Robot Interaction (HRI) is the development of haptic-based user performance testing. Usability forms a vital part of these test objectives, and as a result diverse usability methods, strategies and test features are being employed. Even with robust haptic-based user performance testing features, however, user performance still presents challenges. It is therefore vital to identify the direction and effectiveness of these methods, strategies and test features, as well as the improvements required in test objectives and evaluation. This chapter investigates the challenges of user performance and the user performance indicators used in HCI and HRI research involving haptic-based tests, and presents a User Performance Indicator Tool (UPIT) as a test validation tool to aid designers and testers in enhancing their user performance tests and test evaluation outcomes.


Author(s): Amit A. Kale, Raphael T. Haftka

This paper demonstrates the effect of various safety measures used to design aircraft structures for damage tolerance. In addition, it sheds light on the effectiveness of measures like certification tests in improving structural safety. Typically, aircraft are designed with a safety factor of 2 on service life, in addition to other safety measures such as conservative material properties. The paper demonstrates that small variations in material properties and loading, together with errors in modeling damage growth, can produce large scatter in fatigue life, which means that quality-control measures like certification tests are not very effective in reducing failure probability. However, it is shown that the use of machined cracks in certification tests can substantially increase the effectiveness of certification testing.
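
The argument can be made concrete with a toy Monte Carlo sketch: modest scatter in material properties and loading is amplified by a fatigue-type exponent into large scatter in life, and a single certification test article per design screens out only part of the resulting failure probability. The life model, distributions and numbers below are invented for illustration and are not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N_DESIGNS = 20000
SERVICE_LIFE = 1.0        # normalized service life
SAFETY_FACTOR = 2.0       # design target: predicted life >= 2 x service life
CONSERVATISM = 1.2        # extra margin from conservative material properties (toy value)
M = 4.0                   # fatigue exponent: life ~ (stress ratio)**M, amplifies scatter

# Per-design systematic modeling error, shared by every aircraft built to that design.
model_error = rng.normal(1.0, 0.15, N_DESIGNS)

def aircraft_life(n_per_design):
    """True life of n_per_design aircraft for each design (toy model)."""
    material = rng.normal(1.0, 0.08, (N_DESIGNS, n_per_design))   # material scatter
    loading = rng.normal(1.0, 0.08, (N_DESIGNS, n_per_design))    # loading scatter
    return (SAFETY_FACTOR * CONSERVATISM * model_error[:, None]
            * material ** M / loading ** M)

# Certification: one test article per design must survive SAFETY_FACTOR service lives.
test_article = aircraft_life(1)[:, 0]
passed = test_article >= SAFETY_FACTOR * SERVICE_LIFE

fleet = aircraft_life(50)                    # 50 aircraft per design
fails = fleet < SERVICE_LIFE                 # failure before the end of service life
print("P(fail), all designs:            ", round(fails.mean(), 4))
print("P(fail), certified designs only: ", round(fails[passed].mean(), 4))
```

Because the test article screens mainly the shared modeling error while most of the life scatter is aircraft-to-aircraft, the reduction in failure probability from certification is modest in this toy model, which mirrors the paper's point that deliberately severe test conditions (such as machined cracks) are needed to make the test a more effective filter.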

