Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

2022 ◽  
Author(s):  
Anubha Kabra ◽  
Mehar Bhatia ◽  
Yaman Kumar Singla ◽  
Junyi Jessy Li ◽  
Rajiv Ratn Shah

2014 ◽  
Vol 22 (2) ◽  
pp. 291-319 ◽  
Author(s):  
SHUDONG HAO ◽  
YANYAN XU ◽  
DENGFENG KE ◽  
KAILE SU ◽  
HENGLI PENG

Abstract: Writing in language tests is regarded as an important indicator for assessing the language skills of test takers. As Chinese language tests become popular, scoring a large number of essays becomes a heavy and expensive task for the organizers of these tests. In the past several years, some efforts have been made to develop automated simplified Chinese essay scoring systems, reducing both costs and evaluation time. In this paper, we introduce a system called SCESS (automated Simplified Chinese Essay Scoring System) based on Weighted Finite State Automata (WFSA) and using Incremental Latent Semantic Analysis (ILSA) to deal with a large number of essays. First, SCESS uses an n-gram language model to construct a WFSA to perform text pre-processing. At this stage, the system integrates a Confusing-Character Table, a Part-Of-Speech Table, beam search and heuristic search to perform automated word segmentation and correction of essays. Experimental results show that this pre-processing procedure is effective, with a Recall Rate of 88.50%, a Detection Precision of 92.31% and a Correction Precision of 88.46%. After text pre-processing, SCESS uses ILSA to perform automated essay scoring. We have carried out experiments comparing the ILSA method with the traditional LSA method on corpora of essays from the MHK test (the Chinese proficiency test for minorities). Experimental results indicate that ILSA has a significant advantage over LSA in terms of both running time and memory usage. Furthermore, experimental results also show that SCESS is quite effective, with a scoring performance of 89.50%.
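
The core scoring idea behind LSA-based systems like this one (represent essays in a latent semantic space and compare them with pre-scored reference essays) can be sketched in a few lines. The sketch below is a minimal illustration, not the SCESS or ILSA implementation: the toy term-document matrix, the number of latent dimensions k, and the similarity-weighted scoring rule are all assumptions for demonstration.

```python
# Minimal sketch of LSA-based essay scoring by similarity to pre-scored
# reference essays. NOT the SCESS/ILSA implementation: the data, k, and the
# scoring rule are illustrative assumptions.
import numpy as np

def build_lsa_space(term_doc: np.ndarray, k: int):
    """SVD of the term-document matrix; keep the top-k latent dimensions."""
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(term_vec: np.ndarray, U_k: np.ndarray, s_k: np.ndarray) -> np.ndarray:
    """Project a new essay's term-count vector into the latent space."""
    return term_vec @ U_k / s_k

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy data: rows are vocabulary terms, columns are human-scored reference essays.
reference = np.array([[2, 0, 1],
                      [1, 1, 0],
                      [0, 3, 1],
                      [1, 0, 2]], dtype=float)
reference_scores = np.array([85.0, 60.0, 75.0])

U_k, s_k = build_lsa_space(reference, k=2)
ref_latent = [fold_in(reference[:, j], U_k, s_k) for j in range(reference.shape[1])]

new_essay = np.array([1, 1, 1, 0], dtype=float)     # term counts of an unseen essay
new_latent = fold_in(new_essay, U_k, s_k)

# Score the essay as a similarity-weighted average of the reference scores.
sims = np.array([max(cosine(new_latent, r), 0.0) for r in ref_latent])
predicted = float(sims @ reference_scores / (sims.sum() + 1e-12))
print(f"predicted score: {predicted:.1f}")
```

The incremental variant (ILSA) advocated in the paper avoids recomputing the full decomposition as new essays arrive, which is where the reported running-time and memory advantages over plain LSA come from.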


Author(s):  
Dougal Hutchison

This chapter gives a relatively non-technical introduction to computer programs for marking of essays, generally known as Automated Essay Scoring (AES) systems. It identifies four stages in the process, which may be distinguished as training, summarising mechanical and structural aspects, describing content, and scoring, and describes how these are carried out in a number of commercially available programs. It considers how the validity of the process may be assessed, and reviews some of the evidence on how successful they are. It also discusses some of the ways in which they may fall down and describes some research investigating this. The chapter concludes with a discussion of possible future developments, and offers a number of searching questions for administrators considering the possibility of introducing AES in their own schools.
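
As a rough illustration of the four stages identified above (training, summarising mechanical and structural aspects, describing content, and scoring), the sketch below fits a tiny regression on human-scored essays using a few surface features and one content feature, then scores a new essay. It is not any commercial program's actual design; every feature choice and the toy training set are assumptions.

```python
# Hypothetical four-stage AES sketch: featurize, train on scored essays, score
# a new essay. Feature choices and data are assumptions, not a real system.
import numpy as np

def mechanical_features(essay: str) -> list[float]:
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return [float(len(words)),                                        # essay length
            float(len(sentences)),                                    # sentence count (structure)
            float(np.mean([len(w) for w in words])) if words else 0.0]  # mean word length

def content_feature(essay: str, prompt_keywords: set[str]) -> float:
    words = {w.lower().strip(".,") for w in essay.split()}
    return len(words & prompt_keywords) / (len(prompt_keywords) or 1)

def featurize(essay: str, prompt_keywords: set[str]) -> np.ndarray:
    return np.array(mechanical_features(essay) + [content_feature(essay, prompt_keywords)])

# Stage 1: "train" on a toy set of human-scored essays.
prompt_kw = {"school", "technology", "learning"}
train_essays = ["Technology helps learning in school. It is useful.",
                "I like dogs. Dogs are nice."]
train_scores = np.array([4.0, 1.0])
X = np.vstack([featurize(e, prompt_kw) for e in train_essays])
X = np.hstack([X, np.ones((X.shape[0], 1))])             # add a bias column
weights, *_ = np.linalg.lstsq(X, train_scores, rcond=None)

# Stage 4: score a new essay with the fitted weights.
new_essay = "Learning with technology changes how school works."
x = np.append(featurize(new_essay, prompt_kw), 1.0)
print(f"predicted score: {float(x @ weights):.2f}")
```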


2019 ◽  
Vol 5 ◽  
pp. e208 ◽  
Author(s):  
Mohamed Abdellatif Hussein ◽  
Hesham Hassan ◽  
Mohammad Nassef

Background: Writing composition is a significant factor for measuring test-takers' ability in any language exam. However, the assessment (scoring) of these writing compositions or essays is a very challenging process in terms of reliability and time. The need for objective and quick scores has raised the need for a computer system that can automatically grade essay questions targeting specific prompts. Automated Essay Scoring (AES) systems are used to overcome the challenges of scoring writing tasks by using Natural Language Processing (NLP) and machine learning techniques. The purpose of this paper is to review the literature on AES systems used for grading essay questions.

Methodology: We have reviewed the existing literature using Google Scholar, EBSCO and ERIC, searching for the terms "AES", "Automated Essay Scoring", "Automated Essay Grading", or "Automatic Essay" for essays written in English. Two categories have been identified: handcrafted-features and automatically featured AES systems. The systems of the former category are closely bound to the quality of the designed features. The systems of the latter category, on the other hand, automatically learn the features and the relations between an essay and its score without any handcrafted features. We reviewed the systems of the two categories in terms of primary focus, technique(s) used, the need for training data, instructional application (feedback system), and the correlation between e-scores and human scores. The paper includes three main sections. First, we present a structured literature review of the available handcrafted-features AES systems. Second, we present a structured literature review of the available automatic-featuring AES systems. Finally, we draw a set of discussions and conclusions.

Results: AES models have been found to utilize a broad range of manually tuned shallow and deep linguistic features. AES systems have many strengths: reducing labor-intensive marking activities, ensuring a consistent application of scoring criteria, and ensuring the objectivity of scoring. Although many techniques have been implemented to improve AES systems, three primary challenges have been identified: the lack of the sense of the rater as a person, the potential for the systems to be deceived into giving an essay a lower or higher score than it deserves, and the limited ability to assess the creativity of ideas and propositions and to evaluate their practicality. So far, techniques have only been used to address the first two challenges.
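
Since the review compares systems partly by the correlation between e-scores and human scores, the following brief sketch shows two agreement statistics commonly reported in this literature: Pearson correlation and quadratically weighted kappa. The score arrays are invented examples, not results from any reviewed system.

```python
# Agreement between automated and human scores on a toy, invented sample.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human   = [3, 4, 2, 5, 4, 3, 1, 4]
machine = [3, 4, 3, 5, 3, 3, 2, 4]

r, _ = pearsonr(human, machine)
qwk = cohen_kappa_score(human, machine, weights="quadratic")
print(f"Pearson r = {r:.2f}, quadratic weighted kappa = {qwk:.2f}")
```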


2019 ◽  
Vol 58 (4) ◽  
pp. 771-790
Author(s):  
Leyi Qian ◽  
Yali Zhao ◽  
Yan Cheng

Automated writing scoring can provide not only holistic scores but also instant, corrective feedback on L2 learners' writing quality, and its use has been increasing in China and internationally. Given these advantages, the past several years have witnessed the emergence and growth of writing evaluation products in China, yet to the best of our knowledge no previous study has examined the validity of China's automated essay scoring systems. Drawing on the four major categories of Kane's argument-based validity framework (scoring, generalization, extrapolation, and implication), this article evaluates the performance of iWrite, one of China's automated essay scoring systems, against human scores. The results show that iWrite is not a valid tool for assessing L2 writing or predicting human scores. Therefore, iWrite should currently be restricted to nonconsequential uses and cannot be employed as an alternative to, or a substitute for, human raters.
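
Validity studies of this kind typically report, alongside correlations, how often the machine score matches the human score exactly or lands within one score band of it. The sketch below shows how those two figures are computed; the numbers are invented and do not come from the article's iWrite data.

```python
# Exact and adjacent (within one band) agreement on invented score pairs.
import numpy as np

human        = np.array([12, 10, 14, 9, 11, 13, 8, 12])
machine_like = np.array([11, 10, 14, 11, 11, 12, 9, 12])   # hypothetical machine scores

diff = np.abs(human - machine_like)
exact = np.mean(diff == 0)
adjacent = np.mean(diff <= 1)
print(f"exact agreement: {exact:.0%}, adjacent agreement: {adjacent:.0%}")
```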

