Assessing writing performance in TOEFL-iBT

2020 ◽  
Vol 13 (1) ◽  
pp. 84-107
Author(s):  
Farah Shooraki ◽  
Hossein Barati ◽  
Ahmad Moinzadeh

Abstract: This study aims to determine the linguistic and discoursal differences in essays produced by Iranian TOEFL-iBT test-takers in response to integrated and independent writing tasks. A sample of 40 scored essays, written by 20 Iranian test-takers on the integrated and independent writing tasks, was compared and analyzed in terms of four latent constructs: text easability (fourteen variables), cohesion (nine variables), lexical sophistication (nineteen variables), and syntactic complexity (six variables), using the Coh-Metrix 3.0 program. Results indicate differences in the linguistic and discoursal features of the integrated and independent writing tasks. The findings reveal that EFL test-takers’ scores on the writing tasks can be anchored empirically through the analysis of discourse qualities such as cohesion. Independent tasks contain more connectives and particles, which can support better organization of discourse structure and the generation of more cohesive devices. Stakeholders of the test should verify test constructs with respect to particular contexts, such as EFL settings, and communicative views of language proficiency. Consequently, the findings contribute to the ongoing validity argument on TOEFL-iBT writing tasks and to the design and interpretation of scoring schemes for the writing component of the test.
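The cohesion finding above can be made concrete with a rough illustration. Coh-Metrix is a dedicated tool whose indices are not reproduced here; the sketch below is only a simplified proxy for one cohesion index, the incidence of connectives per 1,000 words, compared across two hypothetical groups of essays. The connective list and example texts are placeholders, not materials from the study.

# Illustrative sketch (not Coh-Metrix): a rough proxy for one cohesion index,
# the incidence of connectives per 1,000 words, compared across two task types.
# The essay texts and the connective list are hypothetical placeholders.
import re
from statistics import mean

CONNECTIVES = {"however", "therefore", "moreover", "because", "although",
               "furthermore", "consequently", "thus", "whereas", "also"}

def connective_incidence(text: str) -> float:
    """Connectives per 1,000 words in a single essay."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in CONNECTIVES)
    return 1000 * hits / len(tokens)

def compare_task_types(integrated: list[str], independent: list[str]) -> dict:
    """Mean incidence of connectives for each group of essays."""
    return {
        "integrated": mean(connective_incidence(e) for e in integrated),
        "independent": mean(connective_incidence(e) for e in independent),
    }

# If independent essays contain more connectives, as the study reports,
# they would show the higher mean incidence here.
print(compare_task_types(
    ["The lecture contradicts the reading because ..."],
    ["I agree with this statement. However, ... Therefore, ..."],
))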

2010 ◽  
Vol 27 (3) ◽  
pp. 335-353 ◽  
Author(s):  
Sara Cushing Weigle

Automated scoring has the potential to dramatically reduce the time and costs associated with the assessment of complex skills such as writing, but its use must be validated against a variety of criteria for it to be accepted by test users and stakeholders. This study approaches validity by comparing human and automated scores on responses to TOEFL® iBT Independent writing tasks with several non-test indicators of writing ability: student self-assessment, instructor assessment, and independent ratings of non-test writing samples. Automated scores were produced using e-rater®, developed by Educational Testing Service (ETS). Correlations between both human and e-rater scores and non-test indicators were moderate but consistent, providing criterion-related validity evidence for the use of e-rater along with human scores. The implications of the findings for the validity of automated scores are discussed.
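As a rough illustration of the criterion-related analysis described above, the following sketch correlates human and automated scores with non-test indicators. All variable names and values are simulated placeholders for demonstration only, not the study’s data or e-rater’s output.

# A minimal sketch of criterion-related validity evidence: correlating human
# and automated scores with non-test indicators of writing ability.
# All data below are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 120  # hypothetical number of test-takers

human = rng.normal(3.5, 0.8, n)                    # human holistic scores
erater = human + rng.normal(0, 0.5, n)             # automated scores
self_assess = 0.4 * human + rng.normal(0, 1, n)    # student self-assessment
instructor = 0.5 * human + rng.normal(0, 1, n)     # instructor ratings

for crit_label, criterion in [("self-assessment", self_assess),
                              ("instructor rating", instructor)]:
    for score_label, scores in [("human", human), ("e-rater", erater)]:
        r, p = pearsonr(scores, criterion)
        print(f"{score_label} vs {crit_label}: r = {r:.2f} (p = {p:.3f})")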


2019 ◽  
Vol 3 (3) ◽  
pp. p251
Author(s):  
Alqahtani Mofareh A

“English is the only foreign language taught in Saudi schools as part of the mandatory curriculum and therefore enjoys a relatively high status” (Carfax Educational Projects, 2016, p. 10). The teaching of English as a Foreign Language (EFL/L2) within the basic curriculum of Saudi Arabia commences in the fourth grade. However, in spite of the best efforts of the Saudi Ministry of Education (MoE) to develop English learning in schools, the language proficiency of Saudi high school leavers remains insufficient to carry out even basic interactions, let alone undertake university study through the medium of English (Al-Johani, 2009; Al-Seghayer, 2014; Alhawsawi, 2013; Alrabai, 2016; Khan, 2011; Rajab, 2013). In fact, recent Test of English as a Foreign Language data (TOEFL iBT, 2017) showed an overall average score of 64 out of 120 for Saudis who took the TOEFL iBT between January and December 2016. This paper therefore seeks to examine the factors responsible for the low EFL performance of Saudi students on completion of their high school studies. In order to do so, the researcher randomly selected 60 school leavers and 30 teachers, who took part in interviews designed to elicit the underlying causes of such poor English proficiency. The results revealed that the reasons fall into a number of discrete categories related to the student, the teacher, the learning environment, and the curriculum.


2010 ◽  
Vol 27 (3) ◽  
pp. 317-334 ◽  
Author(s):  
Mary K. Enright ◽  
Thomas Quinlan

E-rater® is an automated essay scoring system that uses natural language processing techniques to extract features from essays and to statistically model human holistic ratings. Educational Testing Service has investigated the use of e-rater, in conjunction with human ratings, to score one of the two writing tasks on the TOEFL-iBT® writing section. In this article we describe the TOEFL iBT writing section and an e-rater model proposed to provide one of two ratings for the Independent writing task. We discuss how the evidence for a process that uses both human and e-rater scoring is relevant to four components in a validity argument: (a) Evaluation — observations of performance on the writing task are scored to provide evidence of targeted writing skills; (b) Generalization — scores on the writing task provide estimates of expected scores over relevant parallel versions of the task and across raters; (c) Extrapolation — expected scores on the writing task are consistent with other measures of writing ability; and (d) Utilization — scores on the writing task are useful in educational contexts. Finally, we propose directions for future research that will strengthen the case for using complementary methods of scoring to improve the assessment of EFL writing.
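The general approach described above, extracting text features and statistically modeling human holistic ratings, can be sketched in simplified form. The features, training data, and model below are illustrative placeholders and do not reflect e-rater’s actual feature set, data, or scoring model.

# A schematic sketch of feature-based essay scoring: extract simple text
# features and fit a regression model to human holistic ratings.
# Everything here is a simplified placeholder, not e-rater itself.
import numpy as np
from sklearn.linear_model import LinearRegression

def features(essay: str) -> list[float]:
    tokens = essay.lower().split()
    n = len(tokens) or 1
    return [
        float(n),                      # essay length in tokens
        sum(map(len, tokens)) / n,     # mean word length
        len(set(tokens)) / n,          # type-token ratio
    ]

# Hypothetical training data: essays paired with human holistic scores (0-5).
essays = ["Short answer.",
          "A longer, more developed response with varied vocabulary ..."]
human_scores = [1.0, 4.0]

model = LinearRegression().fit([features(e) for e in essays], human_scores)
print(model.predict([features("Another unseen essay to be scored ...")]))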


Jezikoslovlje ◽  
2019 ◽  
Vol 20 (3) ◽  
pp. 555-582
Author(s):  
Ervin Kovačević

Although the relationship between language proficiency and learner beliefs is generally viewed as weak, indirect, and distant, there are empirical findings which show that the relationship between syntactic complexity measures and language learning beliefs is statistically tangible. Since syntactic complexity is only one constituent of the linguistic complexity system, it seems plausible to question whether other constituents of the system are also in statistically measurable relationships with language learning beliefs. This research project explores the relationship between 25 lexical complexity measures (Lu 2012; 2014) and four subscales of language learning beliefs that are suggested for Horwitz’s (2013) Beliefs about Language Learning Inventory—BALLI 2.0 (Kovačević 2017). For three semesters (Fall 2014, Spring and Fall 2015), 152 freshman students at the International University of Sarajevo responded to BALLI 2.0 and wrote in-class exam essays which were converted into an electronic format. The results show 15 statistically significant correlation coefficients between 14 lexical complexity measures and three BALLI 2.0 subscales. Overall, it may be concluded that the relationship between lexical complexity measures and language learning beliefs is statistically detectable. The findings imply that the lexical complexity framework offers valuable opportunities for exploring how and to what extent particular individual differences manifest in foreign language production.
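As a rough illustration of the kind of measures involved, the sketch below computes two simplified proxies for lexical complexity (lexical density and root type-token ratio) and correlates them with a hypothetical belief subscale score. Lu’s analyzer computes its 25 measures from properly processed text; the stop-word list, essays, and scores here are invented placeholders, not data from this study.

# Illustrative proxies only: lexical density and root type-token ratio,
# correlated with a hypothetical belief subscale score per student.
import math
from scipy.stats import pearsonr

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "it"}

def lexical_density(text: str) -> float:
    tokens = text.lower().split()
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens) if tokens else 0.0

def root_ttr(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / math.sqrt(len(tokens)) if tokens else 0.0

# Hypothetical essays and belief subscale scores for the same three students.
essays = [
    "The results of the study were clear and the method was simple.",
    "Empirical evidence substantiates nuanced claims regarding proficiency.",
    "It was a good test and it was not hard.",
]
belief_scores = [3.2, 4.1, 2.8]

for name, measure in [("lexical density", lexical_density),
                      ("root TTR", root_ttr)]:
    values = [measure(e) for e in essays]
    r, p = pearsonr(values, belief_scores)
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")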


2017 ◽  
Vol 35 (4) ◽  
pp. 529-556 ◽  
Author(s):  
Yasuyo Sawaki ◽  
Sandip Sinharay

The present study examined the reliability of the reading, listening, speaking, and writing section scores for the TOEFL iBT® test and their interrelationship in order to collect empirical evidence to support, respectively, the generalization inference and the explanation inference in the TOEFL iBT validity argument (Chapelle, Enright, & Jamieson, 2008). By combining Haberman’s (2008) subscore analysis and confirmatory factor analysis (CFA), data from four operational TOEFL iBT test administrations were analyzed for all examinees and three major native language (L1) groups (Arabic, Korean, and Spanish). Key results were consistent across the forms and samples. First, Haberman’s (2008) subscore analysis suggested that the reliabilities of the section scores were generally satisfactory, although the reliability of the writing section was relatively low. Second, Haberman’s subscore analysis and CFA offered different degrees of support for the distinctness of the TOEFL iBT section scores. A subsequent multiple-group CFA based on a correlated four-factor model generally supported the measurement invariance across the L1 groups in terms of factor loadings as well as indicator residuals and intercepts, despite the population heterogeneity indicated by the partial invariance of the latent factor variances and differences in the latent factor means across the groups. In addition, Haberman’s subscore analysis suggested that the speaking section score offered value-added information owing to its generally high level of reliability and relative distinctness from the other three section scores, which is relevant to the utilization inference in the validity argument from the perspective of the psychometric quality of the TOEFL iBT section scores.
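For readers unfamiliar with the subscore analysis cited above, a common summary of Haberman’s (2008) value-added criterion, stated here from the general psychometric literature rather than from this article, compares the proportional reduction in mean squared error (PRMSE) obtained when the true section score is predicted from the observed section score s versus from the observed total score x:

\[
\mathrm{PRMSE}_s = \rho^{2}(s,\tau_s), \qquad \mathrm{PRMSE}_x = \rho^{2}(x,\tau_s),
\]

where \tau_s denotes the true section score. On this formulation, a section score is said to offer value-added information only when PRMSE_s exceeds PRMSE_x, that is, when the observed section score predicts its own true score better than the total score does.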


2010 ◽  
Vol 15 (4) ◽  
pp. 474-496 ◽  
Author(s):  
Xiaofei Lu

We describe a computational system for automatic analysis of syntactic complexity in second language writing using fourteen different measures that have been explored or proposed in studies of second language development. The system takes a written language sample as input and produces fourteen indices of syntactic complexity of the sample based on these measures. The system is designed with advanced second language proficiency research in mind, and is therefore developed and evaluated using college-level second language writing data from the Written English Corpus of Chinese Learners (Wen et al. 2005). Experimental results show that the system achieves very high reliability on unseen test data from the corpus. We illustrate how the system is used in an example application to investigate whether and to what extent each of these measures significantly differentiates among different proficiency levels.
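To give a sense of what such indices look like, the sketch below approximates one of the simpler ones, mean length of sentence (MLS), together with a very crude clauses-per-sentence proxy. Lu’s system relies on full syntactic parsing (the Stanford parser plus Tregex patterns); the heuristics, subordinator list, and example text here are simplifications assumed for illustration only.

# Illustrative approximation only: mean length of sentence plus a rough
# clause proxy computed from raw text; a real analyzer uses full parsing.
import re

SUBORDINATORS = {"because", "although", "when", "while", "if",
                 "that", "which", "who"}

def syntactic_sketch(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Crude clause estimate: one main clause per sentence plus one clause per
    # subordinator token (a real analyzer counts finite verb phrases instead).
    clauses = len(sentences) + sum(1 for w in words
                                   if w.lower() in SUBORDINATORS)
    return {
        "MLS (words per sentence)": len(words) / max(len(sentences), 1),
        "clauses per sentence (approx.)": clauses / max(len(sentences), 1),
    }

print(syntactic_sketch(
    "Writers who revise their drafts improve quickly because feedback "
    "accumulates. Short sentences lower the index."
))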


2016 ◽  
Vol 6 (1) ◽  
pp. 81 ◽  
Author(s):  
Sue Wang ◽  
Tammy Slater

Syntactic complexity has frequently been employed as an indicator of English learners’ language proficiency in language development assessment. Using the Syntactic Complexity Analyzer developed by Lu (2010), this article collected data representing syntactic complexity indexes from the writing of Chinese non-English-major students and from the writing of proficient users of English on a similar task. The results indicate that there is a significant difference in the use of complex nominals, the mean length of sentences, and the mean length of clauses between the writing of EFL Chinese students and that of more proficient users. This study provides suggestions for the teaching of EFL writing, particularly writing at the sentence level.
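The group comparison reported above can be illustrated with a minimal sketch: testing whether an index such as mean length of clause differs between two sets of texts. The index values below are invented placeholders, not data from the study.

# A minimal sketch of the group comparison: does mean length of clause (MLC)
# differ between EFL student writing and writing by proficient users?
# The values are hypothetical placeholders.
from scipy.stats import ttest_ind

mlc_efl = [7.2, 6.8, 8.1, 7.5, 6.9, 7.0]           # EFL student essays
mlc_proficient = [9.4, 10.1, 8.8, 9.9, 10.5, 9.2]  # proficient users' texts

t, p = ttest_ind(mlc_efl, mlc_proficient, equal_var=False)
print(f"Welch's t = {t:.2f}, p = {p:.4f}")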


Author(s):  
Yaochen Deng ◽  
Lei Lei ◽  
Dilin Liu

Abstract: In the past two decades, syntactic complexity measures (e.g., the number of words per clause/t-unit/sentence, the number of clauses per t-unit/sentence, and the types of clauses used) have been widely used to determine and benchmark language proficiency development in speaking and writing (Norris and Ortega 2009; Lu 2011). However, the results of some recent studies (e.g., Lu 2011; Bulté and Housen 2014; Crossley and McNamara 2014) have raised questions about the earlier findings regarding the use of such complexity measures in assessing L2 writing. While a couple of plausible explanations have been proposed for the conflicting findings, these explanations have not considered the syntactic measures themselves as a likely source of the discrepancies in the research findings. In this forum piece, we would like to argue, with empirical evidence, that the conflicting research results might have resulted from issues with some of the existing measurements of clausal and phrasal sophistication, including inconsistency and a lack of necessary fine-grained differentiation in the measurement of subordination sophistication, and the possibly inappropriate use of high values of phrasal sophistication.
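The under-differentiation argument can be illustrated with a constructed pair of sentences (not drawn from the article): both convey comparable content, but only the first registers as complex on subordination measures, while the second concentrates its complexity in the noun phrase, which is what finer-grained phrasal measures are meant to capture.

# A constructed illustration of why clause-based measures alone can
# under-differentiate. Counts are annotated by hand for the two sentences.
examples = {
    "Because the test was hard, many students failed.":
        {"words": 8, "clauses": 2, "t_units": 1},
    "The unexpectedly difficult standardized admissions test caused widespread failure.":
        {"words": 9, "clauses": 1, "t_units": 1},
}

for sentence, c in examples.items():
    print(sentence)
    print(f"  clauses per t-unit: {c['clauses'] / c['t_units']:.1f}, "
          f"mean length of clause: {c['words'] / c['clauses']:.1f}")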


2017 ◽  
Vol 35 (2) ◽  
pp. 271-295 ◽  
Author(s):  
April Ginther ◽  
Xun Yan

This study examines the predictive validity of the TOEFL iBT with respect to academic achievement as measured by the first-year grade point average (GPA) of Chinese students at Purdue University, a large, public, Research I institution in Indiana, USA. Correlations between GPA and TOEFL iBT total and subsection scores were examined for 1,990 mainland Chinese students enrolled across three academic years (N2011 = 740, N2012 = 554, N2013 = 696). Subsequently, cluster analyses of the three cohorts’ TOEFL subsection scores were conducted to determine whether different score profiles might help explain the correlational patterns found between TOEFL subscale scores and GPA across the three student cohorts. For the 2011 and 2012 cohorts, speaking and writing subscale scores were positively correlated with GPA; however, negative correlations were observed for listening and reading. In contrast, for the 2013 cohort, the writing, reading, and total scores were positively correlated with GPA, and the negative correlations disappeared. Results of the cluster analyses suggest that the negative correlations in the 2011 and 2012 cohorts were associated with a distinctive Reading/Listening versus Speaking/Writing discrepant score profile of a single Chinese subgroup. In 2013, this subgroup disappeared from the incoming class because of changes made to the University’s international undergraduate admissions policy. The uneven score profile has important implications for admissions policy, the provision of English language support, and broader effects on academic achievement.
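The profile analysis described above can be sketched as follows: cluster examinees on their four section scores and inspect cluster means for a reading/listening versus speaking/writing discrepancy. The simulated scores and cluster settings below are placeholders, not the Purdue cohort data.

# A minimal sketch of profile detection: cluster examinees on four TOEFL iBT
# section scores and look for a group with high R/L but low S/W.
# Scores are simulated placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Columns: reading, listening, speaking, writing (each section scored 0-30).
balanced = rng.normal([22, 22, 21, 22], 2, size=(150, 4))
discrepant = rng.normal([27, 26, 17, 18], 2, size=(50, 4))
scores = np.clip(np.vstack([balanced, discrepant]), 0, 30)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
for label in range(2):
    profile = scores[kmeans.labels_ == label].mean(axis=0)
    print(f"cluster {label}: R={profile[0]:.1f} L={profile[1]:.1f} "
          f"S={profile[2]:.1f} W={profile[3]:.1f}")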

