Scientific report at an international conference: Using Multiple Approaches to Examine the Dependability of VSTEP Speaking and Writing Assessments

Presentation at the 4th International Conference of the Asian Association for Language Assessment (AALA), Taipei, Taiwan, October 2017

Research team:

Nathan T. Carr (California State University)

Nguyễn Thị Ngọc Quỳnh (ULIS, VNU)

Nguyễn Thị Quỳnh Yến (ULIS, VNU)

Nguyễn Thị Phương Thảo (ULIS, VNU)

Thái Hà Lam Thủy (ULIS, VNU)

Bùi Thiện Sao (ULIS, VNU)

Abstract: The Vietnamese Standardized Test of English Proficiency (VSTEP) is a test of general English proficiency developed on the basis of the Common European Framework of Reference (CEFR). It has recently been issued by Vietnam’s Ministry of Education and Training as a national instrument for English assessment. The test consists of sections assessing reading, writing, speaking, and listening, all four of which are taken by every test taker. The inclusion of performance-based tasks in a large-scale language proficiency test is intended to promote positive washback and to shift the focus of English instruction in Vietnam toward a more communicative orientation.

This presentation examines the consistency of scoring in the speaking and writing sections of the multi-level (B1–C1) VSTEP. In estimating the scoring consistency of rated tests, there are a number of methodological options, none of which yields entirely satisfactory results by itself. Inter-rater correlations, for example (see, e.g., Bachman, 2004; Carr, 2010), while perhaps the simplest and most commonly used approach, reveal nothing about the effects of other aspects of the testing process, such as differences in task difficulty or in test takers’ language ability. Generalizability theory (see Brennan, 2001; Shavelson & Webb, 1991), in contrast, shows how such aspects of the testing process contribute to score variation and to dependability, but yields no information on the ability of individual test takers, the severity or leniency of individual raters, or the difficulty of specific tasks. Finally, the many-facet Rasch model (see, e.g., Bond & Fox, 2001; Linacre, 2014; McNamara, 1996) does provide information at the individual level, but without the facet-level variance estimates and the clearly interpretable estimates of overall consistency that generalizability theory provides.
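To make the contrast between the latter two approaches concrete, consider their standard formulations (a sketch in conventional notation only; the particular facet design and estimation settings used for the VSTEP data are not detailed in this abstract). In a fully crossed persons × tasks × raters (p × t × r) generalizability study, the absolute error variance and the index of dependability are

    \sigma^2_\Delta = \frac{\sigma^2_t}{n_t} + \frac{\sigma^2_r}{n_r} + \frac{\sigma^2_{pt}}{n_t} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{tr}}{n_t n_r} + \frac{\sigma^2_{ptr,e}}{n_t n_r}, \qquad \Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta},

where the \sigma^2 terms are estimated variance components for persons, tasks, raters, and their interactions, and n_t and n_r are the numbers of tasks and raters. A common rating-scale form of the many-facet Rasch model (Linacre, 2014) is

    \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,

where B_n is the ability of test taker n, D_i the difficulty of task i, C_j the severity of rater j, and F_k the step difficulty of moving from category k-1 to category k on the rating scale. The first model summarizes consistency at the level of whole facets; the second locates each test taker, task, and rater individually on a common logit scale, which is precisely the complementarity that motivates the triangulation described next.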

Therefore, this project adopts the triangulation approach employed in earlier studies (e.g., Bachman, Lynch, & Mason, 1995; Lynch & McNamara, 1998), combining many-facet Rasch measurement with generalizability theory while adding inter-rater score correlations as a further source of information on scoring consistency. The results provide a clear picture of the dependability of VSTEP writing and speaking scores, as well as valuable information on areas for future test revision and improved rater training.