Abstract:
The score reliability of language performance tests has attracted increasing interest. Classical Test Theory cannot examine multiple sources of measurement error simultaneously. Generalizability theory extends Classical Test Theory to provide a practical framework for identifying and estimating the multiple factors that contribute to the total variance of a measurement. Using analysis of variance, generalizability theory partitions the variance into its corresponding sources and estimates their interactions. This study used generalizability theory as a theoretical framework to investigate the effect of raters’ gender on the assessment of EFL students’ writing. Thirty Iranian university students participated in the study. They were asked to write on an independent task and an integrated task, and the essays were holistically scored by 14 raters. A rater training session was held prior to scoring the writing samples. The data were analyzed using the GENOVA software program. The results indicated that the male raters’ scores were as reliable as those of the female raters for both writing tasks. A large rater variance component revealed low score generalizability when only a single rater is used. The implications of the results for educational assessment are elaborated.
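For reference, in a fully crossed persons × raters × tasks ($p \times r \times t$) design of the kind the abstract describes, G-theory decomposes the observed-score variance into seven components (standard G-theory notation; the crossed design is inferred from the abstract rather than stated in it):

$$\sigma^2(X_{prt}) = \sigma^2_p + \sigma^2_r + \sigma^2_t + \sigma^2_{pr} + \sigma^2_{pt} + \sigma^2_{rt} + \sigma^2_{prt,e}$$

and the generalizability coefficient for relative decisions is

$$E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}, \qquad \sigma^2_\delta = \frac{\sigma^2_{pr}}{n'_r} + \frac{\sigma^2_{pt}}{n'_t} + \frac{\sigma^2_{prt,e}}{n'_r\, n'_t},$$

where $n'_r$ and $n'_t$ are the numbers of raters and tasks assumed in the decision study. This makes the abstract’s point concrete: with $n'_r = 1$, the rater-linked error terms are undivided, so the error variance remains large and generalizability low.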
Machine summary:
The results indicated that the male raters’ scores were as reliable as those of the female raters for both writing tasks.
The sources of error affecting the reliability of written compositions include the student, the scoring method, raters’ professional background, gender, experience, rating scales, the physical environment, the design of the items, the test itself, and even the methods and amount of rater training (Barkaoui, 2008; Brown, 2010; Cumming, Kantor & Powers, 2001; Huang, 2007, 2009, 2011; Huang & Han, 2013; Mousavi, 2007; Shohamy, Gordon & Kraemer, 1992; Weigle, 1994, 1999, 2002).
G-theory Studies on Writing Assessment
Recently, several studies have utilized G-theory to inspect the reliability and validity of EFL/ESL writing scores and to explore the relative effect of different facets (raters, tasks, rating scales, etc.).
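The relative effect of each facet is read off the estimated variance components, which come from the ANOVA step that software such as GENOVA performs. Below is a minimal sketch of that step for a fully crossed p × r × t design; the dimensions echo the present study (30 students, 14 raters, 2 tasks), but the scores are random placeholders, not the study’s ratings:

import numpy as np

# Hypothetical G-study sketch: estimate variance components from mean
# squares for a fully crossed p x r x t design with one score per cell.
rng = np.random.default_rng(0)
n_p, n_r, n_t = 30, 14, 2
scores = rng.normal(5, 1, size=(n_p, n_r, n_t))  # placeholder data

grand = scores.mean()
mp = scores.mean(axis=(1, 2))   # person means
mr = scores.mean(axis=(0, 2))   # rater means
mt = scores.mean(axis=(0, 1))   # task means
mpr = scores.mean(axis=2)       # person x rater cell means
mpt = scores.mean(axis=1)       # person x task cell means
mrt = scores.mean(axis=0)       # rater x task cell means

# Sums of squares for each effect (standard three-way crossed ANOVA).
ss_p = n_r * n_t * ((mp - grand) ** 2).sum()
ss_r = n_p * n_t * ((mr - grand) ** 2).sum()
ss_t = n_p * n_r * ((mt - grand) ** 2).sum()
ss_pr = n_t * ((mpr - mp[:, None] - mr[None, :] + grand) ** 2).sum()
ss_pt = n_r * ((mpt - mp[:, None] - mt[None, :] + grand) ** 2).sum()
ss_rt = n_p * ((mrt - mr[:, None] - mt[None, :] + grand) ** 2).sum()
ss_e = ((scores - grand) ** 2).sum() - ss_p - ss_r - ss_t - ss_pr - ss_pt - ss_rt

# Mean squares, then variance components via the expected mean squares
# for a random-effects model; negative estimates are set to zero, as is
# conventional in G-studies.
ms = {
    "p": ss_p / (n_p - 1), "r": ss_r / (n_r - 1), "t": ss_t / (n_t - 1),
    "pr": ss_pr / ((n_p - 1) * (n_r - 1)),
    "pt": ss_pt / ((n_p - 1) * (n_t - 1)),
    "rt": ss_rt / ((n_r - 1) * (n_t - 1)),
    "e": ss_e / ((n_p - 1) * (n_r - 1) * (n_t - 1)),
}
var = {
    "prt,e": ms["e"],
    "pr": max((ms["pr"] - ms["e"]) / n_t, 0),
    "pt": max((ms["pt"] - ms["e"]) / n_r, 0),
    "rt": max((ms["rt"] - ms["e"]) / n_p, 0),
    "p": max((ms["p"] - ms["pr"] - ms["pt"] + ms["e"]) / (n_r * n_t), 0),
    "r": max((ms["r"] - ms["pr"] - ms["rt"] + ms["e"]) / (n_p * n_t), 0),
    "t": max((ms["t"] - ms["pt"] - ms["rt"] + ms["e"]) / (n_p * n_r), 0),
}
print({k: round(v, 3) for k, v in var.items()})

Each component’s share of the total estimated variance is the “relative effect” of that facet reported in such studies.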
Discussion and Conclusion
The purposes of the current study were to assess the reliability of writing assessment when taking into account the facets of tasks, raters, and raters’ gender, and to examine the effect of sequentially increasing the number of male and female raters.
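Sequentially increasing the number of raters is a classic decision (D) study question: every error term involving raters is divided by the number of raters averaged over. A minimal sketch follows, assuming illustrative variance components rather than the ones estimated in the paper:

# Hypothetical D-study sketch for a fully crossed p x r x t design.
# The variance components below are illustrative placeholders, not the
# estimates reported in the study.
var = {
    "p": 0.80,      # persons (universe-score variance)
    "r": 0.35,      # raters (enters absolute error only)
    "t": 0.05,      # tasks
    "pr": 0.20,     # person x rater interaction
    "pt": 0.10,     # person x task interaction
    "rt": 0.02,     # rater x task interaction
    "prt_e": 0.30,  # residual (p x r x t confounded with random error)
}

def g_coefficient(n_r, n_t):
    """Generalizability coefficient for relative (rank-order) decisions."""
    rel_error = var["pr"] / n_r + var["pt"] / n_t + var["prt_e"] / (n_r * n_t)
    return var["p"] / (var["p"] + rel_error)

def phi_coefficient(n_r, n_t):
    """Dependability (Phi) coefficient for absolute decisions."""
    abs_error = (var["r"] / n_r + var["t"] / n_t + var["rt"] / (n_r * n_t)
                 + var["pr"] / n_r + var["pt"] / n_t
                 + var["prt_e"] / (n_r * n_t))
    return var["p"] / (var["p"] + abs_error)

# Adding raters divides every rater-linked error term by n_r, so both
# coefficients rise toward an asymptote set by the task facet.
for n_r in (1, 2, 4, 8):
    print(f"{n_r:>2} rater(s), 2 tasks:  "
          f"G = {g_coefficient(n_r, 2):.3f}   Phi = {phi_coefficient(n_r, 2):.3f}")

With components of these relative magnitudes, the coefficient for a single rater is noticeably depressed, which mirrors the study’s finding that a large rater variance component lowers generalizability when only one rater scores each essay.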
In another study, Gebril (2006) used two different scoring rubrics to compare the performance of EFL students on independent and integrated writing tasks and reported a high correlation between the two sets of scores.
In sum, the current study attempted to investigate the score generalizability of independent and integrated writing tasks rated by male and female raters.
Implications of the Study
The present research aimed to scrutinize the effects of raters’ gender on the scoring variability and reliability of different IELTS writing tasks.