This study examined, using a group discussion and a debate, whether rater reliability can be sufficiently secured for classroom speaking tests at senior high school when simple rubrics are used without detailed rater training. The spoken performances of 227 senior high school students were each scored by two teachers and analyzed with many-facet Rasch measurement, generalizability theory, and related methods. The results showed that sufficient reliability was attained in terms of interrater agreement, interrater consistency, and intrarater consistency. For group-based tasks, the findings suggested ways to enhance reliability, such as including language-related criteria rather than the appropriateness of interaction in the rubric, allowing students longer speaking time, deciding students' roles and speaking order in advance, and having teachers who share a common understanding of the rubric do the scoring, along with the problems these measures entail.

Securing rater reliability for classroom speaking tests can be difficult because teacher-raters typically do not have much time to engage in rater training to understand and discuss rubrics and scores. Furthermore, a teacher typically faces difficulties asking colleagues to help double-mark each student's performance. Intensive rater training and double scoring are typical procedures for maintaining high reliability (Knoch et al., 2021) but are not well practiced in the classroom. However, in some cases, extensive training or double scoring is unnecessary when teachers use a rubric with a few criteria and levels, which is simpler than conventional detailed rubrics (Koizumi & Watanabe, 2021). Thus, we used a group discussion and a debate to explore rater reliability when Japanese senior high school teachers use simple analytic rubrics without detailed rater training. We posed the following research questions (RQs): RQ1: To what degree are raters similar in terms of interrater consensus and consistency? RQ2: To what degree do raters score students' responses consistently? RQ3: How many raters are required to maintain reliability?

We analyzed ratings for two speaking tests administered in September or November to 227 third-year students at a public senior high school. Each test, taken by a group of four students, included either a five-minute group discussion or a 21-minute group debate; test administration and marking were conducted during lesson time. An analytic rubric was developed for each task and consisted of three or four criteria with three levels (e.g., content, expression, and technique). Two of the three raters scored each student's response during the test. Teachers did not have time to discuss the rubrics in detail and engaged in only a 10-minute discussion of the rubrics before the tests. The ratings were analyzed separately for each test using weighted kappa statistics, Spearman's rank-order correlations, many-facet Rasch measurement (MFRM), and multivariate generalizability theory (mG theory).

The results indicated that overall rater reliability was adequate, but some cases required careful training. For RQ1, the kappa statistics for the two raters' scores on each criterion ranged from poor to substantial agreement (-.06 to .84). Correlations between the two raters' scores ranged from negligible to strong (-.07 to .91), and there were no large differences in rater severity (i.e., differences in fair mean-based average values of 0.07 to 0.16 with full marks of 3). In addition, the observed rater agreement percentages were higher than those expected by MFRM (e.g., 72.9% > 71.6%). For RQ2, intrarater consistency, examined using Infit and Outfit mean-square statistics from MFRM, was also adequate (e.g., 0.86 to 1.35). For RQ3, the number of raters needed to maintain sufficient reliability (Φ = .70) was one at the overall test level and one to three at the criterion level. The results showed that, with simple rubrics, a group discussion task, and a debate task, rater reliability can be maintained without extensive rater training.
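To make the RQ1 indices concrete, here is a minimal sketch, assuming two raters' scores on one criterion (1-3 scale) are available as arrays; the example score vectors, the linear kappa weighting, and the use of scikit-learn and SciPy are illustrative assumptions, not the procedures or software actually used in the study.

```python
# Minimal sketch (assumed setup): interrater consensus and consistency
# for two raters scoring one criterion on a 1-3 scale.
# The score arrays below are invented for illustration only.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rater1 = np.array([3, 2, 2, 1, 3, 2, 1, 3])  # hypothetical scores from rater 1
rater2 = np.array([3, 2, 1, 1, 3, 3, 1, 2])  # hypothetical scores from rater 2

# Weighted kappa as a consensus index (the weighting scheme is an assumption).
kappa = cohen_kappa_score(rater1, rater2, weights="linear")

# Spearman's rank-order correlation as a consistency index.
rho, p_value = spearmanr(rater1, rater2)

print(f"weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```

In practice these indices would be computed separately for each criterion and each task, which is why the abstract reports ranges rather than single values.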
Although the current results may have been affected by the study context, such as the procedures and the students' and raters' characteristics, they provide pedagogical and methodological implications for developing speaking assessment tasks and procedures and for reporting rater reliability statistics from multiple perspectives.
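The rater requirement for RQ3 comes from a decision (D) study. As a rough, assumption-laden sketch, a simplified univariate persons × raters (p × r) design would use the dependability coefficient below; the study itself applied multivariate G theory, so this form is only illustrative.

```latex
% Dependability (Phi) coefficient for a p x r D study with n'_r raters:
% sigma^2_p     = person (student) variance
% sigma^2_r     = rater variance
% sigma^2_{pr,e} = person-by-rater interaction / residual variance
\Phi(n'_r) = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_r + \sigma^2_{pr,e}}{n'_r}}
```

The smallest n'_r satisfying Φ(n'_r) ≥ .70 gives the number of raters reported above (one at the overall test level and one to three at the criterion level).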