This study examined, using a group discussion and a debate, whether rater reliability can be sufficiently secured for classroom speaking tests at senior high school when simple rubrics are used without detailed rater training. The spoken performances of 227 senior high school students were each scored by two teachers and analyzed with many-facet Rasch measurement, generalizability theory, and related methods. The results showed that sufficient reliability was attained in terms of interrater agreement, interrater consistency, and intrarater consistency. For group-based tasks, the findings suggested ways to enhance reliability, such as including language-related criteria rather than the appropriateness of interaction in the rubric, allowing students longer speaking time, deciding students' roles and speaking order in advance, and having teachers who share a common understanding of the rubric do the scoring, along with the problems these measures entail.

Securing rater reliability for classroom speaking tests can be difficult because teacher-raters typically do not have much time to engage in rater training to understand and discuss rubrics and scores. Furthermore, a teacher typically faces difficulties asking colleagues to help double-mark each student's performance. Intensive rater training and double scoring are typical procedures for maintaining high reliability (Knoch et al., 2021) but are not well practiced in the classroom. However, in some cases, extensive training or double scoring is unnecessary when teachers use a rubric with a few criteria and levels, which is simpler than conventional detailed rubrics (Koizumi & Watanabe, 2021). Thus, we used a group discussion and a debate to explore rater reliability when Japanese senior high school teachers use simple analytic rubrics without detailed rater training. We posed the following research questions (RQs): RQ1: To what degree are raters similar in terms of interrater consensus and consistency? RQ2: To what degree do raters score students' responses consistently? RQ3: How many raters are required to maintain reliability?

We analyzed ratings for two speaking tests administered in September or November to 227 third-year students at a public senior high school. Each test, taken by a group of four students, included either a five-minute group discussion or a 21-minute group debate; test administration and marking were conducted during lesson time. An analytic rubric was developed for each task and consisted of three or four criteria with three levels (e.g., content, expression, and technique). Two of the three raters scored each student's response during the test. Teachers did not have time to discuss the rubrics in detail and engaged in only a 10-minute discussion of the rubrics before the tests. The ratings were analyzed separately for each test using weighted kappa statistics, Spearman's rank-order correlations, many-facet Rasch measurement (MFRM), and multivariate generalizability theory (mG theory).

The results indicated that overall rater reliability was adequate, but some cases required careful training. For RQ1, the kappa statistics for the two raters' scores on each criterion ranged from poor to substantial agreement (-.06 to .84). Correlations between the two raters' scores ranged from negligible to strong (-.07 to .91), and there were no large differences in rater severity (i.e., differences in fair mean-based average values of 0.07 to 0.16 with full marks of 3). In addition, the observed rater agreement percentages were higher than those expected by MFRM (e.g., 72.9% > 71.6%). For RQ2, intrarater consistency, examined using Infit and Outfit mean-square statistics from MFRM, was also adequate (e.g., 0.86 to 1.35). For RQ3, the number of raters needed to maintain sufficient reliability (Φ = .70) was one at the overall test level and one to three at the criterion level. The results showed that, with simple rubrics, a group discussion task, and a debate task, rater reliability can be maintained without extensive rater training.
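To make the RQ1 indices concrete, here is a minimal sketch, assuming two raters' scores on one criterion (1-3 scale) are available as arrays; the example score vectors, the linear kappa weighting, and the use of scikit-learn and SciPy are illustrative assumptions, not the procedures or software actually used in the study.

```python
# Minimal sketch (assumed setup): interrater consensus and consistency
# for two raters scoring one criterion on a 1-3 scale.
# The score arrays below are invented for illustration only.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rater1 = np.array([3, 2, 2, 1, 3, 2, 1, 3])  # hypothetical scores from rater 1
rater2 = np.array([3, 2, 1, 1, 3, 3, 1, 2])  # hypothetical scores from rater 2

# Weighted kappa as a consensus index (the weighting scheme is an assumption).
kappa = cohen_kappa_score(rater1, rater2, weights="linear")

# Spearman's rank-order correlation as a consistency index.
rho, p_value = spearmanr(rater1, rater2)

print(f"weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```

In practice these indices would be computed separately for each criterion and each task, which is why the abstract reports ranges rather than single values.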
Although the current results may have been affected by the study context, such as the procedures and the students' and raters' characteristics, they provide pedagogical and methodological implications for developing speaking assessment tasks and procedures and for reporting rater reliability statistics from multiple perspectives.
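The rater requirement for RQ3 comes from a decision (D) study. As a rough, assumption-laden sketch, a simplified univariate persons × raters (p × r) design would use the dependability coefficient below; the study itself applied multivariate G theory, so this form is only illustrative.

```latex
% Dependability (Phi) coefficient for a p x r D study with n'_r raters:
% sigma^2_p     = person (student) variance
% sigma^2_r     = rater variance
% sigma^2_{pr,e} = person-by-rater interaction / residual variance
\Phi(n'_r) = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_r + \sigma^2_{pr,e}}{n'_r}}
```

The smallest n'_r satisfying Φ(n'_r) ≥ .70 gives the number of raters reported above (one at the overall test level and one to three at the criterion level).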