High-quality items are essential for producing reliable and valid assessments and for informing decision-making. As demand grows for items with strong psychometric properties in both summative and formative assessments, automatic item generation (AIG) has gained prominence. Research highlights the potential of large language models (LLMs) for AIG, noting the positive impact of generative AI tools such as ChatGPT on educational assessment and their ability to generate various item types across languages and subjects. This study addresses a research gap by exploring how AI-generated items in secondary/high school physics align with an educational taxonomy. It draws on Bloom's taxonomy, a well-known framework for designing and categorizing assessment items across cognitive levels from low to high, and focuses on a preliminary assessment of LLMs' ability to generate physics items that match the Application level of Bloom's taxonomy. Two leading LLMs, ChatGPT (GPT-4) and Gemini, were chosen for their strong performance in creating high-quality educational content. Various prompts were used to generate items targeting different cognitive levels of Bloom's taxonomy. The resulting items were evaluated against multiple criteria: clarity, accuracy, absence of misleading content, appropriate complexity, correct language use, alignment with the intended level of Bloom's taxonomy, solvability, and assurance of a single correct answer. The findings indicated that both ChatGPT and Gemini were capable of generating physics assessment items, though their effectiveness varied with the prompting method used. Instructional prompts in particular elicited excellent outputs from both models, producing items that were clear, precise, and consistently aligned with the Application level of Bloom's taxonomy.