Abstract

Pretrained large language models (LLMs) have gained popularity in recent years due to their strong performance on various educational tasks such as learner modeling, automated scoring, automatic item generation, and prediction. Nevertheless, LLMs are black-box models that offer limited interpretability, and they may carry human biases and prejudices because they are pretrained on historical human-generated data. For these reasons, prediction tasks based on LLMs require scrutiny to ensure that the resulting models are fair and unbiased. In this study, we used BERT, a pretrained encoder-only LLM, to predict response accuracy from action sequences extracted from the 2012 PIAAC assessment. We selected three countries (i.e., Finland, Slovakia, and the United States) representing different performance levels in the overall PIAAC assessment. The fine-tuned BERT model yielded promising results for predicting response accuracy. Additionally, we examined algorithmic bias in the prediction models trained on data from different countries. We found differences in model performance, suggesting that some of the trained models are not free from bias and are therefore less generalizable across countries. Our results highlight the importance of investigating algorithmic fairness in prediction models that rely on algorithmic systems to ensure that such models are free of bias.
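To make the modeling setup concrete, the sketch below illustrates one plausible way to fine-tune a BERT sequence classifier to predict response accuracy (correct vs. incorrect) from action-sequence tokens. It is a minimal illustration under stated assumptions, not the authors' pipeline: the example sequences, labels, model checkpoint, and hyperparameters are all hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): fine-tune BERT to classify
# whether an action sequence leads to a correct response.
# Sequences, labels, and hyperparameters below are illustrative assumptions.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical action sequences: each is a space-joined string of log-event tokens.
sequences = ["START CLICK_MENU OPEN_EMAIL REPLY SUBMIT", "START CLICK_HELP SUBMIT"]
labels = torch.tensor([1, 0])  # 1 = correct response, 0 = incorrect

enc = tokenizer(sequences, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A few illustrative fine-tuning steps on the toy batch.
model.train()
for _ in range(3):
    optimizer.zero_grad()
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()

# Predicted probability of a correct response for each sequence.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]
print(probs)
```

In a cross-country fairness analysis of the kind the abstract describes, a model fine-tuned this way on one country's data could then be evaluated on held-out data from other countries and the performance gaps compared.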