Middle school math forms the basis for advanced mathematics courses through the university level. Large language models (LLMs) have the potential to power next-generation educational technologies, acting as digital tutors for students. The main objective of this study was to determine whether LLMs such as ChatGPT, Bard, and Llama 2 can serve as reliable middle school math tutoring assistants on three tutoring tasks: hint generation, comprehensive solution, and exercise creation. Our first hypothesis was that ChatGPT would outperform Bard and Llama 2 on all three tutoring tasks because it has the largest model size (175 billion parameters). Our second hypothesis was that Bard would outperform Llama 2 in generating comprehensive, correct solutions because of its larger model size (137 billion parameters versus Llama 2's 70 billion). We curated medium-difficulty, word-based middle school math problems in algebra, number theory, and counting/probability from The Art of Problem Solving and Khan Academy. A human tutor evaluated each LLM's performance on each tutoring task. Contrary to our first hypothesis, ChatGPT did not perform uniformly better than Bard and Llama 2 across all tasks; it outperformed both only on the comprehensive solution task. Bard did not perform better than Llama 2 on the comprehensive solution task, which does not support our second hypothesis. We conclude that middle school math teachers can use a combination of ChatGPT, Bard, and Llama 2 as assistants, chosen according to the specific tutoring task.