This study investigates problem-solving performance across four mathematical domains, using statistical techniques to analyse domain-specific differences. Drawing on the NuminaMath-TIR dataset, we categorized problems into algebra, geometry, number theory, and combinatorics, selecting 8,000 problems for analysis. Four models, GPT-4o-mini, Mathstral-7B, Qwen2.5-Math-7B, and Llama-3.1-8B-Instruct, were used to assess answer correctness. Significant differences in solution accuracy were identified, with algebra showing the highest correctness rates and combinatorics the lowest. These results highlight the impact of domain on model performance and suggest that tool-integrated reasoning (TIR) techniques could improve consistency across domains. Future work can explore targeted improvements to model training, with the aim of optimizing educational technologies and adaptive learning systems.
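The core measurement described above, per-domain answer-correctness rates, can be sketched as follows. This is a minimal illustration, not the study's actual evaluation pipeline; the domain labels and correctness flags are placeholder data, not NuminaMath-TIR results.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """Compute correctness rate per domain.

    records: iterable of (domain, is_correct) pairs, one per graded problem.
    Returns a dict mapping each domain to its fraction of correct answers.
    """
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, attempted]
    for domain, is_correct in records:
        totals[domain][0] += int(is_correct)
        totals[domain][1] += 1
    return {d: correct / attempted for d, (correct, attempted) in totals.items()}

# Placeholder grading results for illustration only.
sample = [
    ("algebra", True), ("algebra", True), ("algebra", False),
    ("geometry", True), ("geometry", False),
    ("combinatorics", True), ("combinatorics", False),
]
print(accuracy_by_domain(sample))
```

In the study itself, such per-domain rates would feed a statistical test of whether the observed differences (e.g. algebra highest, combinatorics lowest) are significant.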