With the rapid advancement of artificial intelligence technologies, the integration of AI concepts into educational curricula has become an increasingly important issue. This paper presents a comparative analysis of four large language models (LLMs): ChatGPT (now GPT-4o), Bard (now Gemini), Copilot, and Auto-GPT. The models were evaluated over the past year as they progressed from earlier to newer versions, which also reveals their progress over time. Tasks were selected from the Valence project, which aims to advance machine learning in high school education with material designed by human experts. The four LLMs were assessed across 13 topics, 35 units, and 12 code segments, focusing on their capabilities in code generation, definition formulation, and textual tasks. Each LLM was allowed up to five attempts to produce outputs closely aligned with the human-written materials, with experts providing iterative feedback. The results were analyzed using several metrics to provide a comprehensive evaluation. This study evaluates the effectiveness and accuracy of these LLMs in educational content creation, offering insights into their potential roles in shaping current and future AI-centric education systems.