In a previous paper we defined testFAILS, a set of benchmarks for measuring the efficacy of Large Language Models (LLMs) across various domains. This paper defines a second-generation framework, testFAILS-2, to measure how current AI systems are progressing toward Artificial General Intelligence (AGI). The testFAILS-2 framework offers enhanced evaluation metrics that address the latest developments in Artificial Intelligence Linguistic Systems (AILS). A key feature of this work is the “Chat with Alan” project, a Retrieval-Augmented Generation (RAG)-based AI bot inspired by Alan Turing and designed to distinguish between human- and AI-generated interactions, thereby emulating Turing’s original vision. We assess a variety of models, including ChatGPT-4o-mini and other Small Language Models (SLMs) as well as prominent LLMs, using expanded criteria that encompass result relevance, accessibility, cost, multimodality, agent-creation capabilities, emotional AI attributes, AI search capacity, and LLM-robot integration. The analysis shows that testFAILS-2 significantly enhances the evaluation of model robustness and user productivity, while also identifying critical areas for improvement in multimodal processing and emotional reasoning. By integrating rigorous evaluation standards and novel testing methodologies, testFAILS-2 advances the assessment of AILS, providing essential insights that contribute to the ongoing development of more effective and resilient AI systems on the path toward AGI.