The growing interest in advanced large language models (LLMs) such as ChatGPT has sparked debate about how best to apply them across human activities. A neglected issue in this debate is whether LLMs can reason logically and follow rules in novel contexts, capabilities that are critical for understanding and deploying these models. To address this gap, this study investigates five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) using word ladder puzzles to assess their logical reasoning and rule-adherence capabilities. Our two-phase methodology (1) provides explicit instructions about word ladder puzzles and the rules for solving them, then evaluates each model's understanding of those rules, and (2) assesses the LLMs' ability to create and solve word ladder puzzles while adhering to the rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations as an example of a real-world scenario. Our findings reveal that the LLMs show a persistent lack of logical reasoning and systematically fail to follow the puzzle rules. Furthermore, all models except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. These results expose fundamental flaws in LLMs' reasoning and rule-following capabilities, raising concerns about their reliability in tasks that demand strict rule adherence and logical reasoning. We therefore urge caution when integrating LLMs into critical fields and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.
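For readers unfamiliar with the puzzle, a word ladder transforms a start word into a target word one letter at a time, with every intermediate word required to be a valid dictionary word of the same length. The minimal Python sketch below illustrates the kind of rule check the puzzles impose; it is an illustration of the puzzle's rules under these standard assumptions, not the paper's evaluation code, and the tiny word list is hypothetical.

```python
def is_valid_step(current: str, proposed: str, dictionary: set[str]) -> bool:
    """One word-ladder move: same length, exactly one letter changed,
    and the resulting word must appear in the dictionary."""
    if len(current) != len(proposed):
        return False
    changed = sum(a != b for a, b in zip(current, proposed))
    return changed == 1 and proposed in dictionary


def is_valid_ladder(steps: list[str], dictionary: set[str]) -> bool:
    """A full ladder is valid only if every consecutive pair is a valid step."""
    return all(
        is_valid_step(prev, nxt, dictionary)
        for prev, nxt in zip(steps, steps[1:])
    )


# Hypothetical miniature dictionary, for illustration only.
WORDS = {"cold", "cord", "card", "ward", "warm"}
print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"], WORDS))  # True
print(is_valid_ladder(["cold", "card", "warm"], WORDS))  # False: changes two letters at once
```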