More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed, and many aspects of common sense remain untested. Consequently, there is currently no reliable way of measuring to what extent existing AI systems have achieved these abilities. This article surveys the development and uses of AI commonsense benchmarks. It enumerates 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video-based, and 7 based in simulated physical environments. It gives more detailed descriptions of twelve of these, three from each category. It surveys the various methods used to construct commonsense benchmarks. It discusses the nature of common sense, the role of common sense in AI, the goals served by constructing commonsense benchmarks, desirable features of commonsense benchmarks, and flaws and gaps in existing benchmarks. It concludes with a number of recommendations for the future development of commonsense AI benchmarks; most importantly, that the creators of benchmarks invest the work needed to ensure that benchmark examples are consistently of high quality.