Planetary geomorphological maps spanning a wide range of spatial and temporal scales provide important information on landforms and their evolution. Producing a geomorphological map is extremely time-consuming, and the resulting maps are often difficult to reproduce. The success of machine learning and deep learning promises to drastically reduce the cost of producing these maps and to increase their reproducibility. However, deep learning methods rely strongly on sufficient ground truth data to recognize the target surface features. In this study, we investigate the results of an artificial intelligence (AI)-based workflow that recognizes lunar boulders in images taken from a lunar orbiter, with the aim of producing a global lunar map of all boulders that have left a track in the lunar regolith. We compare the AI findings with those of a human analyst (HA) who was given an identical database of images and asked to identify boulders with tracks. The comparison involved 181 lunar craters distributed across the lunar surface. Our results show that the AI workflow grossly underestimates the number of boulders in the images used: the AI approach found fewer than one fifth of the boulders identified by the HA. The purpose of this work is not to quantify the absolute sensitivities of the two approaches but to identify the causes of the differences between their results and to recommend how the machine learning approach can be improved under the given constraints. Our research makes the case that, despite the increasing ease with which deep learning methods can be applied to existing data sets, a more thorough and critical assessment of AI results is required to ensure that future network architectures produce the reliable geomorphological maps that these methods are capable of delivering.