The conventional process of visually inspecting and manually harvesting banana bunches is a long-standing problem in the agricultural industry. It is a laborious activity, and inconsistency in inspection and grading leads to post-harvest losses. Automated fruit harvesting using computer vision powered by deep learning could significantly improve the visual inspection process, allowing consistent harvesting and grading. To work toward an industry-level harvesting process, this work collects data from professional harvesters in the industry and investigates six state-of-the-art architectures to find the best solution. A total of 2,685 samples were collected from four different sites, with expert opinions from industry harvesters on whether to cut (harvest) or keep (not harvest) the banana bunch. Comparative results showed that the DenseNet121 architecture outperformed the other examined architectures, reaching a precision, recall, F1 score, accuracy, and specificity of 85%, 82%, 82%, 83%, and 83%, respectively. In addition, the black-box behavior of the model was visualized and found adequate: the visual interpretation of the model supports the human experts' criteria for harvesting. This system can assist or replace human experts in the field.
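For readers unfamiliar with the five reported metrics, the following minimal sketch shows how they are conventionally derived from the counts of a binary confusion matrix. It assumes "cut" is treated as the positive class and "keep" as the negative class; the abstract does not state which class is positive or whether the reported figures are averaged over classes. The names `ConfusionCounts` and `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (assumption: 'cut' is the positive class)."""
    tp: int  # bunches labelled cut and predicted cut
    fp: int  # bunches labelled keep but predicted cut
    tn: int  # bunches labelled keep and predicted keep
    fn: int  # bunches labelled cut but predicted keep


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(c: ConfusionCounts) -> dict:
    """Compute precision, recall, F1, accuracy, and specificity from counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "specificity": specificity,
    }


# Hypothetical counts for illustration only; these are not the study's results.
example = ConfusionCounts(tp=40, fp=10, tn=45, fn=5)
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Specificity is reported alongside recall because the two errors have different costs in this setting: a false "cut" harvests a bunch too early, while a false "keep" only delays harvesting.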