Automatic generation of image descriptions, i.e., image captioning, is an important topic in artificial intelligence (AI) that bridges computer vision (CV) and natural language processing (NLP). Neural networks have become increasingly popular for image captioning, and researchers continue to seek more efficient models for CV and sequence-to-sequence systems. This study presents a new image caption generation model that operates in two stages. First, low-level features, such as contrast, sharpness, and color, and high-level features, such as motion and facial impact score, are extracted. Then, an optimized convolutional neural network (CNN) is used to generate captions from the images. To improve accuracy, the weights of the CNN are optimally tuned via spider monkey optimization with sine chaotic map evaluation (SMO-SCME). The performance of the proposed method is evaluated with a diverse set of metrics.
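As a rough illustration of the first stage, the sketch below computes three common low-level image statistics with OpenCV and NumPy: contrast as the standard deviation of grayscale intensities, sharpness as the variance of the Laplacian, and colorfulness via the Hasler and Süsstrunk (2003) opponent-color metric. These particular definitions are illustrative assumptions, not the paper's specification, and the high-level features (motion, facial impact score) are omitted.

```python
import cv2
import numpy as np

def low_level_features(image_bgr: np.ndarray) -> dict:
    """Illustrative low-level features: contrast, sharpness, colorfulness.

    Common stand-in definitions; not necessarily the exact extractors
    used in the proposed model.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Contrast: standard deviation of grayscale intensities.
    contrast = float(gray.std())

    # Sharpness: variance of the Laplacian (higher means sharper edges).
    sharpness = float(cv2.Laplacian(gray, cv2.CV_64F).var())

    # Colorfulness: Hasler & Suesstrunk opponent-color metric.
    b, g, r = cv2.split(image_bgr.astype(np.float64))
    rg = r - g
    yb = 0.5 * (r + g) - b
    colorfulness = float(np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                         + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))

    return {"contrast": contrast, "sharpness": sharpness,
            "colorfulness": colorfulness}
```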
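The SMO-SCME component implies that a sine chaotic map drives part of the optimizer's randomness. The exact SMO-SCME update rule is not reproduced here; the sketch below assumes the standard sine map x_{k+1} = (a/4)·sin(π·x_k) with a = 4 and uses the resulting chaotic sequence only to initialize candidate CNN weight vectors within given bounds. The population-update logic of spider monkey optimization itself is omitted, and all function names and parameters are hypothetical.

```python
import numpy as np

def sine_map_sequence(length: int, x0: float = 0.7, a: float = 4.0) -> np.ndarray:
    """Standard sine chaotic map: x_{k+1} = (a / 4) * sin(pi * x_k).

    For a = 4 the iterates remain in (0, 1) and behave chaotically,
    which makes the map useful as a pseudo-random driver.
    """
    seq = np.empty(length)
    x = x0
    for k in range(length):
        x = (a / 4.0) * np.sin(np.pi * x)
        seq[k] = x
    return seq

def chaotic_init(pop_size: int, dim: int, lb: float, ub: float) -> np.ndarray:
    """Illustrative chaotic initialization of a population of weight
    vectors, mapping sine-map iterates from (0, 1) into [lb, ub]."""
    chaos = sine_map_sequence(pop_size * dim).reshape(pop_size, dim)
    return lb + chaos * (ub - lb)

# Example: 20 candidate weight vectors of dimension 50, bounded in [-1, 1].
population = chaotic_init(pop_size=20, dim=50, lb=-1.0, ub=1.0)
```

Chaotic initialization of this kind is often preferred over uniform random initialization in metaheuristics because the low-discrepancy, non-repeating iterates tend to cover the search space more evenly.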