PurposeThe analysis of multimedia content is being applied in various real-time computer vision applications. In multimedia content, digital images constitute a significant part. The representation of digital images interpreted by humans is subjective in nature and complex. Hence, searching for relevant images from the archives is difficult. Thus, electronic image analysis strategies have become effective tools in the process of image interpretation.Design/methodology/approachThe traditional approach used is text-based, i.e. searching images using textual annotations. It consumes time in the manual process of annotating images and is difficult to reduce the dependency in textual annotations if the archive consists of large number of samples. Therefore, content-based image retrieval (CBIR) is adopted in which the high-level visuals of images are represented in terms of feature vectors, which contain numerical values. It is a commonly used approach to understand the content of query images in retrieving relevant images. Still, the performance is less than optimal due to the presence of semantic gap among the image content representation and human visual understanding perspective because of the image content photometric, geometric variations and occlusions in search environments.FindingsThe authors proposed an image retrieval framework to generate semantic response through the feature extraction with convolution network and optimization of extracted features using adaptive moment estimation algorithm towards enhancing the retrieval performance.Originality/valueThe proposed framework is tested on Corel-1k and ImageNet datasets resulted in an accuracy of 98 and 96%, respectively, compared to the state-of-the-art approaches.