Abstract
Image captioning aims to generate human-like sentences that describe an image's content. Recent developments in deep learning (DL) have made it possible to caption images with accurate descriptions and detailed expressions. However, since a DL model learns the relationship between images and captions, it constructs sentences from the words that occur most frequently in the dataset. Although the generated sentences are highly accurate, their limited vocabulary gives them lower lexical diversity than human-written captions. Therefore, in this paper, we propose a Part-Of-Speech (POS) guidance module and a multimodal-based image captioning model that weight the image and word-sequence information and generate sentences guided by POS to enhance the lexical diversity of DL. The proposed POS guidance module enables rich expression by controlling the image and sequence information according to the predicted POS when predicting each word. The POS multimodal layer then combines the POS information with the Bi-LSTM output vector through the multimodal layer to predict the next word while considering grammatical structure. We trained and tested the proposed model on the Flickr30K and MS COCO datasets and compared it with current state-of-the-art studies. We also analyzed the lexical diversity of the captioning model through the Type-Token Ratio (TTR) and confirmed that the proposed model generates sentences using a more diverse vocabulary.
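The abstract measures lexical diversity with the Type-Token Ratio (TTR), i.e., the number of unique word types divided by the total number of tokens. Below is a minimal sketch of how TTR could be computed over a set of generated captions; the function name `type_token_ratio` and the simple whitespace tokenization are illustrative assumptions, not the paper's exact evaluation procedure.

```python
from typing import Iterable

def type_token_ratio(captions: Iterable[str]) -> float:
    """Type-Token Ratio: unique word types divided by total tokens."""
    tokens = [tok.lower() for caption in captions for tok in caption.split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical captions from a model on a small evaluation set
generated = [
    "a man riding a bike down a street",
    "a man riding a horse on a beach",
]
print(f"TTR = {type_token_ratio(generated):.3f}")
```

A higher TTR indicates that the model draws on a wider vocabulary rather than reusing the same frequent words across captions.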