In this article, we introduce a Multi-task Learning Approach for Image Captioning (MLAIC), which is inspired by the fact that people can readily finish this job given their proficiency in a variety of fields. There are three crucial components that make up MLAIC in particular:(i)A multi-object categorization model that uses a CNN image decoder to learn intricate category-aware picture representations (ii) A model for creating image captions that uses an LSTM-based decoder that is grammar conscious and shares its CNN encoder and LSTM decoder with an object categorization job to create text summaries of pictures. The additional object categorization and grammatical skills are particularly relevant to the job of creation. (ii) A syntactic generation model that enhances LSTM-based decoders that are syntax cognizant. An effective grammar creation model for the image labeling model is (iii).Our model beats other strong competitors in terms of efficiency, according to testing results on the MS-COCO dataset.
Read full abstract