Abstract

Feature vectors extracted by an object detector have been shown to improve image captioning performance. However, existing frameworks extract insufficient information from the image, in particular the positional relationships that are essential for judging how objects relate to one another. To fill this gap, we present a dual position relationship (DPR) transformer for image captioning. The architecture improves both the image information extraction and the description encoding steps: it first calculates the relative position (RP) and absolute position (AP) of objects, and then integrates this dual position relationship information into self-attention. Specifically, a convolutional neural network (CNN) and Faster R-CNN are applied for image feature extraction and object detection, and the RP and AP of the generated object boxes are then calculated. The former is expressed in coordinate form, and the latter is calculated by sinusoidal encoding. In addition, to better model the sequential and temporal relationships in the description, DPR adopts long short-term memory (LSTM) to encode the text vectors. We conduct extensive experiments on the Microsoft COCO: Common Objects in Context (MSCOCO) image captioning data set. The results show that our method achieves superior performance: the Consensus-based Image Description Evaluation (CIDEr) score increases to 114.6 after 30 training epochs, and the model runs 2 times faster than other competitive methods. The ablation study verifies the effectiveness of our proposed module.
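To make the two position signals concrete, the following is a minimal sketch of how a coordinate-form relative position and a sinusoidally encoded absolute position could be computed from detector boxes. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the function names, the `(x1, y1, x2, y2)` box layout, the center-offset and log-size-ratio form of the relative geometry, the normalization by image size, and the `dim` and `base` parameters. How these quantities are integrated into self-attention is not shown.

```python
import math


def relative_position(box_a, box_b):
    """Relative geometry of box_b with respect to box_a, in coordinate form.

    Boxes are (x1, y1, x2, y2). Returns (dx, dy, dw, dh): center offsets
    scaled by box_a's size, and log ratios of widths and heights.
    This particular form is an assumption, not taken from the paper.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return ((bx - ax) / aw, (by - ay) / ah,
            math.log(bw / aw), math.log(bh / ah))


def sinusoidal_encoding(value, dim=16, base=10000.0):
    """Encode one scalar as dim/2 sines followed by dim/2 cosines."""
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return ([math.sin(value * f) for f in freqs]
            + [math.cos(value * f) for f in freqs])


def absolute_position(box, image_w, image_h, dim=64):
    """Sinusoidal encoding of a box's normalized corner coordinates.

    Each of the four coordinates gets dim/4 features (assumed split).
    """
    norm = (box[0] / image_w, box[1] / image_h,
            box[2] / image_w, box[3] / image_h)
    encoding = []
    for v in norm:
        encoding.extend(sinusoidal_encoding(v, dim // 4))
    return encoding
```

For example, `relative_position((0, 0, 10, 10), (10, 0, 30, 10))` gives a horizontal offset of 1.5 box widths and a width log-ratio of `log(2)`, while `absolute_position` returns one fixed-length vector per box that depends only on where the box sits in the image.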
