Abstract

Image captioning is an important research direction at the intersection of computer vision and natural language processing. Building on object detection, it enables machines to describe image content in human language with grammatically correct sentences. Most existing methods employ a Transformer-based architecture and achieve state-of-the-art performance, but they focus mainly on improving visual feature extraction and on optimizing the interplay between grid features and region features to boost the final model. In this paper, we improve the model from the perspective of both its structure and its visual feature extraction. We propose the Feature-Fusion Parallel Decoding Transformer (FPDT), which adopts a parallel decoding mode and uses both grid features and region features. Extensive experiments on the MSCOCO dataset show that FPDT's performance is at the cutting edge.
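
To make the idea concrete, the sketch below illustrates one plausible way to fuse grid features and region features as a shared memory for a Transformer decoder. This is not the authors' implementation: the feature dimensions, the concatenation-based fusion, and the omission of a causal mask (in the spirit of parallel, non-autoregressive decoding) are all assumptions for illustration.

```python
# Minimal sketch (assumed, not the FPDT reference code): project grid and
# region features into a shared space, concatenate them, and let a
# Transformer decoder cross-attend to the fused memory.
import torch
import torch.nn as nn

class FeatureFusionDecoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=3, vocab_size=10000):
        super().__init__()
        self.grid_proj = nn.Linear(2048, d_model)    # e.g. CNN grid features
        self.region_proj = nn.Linear(2048, d_model)  # e.g. detector region features
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, grid_feats, region_feats):
        # Fuse by concatenating the two projected feature sequences so the
        # decoder attends to both sources at once. No causal mask is applied,
        # mimicking a parallel (non-autoregressive) decoding pass.
        memory = torch.cat(
            [self.grid_proj(grid_feats), self.region_proj(region_feats)], dim=1
        )
        tgt = self.embed(tokens)
        return self.out(self.decoder(tgt, memory))

# Example shapes: 49 grid cells and 36 detected regions per image (assumed).
model = FeatureFusionDecoder()
logits = model(torch.randint(0, 10000, (2, 20)),
               torch.randn(2, 49, 2048),
               torch.randn(2, 36, 2048))
print(logits.shape)  # (2, 20, 10000)
```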
