Abstract

Image captioning with an encoder–decoder approach, where a CNN serves as the encoder and a sequence generator such as an RNN serves as the decoder, has proven very effective. However, this method has a drawback: the sequence must be processed step by step, in order. To overcome this drawback, some researchers have applied the transformer model to generating captions from images on English datasets. However, none have generated captions in Bengali using the transformer model. We therefore used three different Bengali datasets to generate Bengali captions from images with a transformer model. Additionally, we compared the performance of the transformer-based model with a visual-attention-based encoder–decoder approach. Finally, we compared the results of the transformer-based model with those of other models trained on different Bengali image-captioning datasets.

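The abstract gives no implementation details; purely as a sketch of the architecture it describes, the following PyTorch snippet pairs a CNN encoder with a transformer decoder whose masked self-attention removes the step-by-step decoding an RNN requires. Every concrete choice here (the ResNet-50 backbone, d_model=512, the learned positional embeddings, the dummy vocabulary size) is our assumption for illustration, not a detail taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models


class CaptionTransformer(nn.Module):
    """Illustrative CNN-encoder + transformer-decoder captioning model."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=64):
        super().__init__()
        # CNN backbone as the visual encoder; the classifier head and
        # global pooling are dropped so the spatial feature map survives.
        backbone = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)

        # Token embeddings plus learned positional embeddings (an assumed
        # choice; the paper may use sinusoidal encodings instead).
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) integer token ids.
        feats = self.proj(self.encoder(images))       # (B, d_model, h, w)
        memory = feats.flatten(2).transpose(1, 2)     # (B, h*w, d_model)

        T = captions.size(1)
        tgt = self.embed(captions) + self.pos[:, :T]  # (B, T, d_model)
        # Causal mask: each position attends only to earlier tokens, yet
        # all positions are computed in parallel (no step-by-step RNN loop).
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                       # (B, T, vocab_size)


# Quick shape check with dummy data (hypothetical vocabulary size).
model = CaptionTransformer(vocab_size=8000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 8000])
```

The causal mask reproduces left-to-right generation at training time while letting every caption position be computed in parallel, which is precisely the sequential-processing bottleneck the abstract attributes to RNN decoders.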