Abstract
Image captioning aims to generate natural language descriptions for images. Word occurrences in captions usually obey Zipf’s law, and this imbalance biases conventional training toward the majority (high-frequency) words. However, the imbalanced distribution has not been adequately considered in captioning work. In this paper, we adapt imbalanced-learning methods from classification to image captioning and conduct an empirical study. We also propose a Task-aware Decoupled Learning and Fusion (TDLF) approach, which outperforms these adapted methods. Image captioning differs from classification in three main aspects: 1) captions are sequential labels whose words co-occur, 2) generation usually proceeds autoregressively, and 3) the imbalance ratio is extremely large. To deal with these problems, our TDLF method introduces multi-task learning into the re-balancing approach. The model is composed of a shared autoregressor and two task classifiers, i.e., a conventional-training classifier and a balance-training classifier. The model is further equipped with a task-aware decoupling strategy: we propose the Task Perception Indication (TPI) to measure whether conventional training is shifted. The balance-training classifier is trained separately on the biased data, and the outputs of the two tasks are fused according to the TPI. Experiments on the MSCOCO dataset show that our model outperforms state-of-the-art methods in generation accuracy and word diversity, demonstrating the effectiveness of the proposed method.
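The fusion of the two classifier outputs can be illustrated with a toy decoding step. The abstract does not define how the TPI is computed or the exact fusion rule, so the following is only a minimal sketch under stated assumptions: `tpi` is treated as a scalar in [0, 1], the fusion is a simple convex combination of the two word distributions, and the logits, vocabulary, and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(p_conv, p_bal, tpi):
    """Convex combination of two word distributions.

    ASSUMPTION: `tpi` is a scalar gate in [0, 1]; a larger value puts more
    weight on the balance-training head. The paper's actual TPI and fusion
    rule may differ.
    """
    if not 0.0 <= tpi <= 1.0:
        raise ValueError("tpi must lie in [0, 1]")
    return [(1.0 - tpi) * pc + tpi * pb for pc, pb in zip(p_conv, p_bal)]


# Hypothetical three-word vocabulary: one frequent word and two rarer ones.
vocab = ["a", "dog", "frisbee"]

# Hypothetical logits from the two heads for one decoding step. Both heads
# would read the same hidden state from the shared autoregressor.
conv_logits = [3.0, 1.0, 0.0]  # conventional head favours the frequent word
bal_logits = [1.0, 1.0, 2.0]   # balance-trained head favours the rare word

p_conv = softmax(conv_logits)
p_bal = softmax(bal_logits)
fused = fuse(p_conv, p_bal, tpi=0.5)

for word, p in zip(vocab, fused):
    print(f"{word}: {p:.3f}")
```

With this gate, the fused distribution gives the rare word more probability mass than the conventional head alone while remaining a valid distribution; setting `tpi` to 0 or 1 recovers either head unchanged.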