Abstract

Many recent methods adopt the encoder-decoder framework for video captioning, aiming to translate short videos into natural language. These methods usually sample frames at equal intervals. However, such sampling is inefficient: it carries high temporal and spatial redundancy and therefore incurs unnecessary computation. In addition, existing approaches simply concatenate different visual features at the fully connected layer, so the features cannot be used effectively. To address these shortcomings, we propose a filtration network (FN) to select key frames, trained with a deep reinforcement learning algorithm we call actor-double-critic. Inspired by behavioral psychology, the core idea of actor-double-critic is that an agent's behavior is determined by both the external environment and its internal personality. It avoids unclear rewards and sparse feedback during training because it provides steady feedback after every action. The selected key frames are fed into a combine codec network (CCN) to generate sentences. The feature-combination operation in CCN fuses visual features through a complex-number representation for better semantic modeling. Experiments and comparisons with other methods on two datasets (MSVD and MSR-VTT) show that our approach achieves better performance on four metrics: BLEU-4, METEOR, ROUGE-L, and CIDEr.
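
The abstract does not detail how the complex-number fusion in CCN is realized. Below is a minimal sketch of one common way such a fusion could be implemented, assuming (purely for illustration) that appearance features form the real part and motion features the imaginary part of a complex vector, mixed by a complex-valued linear layer; the class and variable names are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class ComplexFeatureFusion(nn.Module):
    """Illustrative sketch: fuse two visual feature streams as one complex vector.

    Assumption (not stated in the abstract): appearance features act as the real
    part and motion features as the imaginary part; a complex-valued linear map
    mixes the two before the result is flattened back to a real-valued vector.
    """

    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        # A complex weight W = W_r + i*W_i expands into four real matrix products.
        self.w_real = nn.Linear(dim_in, dim_out, bias=False)
        self.w_imag = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # (a + i*b)(W_r + i*W_i): real part a*W_r - b*W_i, imaginary part a*W_i + b*W_r
        real = self.w_real(appearance) - self.w_imag(motion)
        imag = self.w_imag(appearance) + self.w_real(motion)
        # Concatenate real and imaginary parts as the fused representation.
        return torch.cat([real, imag], dim=-1)

# Usage with hypothetical 2048-d features extracted from the selected key frames.
fusion = ComplexFeatureFusion(dim_in=2048, dim_out=512)
appearance_feats = torch.randn(8, 2048)  # e.g. 2D-CNN appearance features per key frame
motion_feats = torch.randn(8, 2048)      # e.g. 3D-CNN motion features per key frame
fused = fusion(appearance_feats, motion_feats)  # shape: (8, 1024)
```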
