Abstract

Visual storytelling is the task of generating a story for a given image sequence, and it has received significant attention. However, using general RNNs (such as LSTMs and GRUs) as the decoder limits model performance on this task, because such decoders cannot differentiate between different types of information representations. In addition, optimizing the probabilities of subsequent words conditioned on the preceding ground-truth sequence causes error accumulation during inference. Moreover, the existing method for alleviating error accumulation by replacing reference words does not account for the different effect of each word. To address these problems, we propose a modified neural network named AOG-LSTM and a modified training strategy named ARS. AOG-LSTM adaptively pays appropriate attention to different information representations within it when predicting different words. During training, ARS replaces some words in the reference sentences with model predictions, as in the existing method; however, it uses a selection network and a selection strategy to choose more appropriate words for the replacement, which improves the model further. Experiments on the VIST dataset demonstrate that our model outperforms several strong baselines on the most commonly used metrics.
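To make the replacement idea behind ARS concrete, the PyTorch sketch below illustrates one plausible reading of it: a selection network scores each position in the reference sentence, and the highest-scoring positions have their ground-truth tokens swapped for model predictions before the next training step. This is a minimal sketch under stated assumptions, not the authors' implementation; the scoring tensor `selection_scores` and the replacement ratio `ratio` are hypothetical names introduced for illustration.

```python
import torch

def adaptive_replace(reference_ids, predicted_ids, selection_scores, ratio=0.25):
    """Replace the highest-scoring reference tokens with model predictions.

    reference_ids:    (batch, seq_len) ground-truth token ids
    predicted_ids:    (batch, seq_len) token ids sampled from the model
    selection_scores: (batch, seq_len) per-position suitability scores
                      from a selection network (hypothetical interface)
    ratio:            fraction of positions to replace per sequence (assumed)
    """
    batch, seq_len = reference_ids.shape
    k = max(1, int(ratio * seq_len))
    # Pick, per sequence, the k positions the selection network rates
    # as most suitable for replacement.
    _, top_pos = selection_scores.topk(k, dim=1)
    mixed = reference_ids.clone()
    rows = torch.arange(batch).unsqueeze(1).expand(-1, k)
    # Swap in the model's own predictions at the selected positions,
    # exposing the decoder to its inference-time input distribution.
    mixed[rows, top_pos] = predicted_ids[rows, top_pos]
    return mixed
```

Compared with replacing uniformly random positions (as in scheduled-sampling-style training), scoring positions first lets the strategy target the words whose replacement is most useful for reducing the train/inference mismatch.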
