Abstract

Visual storytelling is the task of generating a relevant story for a given image sequence, and it has received significant attention. However, using general RNNs (such as LSTM and GRU) as the decoder limits the performance of models on this task, because they cannot differentiate between different types of information representations. In addition, optimizing the probabilities of subsequent words conditioned on the previous ground-truth sequences causes error accumulation during inference. Moreover, the existing method of alleviating error accumulation by replacing reference words does not account for the different effects of individual words. To address these problems, we propose a modified neural network named AOG-LSTM and a modified training strategy named ARS. AOG-LSTM adaptively pays appropriate attention to the different information representations within it when predicting different words. During training, ARS replaces some words in the reference sentences with model predictions, similar to the existing method; however, we employ a selection network and a selection strategy to choose more appropriate words for replacement, which improves the model further. Experiments on the VIST dataset demonstrate that our model outperforms several strong baselines on the most commonly used metrics.
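To make the ARS idea concrete, the sketch below illustrates one plausible way to replace selected reference words with model predictions during training. It is only an illustration of the general technique described in the abstract, not the authors' implementation: the function name `ars_replace`, the `selection_scores` input (assumed to come from a selection network), and the `replace_ratio` parameter are all hypothetical.

```python
import torch


def ars_replace(reference_ids, model_logits, selection_scores, replace_ratio=0.25):
    """Hypothetical sketch of ARS-style reference replacement.

    reference_ids:    (batch, seq_len) ground-truth word ids
    model_logits:     (batch, seq_len, vocab) predictions from a teacher-forced pass
    selection_scores: (batch, seq_len) per-position scores from a selection network,
                      where higher means more suitable for replacement
    replace_ratio:    fraction of positions to replace in each sequence
    """
    batch_size, seq_len = reference_ids.shape
    num_replace = max(1, int(replace_ratio * seq_len))

    # Pick the positions the selection network scores highest.
    _, positions = selection_scores.topk(num_replace, dim=1)   # (batch, num_replace)

    # Use the model's own greedy predictions at those positions.
    predicted_ids = model_logits.argmax(dim=-1)                 # (batch, seq_len)

    mixed_ids = reference_ids.clone()
    mixed_ids.scatter_(1, positions, predicted_ids.gather(1, positions))
    return mixed_ids  # fed as the decoder input in a subsequent training pass
```

Under this reading, the selection network and selection strategy decide *where* to substitute model predictions for ground-truth words, rather than replacing positions uniformly at random as in scheduled-sampling-style approaches.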
