Abstract

With the incorporation of pre-training, transfer learning and keyword input, audio captioning has made notable progress in recent years in generating accurate descriptions of audio events. However, current captioning models tend to produce repetitive, generic sentences that reflect the most frequent patterns in the training data. Some work in natural language generation attempts to improve diversity by attending to specific content or by increasing the number of generated captions, but these approaches often gain diversity at the cost of description accuracy. In this work, we propose a novel neural conditional captioning model to balance the diversity-accuracy trade-off. In contrast to a statistical condition, the neural condition is the posterior given by a neural discriminator. Given the reference condition, the captioning model is trained to generate captions with a similar posterior, and the captioning model and the discriminator are trained adversarially. We evaluate the proposed approach on Clotho and AudioCaps. The results show that, compared with the baselines, our approach improves output diversity with the smallest decline in accuracy.
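To make the conditioning idea concrete, below is a minimal PyTorch-style sketch (not the authors' implementation) of one adversarial training step: a hypothetical `captioner` produces soft token distributions, a discriminator assigns each caption a posterior over condition classes, and the captioner is pushed to match the posterior of the reference caption. All module names, the `captioner(audio)` interface, and the loss weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Maps a caption, given as soft or one-hot token distributions (B, T, V),
    to a posterior over condition classes; soft inputs keep the path to the
    caption generator differentiable."""
    def __init__(self, vocab_size: int, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Linear(vocab_size, hidden, bias=False)  # "soft" embedding lookup
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_probs):
        h, _ = self.rnn(self.embed(token_probs))
        return self.head(h[:, -1])              # condition logits

def adversarial_step(captioner, disc, audio, ref_tokens, vocab_size, opt_g, opt_d):
    """One training step under the assumed setup: captioner(audio) returns
    softmax token distributions of shape (B, T, V)."""
    ref_onehot = F.one_hot(ref_tokens, vocab_size).float()      # reference captions
    gen_probs = captioner(audio)                                 # generated captions (soft)
    b = ref_tokens.size(0)

    # 1) Discriminator update: separate reference from generated captions.
    d_loss = (F.cross_entropy(disc(ref_onehot), torch.ones(b, dtype=torch.long)) +
              F.cross_entropy(disc(gen_probs.detach()), torch.zeros(b, dtype=torch.long)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Captioner update: match the posterior the discriminator assigns to the
    #    reference caption (the neural condition) via a KL term, plus the usual
    #    maximum-likelihood captioning loss.
    with torch.no_grad():
        ref_post = F.softmax(disc(ref_onehot), dim=-1)
    gen_logpost = F.log_softmax(disc(gen_probs), dim=-1)
    g_loss = (F.kl_div(gen_logpost, ref_post, reduction="batchmean") +
              F.nll_loss(gen_probs.clamp_min(1e-8).log().transpose(1, 2), ref_tokens))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Feeding the discriminator soft token distributions rather than sampled token ids is one common way to keep the generator path differentiable; the paper itself may use a different relaxation (e.g. sampling with policy gradients), so this choice is purely for the sketch.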
