Audio captioning, a comprehensive audio-understanding task, aims to generate a natural-language description of an audio clip. Beyond accuracy, diversity is also a critical requirement for this task. Human-produced captions possess rich variability due to the ambiguity of audio semantics (e.g., insect buzzing and electrical humming can sound alike) and the presence of subjective judgments (metaphor, emotion, etc.). However, current diverse audio captioning systems fail to produce captions with near-human diversity. Recently, diffusion models have demonstrated the potential to generate diverse data while maintaining decent accuracy, yet they have not been explored for audio captioning. On the other hand, diffusion models tend to exhibit low generation accuracy and fluency on text data. Directly applying diffusion models to audio captioning may aggravate this problem due to the small size of annotated datasets and the mutable supervision target caused by the variability of human-produced captions. In this work, we propose a diffusion-based diverse audio captioning model that incorporates the BART language model into the diffusion framework to better exploit pre-trained linguistic knowledge. We also propose a retrieval-guided Langevin dynamics module, which enables dynamic run-time alignment between generated captions and the target audio. Extensive experiments on standard audio captioning benchmark datasets (Clotho and AudioCaps) demonstrate that our model outperforms state-of-the-art diverse audio captioning systems on both accuracy and diversity metrics. The implementation is available at https://github.com/AI4MyBUPT/DACRLD.
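To make the idea of retrieval-guided Langevin dynamics more concrete, the sketch below illustrates the general pattern of gradient-based guidance during diffusion sampling: a retrieval-style audio-text score is differentiated with respect to the caption latent, and the latent is nudged up that gradient with added Langevin noise. This is a minimal illustration under assumed shapes and names (`audio_text_score`, `langevin_guidance_step`, the toy cosine similarity), not the authors' implementation; see the linked repository for the actual method.

```python
import torch

# Hypothetical sketch of one retrieval-guided Langevin update on a diffusion
# latent x_t. `audio_text_score` stands in for an audio-text retrieval model
# that scores how well a caption latent matches the target audio embedding;
# all names and shapes here are illustrative assumptions.

def audio_text_score(x_t: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    # Toy score: cosine similarity between the mean-pooled latent and the audio embedding.
    pooled = x_t.mean(dim=1)  # (batch, dim)
    return torch.nn.functional.cosine_similarity(pooled, audio_emb, dim=-1).sum()

def langevin_guidance_step(x_t, audio_emb, step_size=0.1, noise_scale=0.01):
    """Nudge the latent toward higher audio-text similarity, plus Langevin noise."""
    x_t = x_t.detach().requires_grad_(True)
    score = audio_text_score(x_t, audio_emb)
    grad = torch.autograd.grad(score, x_t)[0]  # ascend the retrieval score
    with torch.no_grad():
        x_t = x_t + step_size * grad + noise_scale * torch.randn_like(x_t)
    return x_t.detach()

# Usage with random tensors standing in for a caption latent and an audio embedding.
x_t = torch.randn(2, 20, 768)    # (batch, caption length, latent dim)
audio_emb = torch.randn(2, 768)  # audio embedding from a retrieval encoder
x_t = langevin_guidance_step(x_t, audio_emb)
```

In this guidance pattern, the update would typically be interleaved with the usual denoising steps so that each intermediate latent is pulled toward captions the retrieval model judges consistent with the target audio.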