Abstract

Automatic audio captioning (AAC) is an important area of research aimed at generating meaningful descriptions for audio clips. Most existing methods use relevant semantic information to improve AAC performance and have demonstrated the feasibility of semantic information extraction. Audio events and keywords are commonly used for this purpose. Unlike previous studies, this study proposes a framework that uses topic modeling to obtain relevant semantic content since topic models explore the main themes of the documents. To this end, we present a framework that integrates audio embeddings with audio topics in a transformer-based encoder-decoder architecture. First, we represent each audio clip with a set of topics using a pre-trained topic model, BERTopic. Then, we design a multilayer perceptron (MLP)-based multi-label classifier to predict the topics of audio clips in the testing phase. Finally, in the proposed framework, we input audio embedding and extracted topics into the transformer model to generate captions. The results show that the proposed model improves performance and competes with the most advanced methods that utilize additional external data for training. We believe that the topic modeling can be used to extract semantic content in the AAC task.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.