Abstract

Although the audio modality can help in conditions that are challenging for the visual modality, there are few studies on audio-based detection. This is because the audio modality itself carries less accurate spatial information. To alleviate this issue, existing audio-based methods use the visual modality during training to transfer more precise spatial knowledge to the audio modality. However, they do not consider the case where the visual modality is less informative. In this paper, we present a new audio-based vehicle detector that transfers multimodal knowledge of vehicles to the audio modality during training. To this end, we combine audio and visual knowledge according to the importance of each modality to generate an integrated audio-visual feature. We also introduce an audio-visual distillation (AVD) loss that guides the representation of the audio feature to resemble that of the integrated audio-visual feature. As a result, our audio-based detector can perform robust vehicle detection as if it were using both modalities, even though it receives only audio as input at inference time. Comprehensive experimental results demonstrate that our method consistently improves over existing methods.
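The two training-time ideas described above, importance-weighted fusion and a distillation loss that pulls the audio feature toward the fused feature, can be illustrated with a minimal sketch. The abstract does not specify how importance is computed or which distance the AVD loss uses, so everything here is an assumption for illustration: the function names `fuse_features` and `avd_loss`, the softmax over two scalar importance scores, and the mean squared error are hypothetical choices, not the paper's actual formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(audio_feat, visual_feat, audio_score, visual_score):
    """Blend two feature vectors with weights derived from importance scores.

    Assumption: importance is a single scalar per modality, normalized
    with a softmax. The paper may use a different mechanism.
    """
    w_audio, w_visual = softmax([audio_score, visual_score])
    return [w_audio * a + w_visual * v for a, v in zip(audio_feat, visual_feat)]


def avd_loss(audio_feat, fused_feat):
    """Distillation-style loss: mean squared error between the audio
    feature and the fused audio-visual feature.

    Assumption: MSE is a stand-in; the abstract does not name the distance.
    """
    n = len(audio_feat)
    return sum((a - f) ** 2 for a, f in zip(audio_feat, fused_feat)) / n


# Toy features; with equal importance scores each modality gets weight 0.5.
audio = [1.0, 3.0]
visual = [3.0, 1.0]
fused = fuse_features(audio, visual, audio_score=0.0, visual_score=0.0)
loss = avd_loss(audio, fused)
```

In this sketch, a very low visual importance score drives the fused feature toward the audio feature, which mirrors the abstract's motivation of handling cases where the visual modality is less informative. At inference time only the audio branch would be used, so neither function is needed there.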
