Most existing audio–visual datasets are class-relevant, i.e., both the audio and visual modalities are annotated with the class labels. Consequently, current audio–visual recognition methods rely on cross-modality attention or modality fusion. However, effectively leveraging the audio modality in vision-specific videos for human activity recognition remains particularly challenging. We address this challenge by proposing a novel audio–visual recognition framework that effectively leverages the audio modality in any vision-specific annotated dataset. The proposed framework employs language models (e.g., GPT-3, CPT-text, BERT) to build a semantic audio–video label dictionary (SAVLD) that serves as a bridge between audio and video datasets by mapping each video label to its K most relevant audio labels. The SAVLD, together with a pre-trained multi-label audio model, is then used to estimate the audio–visual modality relevance. Accordingly, we propose a novel learnable irrelevant modality dropout (IMD) that completely drops the irrelevant audio modality and fuses only the relevant modalities. Finally, to keep the proposed multimodal framework efficient, we present an efficient two-stream video Transformer to process the visual modalities (i.e., RGB frames and optical flow). The final predictions are re-ranked using GPT-3 recommendations of the human activity classes, where GPT-3 provides high-level recommendations based on the labels of the detected visual objects and the audio predictions of the input video. Our framework demonstrates strong performance on the vision-specific annotated datasets Kinetics400 and UCF-101, outperforming most relevant human activity recognition methods.
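To make the SAVLD construction concrete, the following is a minimal sketch of the label-mapping idea: embed the video-class labels and audio-class labels with a text encoder and keep, for each video label, its K most semantically similar audio labels. The specific encoder (a BERT-based sentence-transformer here), the value of K, and the function name `build_savld` are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: map each video label to its K most semantically similar audio labels.
from sentence_transformers import SentenceTransformer
import numpy as np

def build_savld(video_labels, audio_labels, k=5,
                model_name="all-MiniLM-L6-v2"):
    # Encode both label sets into a shared text-embedding space.
    encoder = SentenceTransformer(model_name)
    v_emb = encoder.encode(video_labels, normalize_embeddings=True)
    a_emb = encoder.encode(audio_labels, normalize_embeddings=True)

    # Cosine similarity between every video label and every audio label.
    sim = v_emb @ a_emb.T

    # Keep the indices of the K most similar audio labels per video label.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    return {video_labels[i]: [audio_labels[j] for j in top_k[i]]
            for i in range(len(video_labels))}

# Example usage with toy label sets (hypothetical labels).
savld = build_savld(["playing guitar", "surfing water"],
                    ["acoustic guitar", "sea waves", "speech", "applause"],
                    k=2)
```

The resulting dictionary can then be queried at inference time: a video prediction's label retrieves its K associated audio labels, against which the audio model's multi-label output can be compared to decide whether the audio modality is relevant or should be dropped by the IMD module.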