Abstract

In multi-modal learning tasks such as video understanding, the most important operations are feature extraction, feature enhancement within a single modality, and feature aggregation across modalities. In this paper, we present two attention-based algorithms: the Position-embedding Non-local (PE-NL) Network and the Multi-modal Attention (MA) feature aggregation method. Inspired by Non-local Neural Networks and Transformers, PE-NL is a self-attention-like feature enhancement operation that captures long-range dependencies and models relative positions. The MA aggregation method merges the visual and audio modalities while reducing the feature dimensionality and the number of parameters without losing much accuracy. Both the PE-NL and MA blocks can be plugged into many multi-modal learning architectures. Our Gated PE-NL-MA network achieves competitive results on the YouTube-8M dataset.
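To make the PE-NL idea concrete, the sketch below shows a self-attention-like block over a sequence of per-frame features that adds learned relative-position embeddings to the attention logits, in the style of Shaw et al.'s relative positions combined with a non-local block's residual connection. This is a minimal illustration only: the class name `PENLBlock`, the clipping distance `max_dist`, and the single-head linear projections are assumptions, and the paper's exact parameterisation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PENLBlock(nn.Module):
    """Hypothetical PE-NL-style block: non-local self-attention with
    clipped relative position embeddings and a residual connection."""

    def __init__(self, dim, max_dist=32):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        # One learned embedding per clipped relative distance in [-max_dist, max_dist].
        self.rel_pos = nn.Embedding(2 * max_dist + 1, dim)
        self.max_dist = max_dist
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (batch, seq_len, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Content-content attention logits, as in a standard non-local block.
        logits = torch.einsum('bqd,bkd->bqk', q, k)
        # Content-position logits from clipped relative distances.
        pos = torch.arange(x.size(1), device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist)
        r = self.rel_pos(rel + self.max_dist)    # (seq, seq, dim)
        logits = logits + torch.einsum('bqd,qkd->bqk', q, r)
        attn = F.softmax(logits * self.scale, dim=-1)
        # Residual connection keeps the block pluggable into existing backbones.
        return x + self.out(torch.einsum('bqk,bkd->bqd', attn, v))

# Example: enhance a batch of 4 clips, each with 64 frames of 128-d features.
x = torch.randn(4, 64, 128)
y = PENLBlock(dim=128)(x)                        # same shape as the input
```

Because the block preserves the input shape, it can be stacked or dropped into a feature pipeline before a cross-modal aggregation step such as MA.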
