Abstract

The self-similarity matrix (SSM) is becoming increasingly prevalent in understanding temporal representations; it has been used for a variety of video understanding tasks, such as classifying human actions, counting repetitions, and identifying generic event boundaries. Recently proposed SSM-based methods for Generic Event Boundary Detection (GEBD) [1] have obtained impressive results on the Kinetics-GEBD dataset. However, they demand a large model size and an immense number of computations to achieve good performance, making them difficult to deploy on edge devices. We introduce a projected SSM computed with cosine distance, which yields a compact representation of the SSM that can be interpreted by lighter transformer decoders. This paper presents a lightweight novel architecture with just 3M trainable parameters that uses the projected SSM to solve GEBD. In addition to its low computational cost, the experiments demonstrate that the architecture is invariant to the choice of feature extractor at inference time. The proposed method achieves a boost of 13.92% on the Kinetics-GEBD validation set with 3.5X fewer model parameters and 19.5X fewer multiply-add operations than the baseline [1]. We report competitive F1 results at unprecedented efficiency, with 22X fewer model parameters than state-of-the-art methods, and achieve the lowest inference time on both GPU and mobile devices.
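
The core idea described above, building a cosine-distance self-similarity matrix from per-frame features and projecting it to a compact representation for a lighter decoder, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the module name, the projection dimension, and the feature shapes are all assumptions for demonstration.

```python
# Minimal sketch of a projected cosine-distance SSM (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedSSM(nn.Module):
    def __init__(self, num_frames: int = 64, proj_dim: int = 32):
        super().__init__()
        # Project each T-dimensional row of the T x T SSM down to proj_dim
        # so a lighter transformer decoder can consume it (proj_dim assumed).
        self.proj = nn.Linear(num_frames, proj_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) per-frame features from any backbone extractor.
        feats = F.normalize(feats, dim=-1)        # unit-normalize each frame feature
        sim = feats @ feats.transpose(1, 2)       # (B, T, T) cosine similarity
        ssm = 1.0 - sim                           # cosine distance SSM
        return self.proj(ssm)                     # (B, T, proj_dim) projected SSM

# Example usage: 64 frames with 2048-d features from a frozen backbone (assumed sizes).
if __name__ == "__main__":
    x = torch.randn(2, 64, 2048)
    out = ProjectedSSM()(x)
    print(out.shape)  # torch.Size([2, 64, 32])
```

Because the SSM depends only on pairwise distances between frame features, a downstream decoder operating on the projected SSM is largely agnostic to which backbone produced the features, which is consistent with the feature-extractor invariance claimed in the abstract.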
