Abstract

The self-similarity matrix (SSM) is becoming increasingly prevalent in understanding temporal representations; it has been used for a variety of video understanding tasks, such as classifying human actions, counting repetitions, and identifying generic event boundaries. Recently proposed SSM-based methods for Generic Event Boundary Detection (GEBD) [1] have obtained impressive results on the Kinetics-GEBD dataset. However, they demand a large model size and an immense number of computations to achieve good performance, making them difficult to deploy on edge devices. We introduce a projected SSM computed with cosine distance, which yields a compact representation of the SSM that can be interpreted by lighter transformer decoders. This paper presents a lightweight novel architecture with just 3M trainable parameters that uses the projected SSM to solve GEBD. In addition to its low computational cost, the experiments demonstrate that the architecture is invariant to the choice of feature extractor at inference time. The proposed method achieves a boost of 13.92% on the Kinetics-GEBD validation set with 3.5X fewer model parameters and 19.5X fewer multiply-add operations than the baseline [1]. We report competitive F1 results at unprecedented efficiency, with 22X fewer model parameters than state-of-the-art methods, and achieve the lowest inference time on both GPU and mobile devices.
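
The core idea described above, building a cosine-distance self-similarity matrix from per-frame features and projecting it to a compact representation for a lighter decoder, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the module name, the projection dimension, and the feature shapes are all assumptions for demonstration.

```python
# Minimal sketch of a projected cosine-distance SSM (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectedSSM(nn.Module):
    def __init__(self, num_frames: int = 64, proj_dim: int = 32):
        super().__init__()
        # Project each T-dimensional row of the T x T SSM down to proj_dim
        # so a lighter transformer decoder can consume it (proj_dim assumed).
        self.proj = nn.Linear(num_frames, proj_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) per-frame features from any backbone extractor.
        feats = F.normalize(feats, dim=-1)        # unit-normalize each frame feature
        sim = feats @ feats.transpose(1, 2)       # (B, T, T) cosine similarity
        ssm = 1.0 - sim                           # cosine distance SSM
        return self.proj(ssm)                     # (B, T, proj_dim) projected SSM

# Example usage: 64 frames with 2048-d features from a frozen backbone (assumed sizes).
if __name__ == "__main__":
    x = torch.randn(2, 64, 2048)
    out = ProjectedSSM()(x)
    print(out.shape)  # torch.Size([2, 64, 32])
```

Because the SSM depends only on pairwise distances between frame features, a downstream decoder operating on the projected SSM is largely agnostic to which backbone produced the features, which is consistent with the feature-extractor invariance claimed in the abstract.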
