Abstract

Action segmentation consists of temporally segmenting a video and labeling each segmented interval with a specific action label. In this work, we propose a novel action segmentation method that requires no initial video analysis and no annotated data. Our proposal extracts features from videos using several pre-trained deep-learning models, including spatiotemporal and self-supervised methods. The data are then transformed using a positional encoder, and finally a clustering algorithm is applied, where each resulting cluster presumably corresponds to a single, distinguishable action. For self-supervised features, we explored DINO, and for spatiotemporal features, we investigated the I3D and SlowFast methods. Moreover, two different clustering algorithms (FINCH and KMeans) were investigated, and we also explored how varying the length of the video snippets that generate the feature vectors affects the quality of the segmentation. Experiments show that our method produces competitive results on the Breakfast and INRIA Instructional Videos dataset benchmarks. Our best result was produced using a composition of self-supervised features generated by DINO, FINCH clustering, and positional encoding.
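To make the pipeline concrete, the sketch below illustrates the general idea under several assumptions not stated in the abstract: snippet features are already extracted (e.g. DINO, I3D, or SlowFast embeddings of shape T×D), the positional encoding is a standard sinusoidal one combined by addition, the feature dimension is even, and the clustering step uses scikit-learn's KMeans with an assumed number of actions. The function names and parameters here are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans


def sinusoidal_positional_encoding(num_steps, dim):
    """Standard sinusoidal positional encoding; assumes `dim` is even."""
    positions = np.arange(num_steps)[:, None]                          # (T, 1)
    div_terms = np.exp(np.arange(0, dim, 2) * -(np.log(10000.0) / dim))
    pe = np.zeros((num_steps, dim))
    pe[:, 0::2] = np.sin(positions * div_terms)
    pe[:, 1::2] = np.cos(positions * div_terms)
    return pe


def segment_video(snippet_features, num_actions):
    """
    snippet_features: (T, D) array of per-snippet embeddings
                      (hypothetically pre-extracted with DINO / I3D / SlowFast).
    num_actions:      assumed number of actions to recover with k-means.
    Returns one cluster label per snippet; contiguous runs of the same
    label are read as action segments.
    """
    T, D = snippet_features.shape
    # Inject temporal order by adding a positional encoding to each snippet feature.
    encoded = snippet_features + sinusoidal_positional_encoding(T, D)
    # Cluster the encoded snippets; each cluster stands for one action.
    return KMeans(n_clusters=num_actions, n_init=10, random_state=0).fit_predict(encoded)


# Toy usage: random features stand in for real video embeddings.
features = np.random.randn(200, 64)
print(segment_video(features, num_actions=5))
```

The same structure would apply with FINCH in place of KMeans; FINCH does not require the number of clusters in advance, which is why it is attractive for fully unsupervised segmentation.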
