Abstract

This paper addresses fully supervised action segmentation. Transformers have been shown to have large model capacity and powerful sequence modeling abilities, and hence seem well suited to capturing action grammar in videos. However, their performance in video understanding still lags behind that of temporal convolutional networks, or ConvNets for short. We hypothesize that this is because: (i) ConvNets tend to generalize better than Transformers, and (ii) the Transformer's large model capacity requires significantly larger training datasets than existing action segmentation benchmarks provide. We propose a new hybrid model, TCTr, that combines the strengths of both frameworks. TCTr seamlessly unifies depth-wise convolution and self-attention in a principled manner. TCTr also addresses the Transformer's quadratic computational and memory complexity in the sequence length by learning to adaptively estimate attention from local temporal neighborhoods instead of all frames. Our experiments show that TCTr significantly outperforms the state of the art on the Breakfast, GTEA, and 50Salads datasets.
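To make the idea of combining depth-wise convolution with locally restricted self-attention concrete, the sketch below shows one plausible block in PyTorch. It is an illustrative assumption, not the authors' exact TCTr design: the window size, projection layout, the `LocalAttnConvBlock` name, and the simple summation used to fuse the two branches are all hypothetical choices for demonstration.

```python
# Hypothetical sketch: local windowed attention fused with a depth-wise
# temporal convolution. Not the authors' exact TCTr formulation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalAttnConvBlock(nn.Module):
    def __init__(self, channels: int, window: int = 9, kernel_size: int = 3):
        super().__init__()
        assert window % 2 == 1, "use an odd window so it is centred on each frame"
        self.window = window
        self.qkv = nn.Linear(channels, 3 * channels)
        # Depth-wise temporal convolution: one filter per channel (groups=channels).
        self.dwconv = nn.Conv1d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.out = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels)
        B, T, C = x.shape
        p = self.window // 2
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Gather a local window of keys/values around every frame.
        k_pad = F.pad(k, (0, 0, p, p))                 # (B, T + 2p, C)
        v_pad = F.pad(v, (0, 0, p, p))
        k_win = k_pad.unfold(1, self.window, 1)        # (B, T, C, W)
        v_win = v_pad.unfold(1, self.window, 1)

        # Attention restricted to the local neighbourhood: O(T * W), not O(T^2).
        scores = torch.einsum('btc,btcw->btw', q, k_win) / math.sqrt(C)
        idx = (torch.arange(T, device=x.device).unsqueeze(1)
               + torch.arange(self.window, device=x.device) - p)
        scores = scores.masked_fill((idx < 0) | (idx >= T), float('-inf'))
        attn = scores.softmax(dim=-1)
        local_ctx = torch.einsum('btw,btcw->btc', attn, v_win)

        # Fuse with the depth-wise convolution branch (simple sum here).
        conv_ctx = self.dwconv(x.transpose(1, 2)).transpose(1, 2)
        return x + self.out(local_ctx + conv_ctx)


# Example usage on a dummy clip of 200 frames with 64-dim features.
block = LocalAttnConvBlock(channels=64)
y = block(torch.randn(2, 200, 64))   # -> (2, 200, 64)
```

Restricting attention to a fixed window keeps cost linear in the number of frames; the actual model reportedly learns the neighborhood adaptively rather than fixing it, which this sketch does not attempt to reproduce.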
