Abstract

Fine-grained temporal action segmentation in long, untrimmed RGB videos is a key topic in visual human-machine interaction. Recent temporal-convolution-based approaches segment actions in videos using either an encoder-decoder (ED) architecture or dilation factors that double across consecutive convolution layers. However, ED networks operate at low temporal resolution, and dilations in successive layers cause gridding artifacts. We propose a depthwise separable temporal convolution network (DS-TCN) that operates at full temporal resolution with reduced gridding effects. The basic component of DS-TCN is the residual depthwise dilated block (RDDB). Using RDDB, we explore the trade-off between large kernels and small dilation rates. We show that DS-TCN efficiently captures both long-term dependencies and local temporal cues. Our evaluation on three benchmark datasets (GTEA, 50Salads, and Breakfast) demonstrates that DS-TCN outperforms the existing ED-TCN and dilation-based TCN baselines, even with comparatively fewer parameters.
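The trade-off between large kernels and small dilation rates, and the parameter savings of depthwise separable convolution, can be illustrated with simple arithmetic. The following is a minimal sketch, not the authors' implementation: the helper names, the channel width of 64, and the kernel and dilation values are all hypothetical choices made for illustration, and biases are ignored.

```python
# Illustrative arithmetic only; all sizes below are assumed, not taken from the paper.

def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1D convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

def layer_taps(kernel_size, dilation):
    """Relative input offsets read by one output frame of a single layer (odd kernel)."""
    half = kernel_size // 2
    return [j * dilation for j in range(-half, half + 1)]

def conv1d_params(c_in, c_out, k):
    """Weight count of a standard 1D convolution (bias omitted)."""
    return c_in * c_out * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise stage (one k-tap filter per channel) plus a pointwise 1x1 stage."""
    return c_in * k + c_in * c_out

# Ten layers of kernel size 3 with doubling dilations 1, 2, 4, ..., 512.
print(receptive_field([3] * 10, [2 ** i for i in range(10)]))  # 2047

# Two single layers that span the same 17-frame window:
sparse = layer_taps(3, 8)  # small kernel, large dilation: 3 frames sampled
dense = layer_taps(9, 2)   # large kernel, small dilation: 9 frames sampled
print(sparse)  # [-8, 0, 8]
print(dense)   # [-8, -6, -4, -2, 0, 2, 4, 6, 8]

# Weight counts for 64 input and 64 output channels.
print(conv1d_params(64, 64, 9))               # 36864
print(depthwise_separable_params(64, 64, 9))  # 4672
```

In this sketch, both single layers cover a 17-frame window, but the large-dilation layer reads only 3 of those frames while the large-kernel layer reads 9, which is one way to picture why sparse sampling can produce gridding. The depthwise separable factorization keeps the larger kernel affordable, since its weight count grows with the kernel size per channel rather than per channel pair.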


