Abstract

Action segmentation, an active research topic in human action analysis, aims to split a video into segments corresponding to different actions. Recent algorithms based on temporal convolution have achieved great success, but they weight local or global temporal information through additional modules and ignore the long-term and short-term connections that exist between actions. This paper proposes a U-Transformer structure based on multi-level refinement. It introduces neighborhood attention to learn the neighborhood information of adjacent frames and aggregates video frame features to process long-term sequence information effectively. We then propose a loss optimization strategy that smooths the original classification results and generates a more accurate calibrated sequence by introducing a pairwise similarity optimization method based on deep feature learning. In addition, we propose a timestamp-supervised training method that generates complete action information from pseudo-label predictions of action boundaries. Experiments on three challenging action segmentation datasets, 50Salads, GTEA, and Breakfast, show that our model outperforms state-of-the-art models, and that our weakly supervised model performs comparably to fully supervised ones.
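
To illustrate the idea of neighborhood attention over adjacent frames, the following is a minimal sketch and not the model described in this paper. It assumes 1D frame features, a fixed window `radius`, and no learned query/key/value projections; the function name `neighborhood_attention` and its parameters are hypothetical and chosen only for this example.

```python
import math


def neighborhood_attention(frames, radius=1):
    """Let each frame attend only to frames within `radius` steps of it.

    frames: list of T feature vectors (lists of floats). The same vectors
    serve as queries, keys, and values; a real model would apply learned
    projections first (omitted here as a simplifying assumption).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    out = []
    for t in range(num_frames):
        # Window of neighboring frame indices, clipped at the video borders.
        lo, hi = max(0, t - radius), min(num_frames, t + radius + 1)
        # Scaled dot-product scores against the neighbors only.
        scores = [
            sum(q * k for q, k in zip(frames[t], frames[j])) / math.sqrt(dim)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the neighbor features.
        out.append([
            sum(w * frames[j][d] for w, j in zip(weights, range(lo, hi)))
            for d in range(dim)
        ])
    return out
```

Because each output depends only on frames inside its window, the cost grows linearly with the number of frames for a fixed `radius`, rather than quadratically as in full self-attention. Longer-range context would then have to come from stacking such layers or from a multi-scale structure.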
