Spatiotemporal prediction is challenging because of the complex scene content and motion variation present in spatiotemporal data. Existing prediction methods struggle to forecast long-term outcomes accurately, particularly for transient motions with pronounced trends, such as hand lifts, jumps, or vehicle turns. To address these challenges, we introduce FDPNet, a Spatiotemporal Motion Prediction Network based on Multi-level Feature Disentanglement. The model divides spatiotemporal prediction into two stages: feature disentanglement and motion prediction. First, we devise a Multi-level Feature Disentanglement (MFD) model that disentangles the multi-level motion features of the temporal sequence into period, trend, and residual components. By decoupling the spatial and temporal factors, this disentanglement enables the network to grasp the true laws governing motion throughout the spatiotemporal evolution process. Second, to improve the network's prediction accuracy over extended horizons, we introduce the Motion Differential Self-Attention LSTM unit (MDSA-LSTM), which applies differential operations to extract inter-frame motion trends and uses an enhanced self-attention mechanism to strengthen the network's ability to capture long-range spatiotemporal correlations. FDPNet attains state-of-the-art performance on the Moving MNIST, UCF101, KITTI, and Caltech Pedestrian datasets. These results substantiate the substantial potential of this work for spatiotemporal prediction.
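To make the two core ideas concrete, the following is a minimal, hand-crafted sketch of (a) an additive period/trend/residual decomposition of a temporal signal and (b) a first-order inter-frame difference of the kind a differential operation would expose. Both the MFD module and MDSA-LSTM in the paper are learned networks; the function names, the moving-average trend, and the phase-averaging here are illustrative assumptions, not the paper's method.

```python
import numpy as np

def decompose(seq, period, window=3):
    """Hypothetical additive decomposition of a 1-D temporal signal into
    trend, periodic, and residual components (the paper's MFD module is
    learned; this hand-crafted version only illustrates the idea)."""
    # Trend: centered moving average over the sequence.
    kernel = np.ones(window) / window
    trend = np.convolve(seq, kernel, mode="same")
    # Periodic component: average the detrended signal at each phase.
    detrended = seq - trend
    phase_means = np.array([detrended[p::period].mean() for p in range(period)])
    periodic = np.tile(phase_means, len(seq) // period + 1)[: len(seq)]
    # Residual: whatever the trend and periodic parts do not explain.
    residual = seq - trend - periodic
    return trend, periodic, residual

def frame_difference(frames):
    """First-order inter-frame difference over a (T, H, W) frame stack,
    the kind of differential operation used to expose motion trends."""
    return frames[1:] - frames[:-1]

# Toy signal: linear trend plus a period-6 oscillation.
t = np.arange(24, dtype=float)
seq = 0.1 * t + np.sin(2 * np.pi * t / 6)
trend, periodic, residual = decompose(seq, period=6)
assert np.allclose(trend + periodic + residual, seq)  # additive by construction
```

In the paper these components come from a learned disentanglement rather than fixed filters, and the frame differences feed the self-attention pathway of MDSA-LSTM rather than being used directly.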