Motion understanding plays an important role in video-based cross-media analysis and multiple knowledge representation learning. This paper studies physical motion recognition and prediction with deep neural networks (DNNs) such as convolutional and recurrent neural networks. In physics, motion is the change in position relative to time. To isolate the motion itself from the moving object and the background in which it occurs, we focus on an ideal scenario in which a point moves in a plane. As our first contribution, we evaluate several popular DNN architectures from video research on modeling relative position change; the experimental results and conclusions offer insights for action recognition and video prediction. As our second contribution, we propose a vector network (VecNet) to model relative change in position. VecNet treats the motion within a short interval as a vector and, conversely, can move a point to its corresponding position given a vector representation. To represent motion over longer horizons, we use a long short-term memory (LSTM) network to aggregate or predict the vector representations over time. The resulting VecNet+LSTM approach effectively supports both recognition and prediction, showing that modeling relative position change is necessary for motion recognition and makes motion prediction easier.
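The core idea—treating the motion in a short interval as a vector that can be both inferred from and applied to point positions, then aggregated over time—can be sketched in plain Python. This is an illustrative stand-in, not the paper's implementation: position differencing stands in for VecNet's learned vector encoding, and simple summation stands in for the learned LSTM aggregation.

```python
# Hypothetical sketch of the VecNet+LSTM idea for a point moving in a plane.
# Differencing stands in for VecNet's learned encoder; summation stands in
# for the LSTM that aggregates vector representations over time.

def encode_motion(p_prev, p_next):
    """Represent the motion between two consecutive positions as a vector."""
    return (p_next[0] - p_prev[0], p_next[1] - p_prev[1])

def apply_motion(p, v):
    """Move a point to its corresponding position given a motion vector."""
    return (p[0] + v[0], p[1] + v[1])

def aggregate(vectors):
    """Combine per-interval vectors into a long-horizon motion summary.
    (An LSTM plays this role in the paper; summation is a simple stand-in.)"""
    return tuple(sum(c) for c in zip(*vectors))

# A point tracing a short trajectory in the plane.
trajectory = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5), (2.5, 3.0)]
vectors = [encode_motion(a, b) for a, b in zip(trajectory, trajectory[1:])]
total = aggregate(vectors)              # net displacement over the trajectory
end = apply_motion(trajectory[0], total)
print(end)                              # (2.5, 3.0): applying the aggregate recovers the endpoint
```

The decomposition mirrors the paper's design: short-interval vectors capture local relative position change, while a temporal module (here, a sum; in the paper, an LSTM) composes them for recognition or rolls them forward for prediction.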