Falls are closely associated with the high mortality rate of the elderly, so fall detection has become an important and urgent research area in human behavior recognition. However, existing fall detection methods lose detailed action information during feature extraction because of downsampling operations, resulting in poor performance when distinguishing falls from similar behaviors such as lying down and sitting. To address these challenges, this study proposes a high-resolution spatio-temporal feature extraction method based on a spatio-temporal coordinate attention mechanism. The method employs 3D convolutions to extract spatio-temporal features and uses gradual downsampling to generate multi-resolution sub-networks, thereby realizing multi-scale fusion and enhanced perception of details. In particular, this study designs a pseudo-3D basic block that approximates the capability of 3D convolution, maintaining the running speed of the network while controlling the number of parameters. Furthermore, a spatio-temporal coordinate attention mechanism is designed to accurately extract the spatio-temporal positional changes of key skeletal points and the interrelationships among them. Long-range dependencies in the horizontal, vertical, and temporal directions are captured through three one-dimensional global pooling operations. The long-range relationships and channel correlations among features are then captured by concatenation and slicing operations. Finally, key information is effectively highlighted by element-wise multiplication between the attention maps from the horizontal, vertical, and temporal directions and the input feature map. Experimental results on three typical public datasets show that the proposed method extracts motion features more effectively and improves the accuracy of fall detection.
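The spatio-temporal coordinate attention described above can be illustrated with a minimal PyTorch-style sketch. It assumes a 3D feature layout of shape (N, C, T, H, W); the module name, reduction ratio, and the use of a shared 1x1x1 convolution with sigmoid gating are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class STCoordAttention(nn.Module):
    """Sketch of a spatio-temporal coordinate attention block.

    Input and output shape: (N, C, T, H, W). Hypothetical implementation
    following the abstract: three 1-D global poolings, concatenation,
    slicing, and element-wise gating of the input features.
    """
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1x1 conv over the concatenated directional descriptors.
        self.conv1 = nn.Conv3d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm3d(mid)
        self.act = nn.ReLU(inplace=True)
        # Separate 1x1x1 convs produce one attention map per direction.
        self.conv_t = nn.Conv3d(mid, channels, kernel_size=1)
        self.conv_h = nn.Conv3d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv3d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, t, h, w = x.shape
        # Three 1-D global poolings: keep one axis, average over the other two.
        x_t = x.mean(dim=(3, 4), keepdim=True)                          # (N, C, T, 1, 1)
        x_h = x.mean(dim=(2, 4), keepdim=True).permute(0, 1, 3, 2, 4)   # (N, C, H, 1, 1)
        x_w = x.mean(dim=(2, 3), keepdim=True).permute(0, 1, 4, 3, 2)   # (N, C, W, 1, 1)
        # Concatenate along the pooled axis and encode jointly, capturing
        # long-range relationships and channel correlations.
        y = torch.cat([x_t, x_h, x_w], dim=2)                           # (N, C, T+H+W, 1, 1)
        y = self.act(self.bn1(self.conv1(y)))
        # Slice back into the three directional components.
        y_t, y_h, y_w = torch.split(y, [t, h, w], dim=2)
        a_t = torch.sigmoid(self.conv_t(y_t))                                 # (N, C, T, 1, 1)
        a_h = torch.sigmoid(self.conv_h(y_h.permute(0, 1, 3, 2, 4)))          # (N, C, 1, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 4, 3, 2)))          # (N, C, 1, 1, W)
        # Element-wise multiplication with the input highlights key positions.
        return x * a_t * a_h * a_w

# Usage sketch: a clip-level feature map of 64 channels over 16 frames.
if __name__ == "__main__":
    features = torch.randn(2, 64, 16, 32, 32)
    attention = STCoordAttention(64)
    out = attention(features)
    print(out.shape)  # torch.Size([2, 64, 16, 32, 32])
```

In this reading, each attention map attends to positions along only one axis, so multiplying all three back onto the input lets the block emphasize where and when key skeletal points move without discarding spatial or temporal resolution.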