Abstract

Activity recognition computer vision algorithms can be used to detect the presence of autism-related behaviors, including what diagnostic instruments term “restricted and repetitive behaviors”, or stimming. Examples of stimming include hand flapping, spinning, and head banging. One of the most significant bottlenecks for implementing such classifiers is the lack of sufficiently large training sets of human behavior specific to pediatric developmental delays. The data that do exist are usually recorded with a handheld camera that is shaky or moving, posing a challenge for traditional feature representation approaches to activity detection, which capture the camera's motion as a feature. To address these issues, we first document the advantages and limitations of current feature representation techniques for activity recognition when applied to head banging detection. We then propose a feature representation consisting exclusively of head pose keypoints. We create a computer vision classifier for detecting head banging in home videos using a time-distributed convolutional neural network (CNN) in which a single CNN extracts features from each frame in the input sequence, and these extracted features are fed as input to a long short-term memory (LSTM) network. On the binary task of predicting head banging versus no head banging within videos from the Self Stimulatory Behaviour Dataset (SSBD), we reach a mean F1-score of 90.77% using 3-fold cross-validation (with individual fold F1-scores of 83.3%, 89.0%, and 100.0%) while ensuring that no child who appeared in the train set was in the test set for any fold. This work documents a successful process for training a computer vision classifier which can detect a particular human motion pattern with few training examples, even when the camera recording the source clip is unstable.
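The subject-wise evaluation described above (no child appearing in both the train and test sets of any fold) can be sketched with scikit-learn's `GroupKFold`, grouping clips by child identity. The clip labels and child IDs below are made-up placeholders, not data from the SSBD:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical example: 9 clips from 5 children.
# labels: 1 = head banging, 0 = no head banging (illustrative values).
clip_ids = np.arange(9)
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1])
child_id = np.array([0, 0, 1, 1, 2, 2, 3, 4, 4])  # which child appears in each clip

# GroupKFold guarantees every child's clips land entirely in train or test.
for train_idx, test_idx in GroupKFold(n_splits=3).split(clip_ids, labels, groups=child_id):
    assert set(child_id[train_idx]).isdisjoint(child_id[test_idx])
```

Grouping by child rather than by clip prevents the classifier from being credited for recognizing a specific child instead of the motion pattern itself.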
The process of engineering useful feature representations by visually inspecting them, as described here, can be a useful practice for designers and developers of systems that detect human motion patterns in mobile and ubiquitous interactive applications.
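The time-distributed CNN + LSTM architecture described in the abstract can be sketched as follows. This is a minimal PyTorch illustration, not the paper's implementation: the layer sizes, input shape, and the assumption that each frame is a single-channel rendering of head pose keypoints are all illustrative choices.

```python
import torch
import torch.nn as nn

class HeadBangingClassifier(nn.Module):
    """Sketch of a time-distributed CNN feeding an LSTM (sizes are illustrative)."""

    def __init__(self, in_channels: int = 1, hidden: int = 64):
        super().__init__()
        # One CNN shared across all frames ("time-distributed").
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),  # -> 32 * 4 * 4 = 512 features per frame
        )
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # binary: head banging vs. no head banging

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))   # same CNN applied to every frame
        feats = feats.view(b, t, -1)            # (batch, time, 512)
        out, _ = self.lstm(feats)               # sequence model over frame features
        return torch.sigmoid(self.head(out[:, -1]))  # probability from last step

# Usage with random stand-in data: 2 clips of 8 frames of 32x32 keypoint maps.
model = HeadBangingClassifier()
probs = model(torch.randn(2, 8, 1, 32, 32))  # shape (2, 1), values in [0, 1]
```

The key design point is that a single CNN's weights are reused on every frame, so the per-frame feature extractor stays small enough to train from few examples, while the LSTM captures the repetitive temporal structure of the motion.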
