Abstract

We introduce a novel approach to human fall detection in naturally occurring scenes. The problem is important because falls cause thousands of deaths every year, and vision-based approaches offer a promising and effective way to detect them. We cast fall detection as an action detection problem and additionally localize the temporal extent of each fall, exploiting the effectiveness of deep networks. In the training stage, trimmed video clips of the four phases of a fall (standing, falling, fallen, and not moving) are converted into four categories of so-called dynamic images, which are used to train a deep ConvNet that scores and predicts the label of each dynamic image. In the testing stage, a sliding window over an untrimmed video generates a set of sub-videos, each of which is converted into a dynamic image. Based on the labels predicted by the trained ConvNet, a video is classified as containing a fall when the four phases are detected in sequence. To localize the temporal extent of the event, we propose a difference score method (DSM) based on adjacent dynamic images in the temporal sequence. We also collect a new dataset, the YouTube Fall Dataset (YTFD), which contains 430 falling incidents and 176 normal activities, and use it to train the deep network to detect falling humans. We perform experiments on datasets of varying complexity: the Le2i fall detection dataset, the multiple cameras fall dataset, the high quality fall simulation dataset, and our own YouTube Fall Dataset. The results demonstrate the effectiveness and efficiency of our approach.
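The abstract does not spell out how a sub-video is collapsed into a dynamic image; a common construction in the literature is approximate rank pooling, a weighted sum of the frames in a window. The sketch below combines that construction with the sliding-window step described above. The function names, window length, and stride are illustrative assumptions, not the authors' exact parameters.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a window of frames into one dynamic image via approximate
    rank pooling: a weighted sum with coefficients alpha_t = 2t - T - 1
    for t = 1..T (later frames weighted positively, earlier negatively)."""
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    alphas = 2 * np.arange(1, T + 1) - T - 1  # e.g. T=4 -> [-3, -1, 1, 3]
    return np.tensordot(alphas, frames, axes=1)

def sliding_dynamic_images(video, window, stride):
    """Slide a fixed-length window over an untrimmed video and convert
    each resulting sub-video into a single dynamic image."""
    return [dynamic_image(video[s:s + window])
            for s in range(0, len(video) - window + 1, stride)]

# Toy example: an "untrimmed video" of 20 random 4x4 frames.
video = np.random.rand(20, 4, 4)
dis = sliding_dynamic_images(video, window=8, stride=4)
print(len(dis), dis[0].shape)  # 4 dynamic images, each 4x4
```

Note that the coefficients sum to zero, so a static sub-video yields an all-zero dynamic image; the dynamic image encodes motion, which is what lets the ConvNet distinguish the falling phases from standing or lying still.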
