Abstract

Physiological and psychophysical evidence suggests that early visual cortex compresses the visual input on the basis of spatially and orientation-tuned filters. Recent computational advances suggest that these neural response characteristics may reflect a "sparse coding" architecture, in which only a small number of neurons need to be active for any given image, capturing critical structure latent in natural scenes. Here we present a novel neural network architecture combining a sparse-filtering model with locally competitive algorithms (LCAs), and demonstrate the network's ability to classify human actions from video. Sparse filtering is an unsupervised feature learning algorithm that directly optimizes the sparsity of the feature distribution without needing to model the data distribution. LCAs are defined by a system of differential equations in which the initial conditions define an optimization problem and the dynamics converge to a sparse decomposition of the input vector. We applied this architecture to train a classifier on categories of motion in human action videos. Inputs to the network were small 3D patches taken from frame differences in the videos. A dictionary was learned for each action class, and the activation levels of each dictionary were assessed during reconstruction of a novel test patch. Overall, classification accuracy was approximately 97%. We discuss how this sparse-filtering approach provides a natural framework for multisensory and multimodal data processing, including RGB video, RGBD video, hyperspectral video, and stereo audio/video streams.
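
The LCA dynamics and the dictionary-per-class decision rule described above can be made concrete with a short sketch. The following is a minimal NumPy illustration, assuming the standard LCA formulation (soft-thresholded membrane potentials with lateral inhibition, as in Rozell et al.); the function names `lca_sparse_code` and `classify_patch` and the parameters `lam`, `tau`, and `n_steps` are illustrative choices, not the authors' code.

```python
import numpy as np

def soft_threshold(u, lam):
    """Thresholding activation used in many LCA variants:
    a_i = sign(u_i) * max(|u_i| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lca_sparse_code(x, Phi, lam=0.1, tau=10.0, n_steps=200):
    """Sketch of a locally competitive algorithm (LCA).

    Integrates the membrane dynamics
        du/dt = (1/tau) * (Phi.T @ x - u - (Phi.T @ Phi - I) @ a),
    where a = soft_threshold(u, lam), so that the activations a
    converge to a sparse decomposition x ~= Phi @ a.
    """
    n_atoms = Phi.shape[1]
    b = Phi.T @ x                      # feed-forward driving input
    G = Phi.T @ Phi - np.eye(n_atoms)  # lateral inhibition (competition)
    u = np.zeros(n_atoms)              # membrane potentials
    for _ in range(n_steps):
        a = soft_threshold(u, lam)     # thresholded activations
        u += (b - u - G @ a) / tau     # Euler step of the dynamics
    return soft_threshold(u, lam)

def classify_patch(x, class_dicts, lam=0.1):
    """Hypothetical decision rule: code the test patch with each
    class dictionary and pick the class whose sparse code best
    reconstructs it (lowest residual)."""
    residuals = {}
    for label, Phi in class_dicts.items():
        a = lca_sparse_code(x, Phi, lam=lam)
        residuals[label] = np.linalg.norm(x - Phi @ a)
    return min(residuals, key=residuals.get)
```

In this sketch, `x` would be a vectorized 3D frame-difference patch and `class_dicts` a mapping from action label to its learned dictionary; scoring by reconstruction residual is one plausible reading of "activation levels for each dictionary were assessed during reconstruction," and other scoring rules (e.g., total activation energy) would fit the same pipeline.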
