Abstract

Over the past three decades, the field of neuromorphic engineering has produced sensors and processors that show great promise as efficient, brain-inspired systems. In parallel to this development, tremendous advances in Deep Learning (DL) have supplied highly accurate algorithms for computer vision. Unfortunately, these algorithms are not directly compatible with neuromorphic hardware. The present work bridges this gap by developing algorithms that leverage the power of Deep Learning while being suited for operation on neuro-inspired hardware.

The Dynamic Vision Sensor (DVS), a neuromorphic sensor used in this thesis, differs radically from conventional cameras by producing a stream of asynchronous pixel events rather than regularly spaced light-intensity frames. These events signal changes in local brightness at high temporal resolution and wide dynamic range, which makes the sensor well suited for applications with spatio-temporal redundancy, difficult lighting conditions, or fast reaction times. However, the event-based nature of the sensor output impedes the application of standard computer vision techniques like Deep Neural Networks (DNNs).

In this thesis we demonstrate that DNNs can be adapted to operate in an event-based fashion, similar to how neural networks in the brain use discrete spikes for signal transmission. By converting Artificial Neural Networks (ANNs) trained with DL into Spiking Neural Networks (SNNs), we achieve some of the largest and most accurate spiking models for object classification to date. While information in SNNs for computer vision tasks is often encoded in the form of firing rates, the nervous system is known to employ other spike codes optimized for the requirements of a particular sensory pathway. Motivated by evidence that visual processing in humans occurs within milliseconds, we explore encoding schemes that use the precise timing of individual spikes to represent information. We show that high classification accuracy can be achieved in artificial systems based on few spikes per neuron.

Part of the widespread success of DL can be attributed to the ease with which algorithms and hardware are made accessible today. A beginner can readily find an online notebook that enables them to build, train, and run a full-scale DNN on a remote GPU within minutes, without any overhead for setting up software and hardware. Achieving the same on a neuromorphic platform requires an intricate understanding of the underlying hardware constraints and a great deal of manual, low-level programming. One theme of this thesis is the reduction of these obstacles by providing automated tools to convert DNNs to the spike domain and to deploy them on neuromorphic hardware. Here, we mainly consider the Intel research processor Loihi, for which we developed a DNN compilation framework. Much of the previous work on SNNs is confined to simulations on general-purpose hardware, which allow no reliable characterization of the actual latency and power consumption of SNNs on dedicated hardware. By means of the toolchain developed here, we are able to perform such benchmarking on standard tasks from computer vision.

Though SNNs operate on spike events internally, they may receive conventional image frames as input. A more consistent approach is to use event-based input, e.g. from a DVS. In this work we discuss some of the benefits and challenges one can expect when combining event-based sensing and processing in this way.
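To make the combination of event-based sensing and spike-based processing concrete, the following is a minimal sketch of how a converted layer might consume DVS events. It assumes a toy 128x128 sensor, illustrative layer sizes, a unit firing threshold, and events as (x, y, timestamp, polarity) tuples; reset-by-subtraction is one common choice in the ANN-to-SNN conversion literature for preserving a rate code. All names and values here are illustrative, not the exact pipeline of the thesis.

```python
import numpy as np

# Weights of one fully-connected layer, standing in for weights taken
# from a trained ANN. In rate-based ANN-to-SNN conversion, the weights
# are reused and the analog activations are approximated by the spike
# rates of integrate-and-fire (IF) neurons.
rng = np.random.default_rng(0)
num_pixels, num_hidden = 128 * 128, 100
weights = rng.normal(0.0, 0.1, size=(num_pixels, num_hidden))

v_mem = np.zeros(num_hidden)      # membrane potentials
threshold = 1.0                   # firing threshold (illustrative)
spike_counts = np.zeros(num_hidden)

# Toy DVS event stream: (x, y, timestamp_us, polarity) tuples.
events = [(64, 64, 10, 1), (65, 64, 12, -1), (64, 65, 15, 1)]

for x, y, t, p in events:
    # Each event drives only the synapses of a single input pixel, so
    # the update cost scales with scene activity, not with image size.
    v_mem += p * weights[y * 128 + x]
    fired = v_mem >= threshold
    spike_counts += fired
    v_mem[fired] -= threshold     # reset by subtraction preserves rates
```

The key property this sketch illustrates is that computation is triggered only where and when brightness changes occur, which is what makes the sensor-processor pairing attractive for sparse, fast-changing scenes.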
In applications ranging from data compression to optical flow and Spiking Neural Networks, we demonstrate computational savings when operating on sparse, informative events rather than dense, redundant frames. Finally, we turn to a biological vision system, the retina, and show that a Dynamic Vision Sensor can be used to drive mouse Retinal Ganglion Cells in vitro, thereby opening a door for applications in retinal prostheses.

One recurrent theme in this thesis is the reduction of the computational cost of neural networks. In a final study we ask whether the principle of sparse, event-driven updating can be transferred to standard ANNs without the use of spiking neurons. Inspired by how the DVS removes spatio-temporal redundancy from video, we apply a dynamic masking scheme to the layers of a DNN to reduce the number of operations during inference (see the sketch below). The algorithm is shown to achieve equivalent accuracy at reduced computational cost on a range of vision tasks, including human pose estimation and object detection in static and dynamic scenes.

This thesis contributes overall to a fruitful exchange between conventional computer vision on one hand and neuromorphic sensors and processors on the other. Both fields, to varying degrees, share the motive for ever-increasing efficiency, and many of the seeming restrictions of dedicated hardware, like reduced numeric precision, turn out to be desirable from an algorithmic perspective. Ultimately, we cherish the hope that in building massively constrained neuromorphic systems, we will one day understand more clearly how our brain accomplishes its tasks within a minimal space and energy budget.
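The sketch below illustrates the general idea of event-driven updating in a standard, non-spiking layer. It assumes a simple thresholded delta scheme, where only inputs that changed by more than a fixed amount since the previous frame trigger updates; the layer sizes, threshold policy, and names are illustrative assumptions, not the exact algorithm developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(784, 100))  # toy dense layer

prev_input = np.zeros(784)   # last transmitted input values
output = np.zeros(100)       # cached linear layer output
delta = 0.05                 # change threshold; larger -> fewer updates

def masked_update(frame):
    """Update the layer output using only significantly changed inputs."""
    change = frame - prev_input
    mask = np.abs(change) > delta           # dynamic mask over inputs
    # Cost scales with the number of changed pixels, not the frame size.
    output[:] += change[mask] @ weights[mask]
    # Commit only the transmitted values, so sub-threshold changes
    # accumulate until they eventually cross the threshold.
    prev_input[mask] = frame[mask]
    return output
```

For a static scene, successive frames produce an almost empty mask and nearly zero compute, mirroring how a DVS emits no events when nothing moves; in deeper networks, nonlinearities mean the same delta trick must be applied per layer rather than propagated exactly.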