The demand for distributed vision systems at ever-larger scales requires sensor nodes capable of executing high-level visual tasks (e.g., object detection for visual monitoring). Such tasks routinely incur the high computational cost of Convolutional Neural Networks (CNNs), which conflicts with the tight power budgets and limited memory capacity of sensor nodes at the edge. This paper introduces an approach to reduce the computational cost of object detection on CNN accelerators available in edge devices. The proposed approach induces additional feature-map-level sparsity (i.e., computation skipping) at inference time by exploiting temporal correlation among frames. It offers an uncommonly favorable computation-memory tradeoff, as significant computation reduction is achieved at the cost of very little additional memory for the storage of intermediate features. As further benefits, no architectural changes or retraining are required, allowing immediate deployment in existing vision frameworks and eliminating the need to store multiple models. Results show that the proposed TempDiff method achieves up to 37% computation reduction with a 1.1% accuracy drop for the SSD (VGG16) object detection network on both the VIRAT and ImageNet-VID datasets. Similarly, 18.3% (35.8%) computation reduction at 3.3% (3.2%) memory overhead and a 3.8% (6.8%) accuracy drop is achieved for YOLOv1 (VGG16) and SSD (VGG16), respectively, on the CAMEL dataset. Furthermore, up to 58% computation reduction with a 2% accuracy drop and 3.7% memory overhead is achieved for the YOLOv3-Tiny network on the ImageNet-VID dataset.
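To make the skipping mechanism concrete, below is a minimal NumPy sketch of temporal-difference-based feature-map reuse. It is illustrative only, not the paper's exact algorithm: the class name TempDiffLayer, the single mean-absolute-difference test over the layer input, and the thresh parameter are assumptions introduced here for exposition.

```python
import numpy as np

def conv_channel(x, w, b):
    """Naive 'valid' convolution producing one output feature map.
    x: (C, H, W) input, w: (C, k, k) filter, b: scalar bias."""
    _, H, W = x.shape
    k = w.shape[-1]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[:, i:i + k, j:j + k] * w).sum() + b
    return out

class TempDiffLayer:
    """One conv layer with temporal feature-map reuse (illustrative sketch)."""

    def __init__(self, weights, biases, thresh):
        self.weights = weights   # (C_out, C_in, k, k) filter bank
        self.biases = biases     # (C_out,) biases
        self.thresh = thresh     # assumed mean-abs-difference skip threshold
        self.prev_x = None       # cached layer input from the previous frame
        self.prev_out = None     # cached output feature maps (the small
                                 # intermediate-feature memory overhead)

    def forward(self, x):
        # Skip computation when the input barely changed since the last frame,
        # i.e., when temporal correlation between consecutive frames is high.
        skip = (self.prev_x is not None
                and np.abs(x - self.prev_x).mean() < self.thresh)
        if skip:
            out = self.prev_out  # reuse cached feature maps, no convolution
        else:
            out = np.stack([conv_channel(x, w, b)
                            for w, b in zip(self.weights, self.biases)])
        self.prev_x, self.prev_out = x, out
        return out
```

Note how the sketch reflects the tradeoff claimed in the abstract: the only extra state is the cached input and output feature maps, and the pretrained weights are used unchanged, so no retraining or architectural modification is involved.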