Visual sensor networks (VSNs) are innovative networks founded on a broad range of areas such as networking, imaging, and database systems. These networks demand well-defined architectures for sensor node and camera deployment, image capture and processing, and well-organized distributed systems. Existing VSN architectures fall short of these demands because they are limited in approach and in design. In this paper, we propose VISTA, a distributed multi-layer vision architecture aimed at constructing the cumulative vision of mobile objects (MOs). VISTA realizes silhouette recognition of mobile targets through (a) planned deployment of sensor nodes (SNs), with SNs equipped with sonar sensors and fixed-view (FV) on-board cameras at the periphery of the region of interest (RoI) and SNs with only on-board cameras within the RoI, (b) pre-distribution of silhouettes of known objects across SNs, (c) sonar-based presence detection of an MO at the outskirts of the RoI, (d) MO silhouette capture and matching at an interior node to determine the percentage match, (e) subsequent activation of the next interior cameras to improve the percentage match, and (f) termination of further activation once recognition of the MO reaches a threshold. Experimental evaluation of our image processing algorithms against baseline algorithms with respect to execution time and memory shows a significant reduction in image data and memory occupancy. Experiments also show that a true match is fully achieved under broad daylight conditions and large backgrounds when our proposed background subtraction and pixel reduction techniques are used. The mobility-driven behavior of VISTA's associated network layer algorithms is simulated in the NS2 network simulator by representing the surety of MO identification as a function of the number of cameras, database size and distribution, the MO's trajectory, stored perspectives, and network depth.
The simulation results show that the surety of target identification doubles and, in some situations, increases manifold as the number of deployed silhouettes increases relative to the baseline database size and mobility model. The results substantiate that VISTA is a suitable architecture for low-cost, autonomous, and efficient human and asset monitoring and surveillance, friend-or-foe (FoF) identification, and target tracking systems.
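
The activation sequence in steps (c) through (f) can be illustrated with a minimal Python sketch. It is not the VISTA implementation, and several details are assumptions made only for illustration: silhouettes are modeled as sets of foreground pixel coordinates, the percentage match is computed as intersection over union, the cumulative match is taken as the best score seen so far, and a single dictionary stands in for the silhouettes pre-distributed across SNs. The names `percent_match`, `best_match`, and `identify`, and the 80% threshold, are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Silhouette = Set[Pixel]  # foreground pixel coordinates of a binary mask (assumed model)


def percent_match(captured: Silhouette, stored: Silhouette) -> float:
    """Overlap of two silhouettes as a percentage.

    Assumption: intersection over union is a stand-in metric; the abstract
    does not define how the percentage match is computed.
    """
    union = captured | stored
    if not union:
        return 0.0
    return 100.0 * len(captured & stored) / len(union)


def best_match(
    captured: Silhouette, database: Dict[str, List[Silhouette]]
) -> Tuple[Optional[str], float]:
    """Return the label and score of the closest stored silhouette."""
    best_label, best_score = None, 0.0
    for label, views in database.items():
        for view in views:
            score = percent_match(captured, view)
            if score > best_score:
                best_label, best_score = label, score
    return best_label, best_score


def identify(
    sonar_detected: bool,
    captures: Iterable[Silhouette],
    database: Dict[str, List[Silhouette]],
    threshold: float = 80.0,  # hypothetical recognition threshold
) -> Tuple[Optional[str], float, int]:
    """Activate interior cameras one at a time until the match reaches the threshold.

    Returns (label, best score so far, number of cameras activated).
    """
    if not sonar_detected:  # step (c): no presence at the periphery, nothing to do
        return None, 0.0, 0
    label, score, used = None, 0.0, 0
    for captured in captures:  # steps (d) and (e): one capture per interior camera
        used += 1
        cand_label, cand_score = best_match(captured, database)
        if cand_score > score:
            label, score = cand_label, cand_score
        if score >= threshold:  # step (f): stop activating further cameras
            break
    return label, score, used


# Toy data: one stored view per known object, three interior captures.
database = {
    "person": [{(0, 0), (0, 1), (1, 0), (1, 1)}],
    "vehicle": [{(5, 5), (5, 6)}],
}
captures = [
    {(0, 0), (0, 1)},                  # partial view: 50% match to "person"
    {(0, 0), (0, 1), (1, 0), (1, 1)},  # full view: 100% match to "person"
    {(5, 5)},                          # never evaluated, threshold already met
]
print(identify(True, captures, database))  # ('person', 100.0, 2)
```

In this toy run the third camera is never activated because the second capture already exceeds the threshold, which mirrors the early termination described in step (f).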