The adult brain appears to exhibit location invariance: no matter where on the retina an object appears, the brain recognizes the object. Yet this does not mean that the brain discards location information, since it needs that information for actions such as arm reaching. Based largely on brain-lesion studies, Mishkin and coworkers [1] reported in 1983 that the dorsal and ventral streams of the brain process space ("where") and object ("what") information, respectively. Many later experimental studies verified and enriched this discovery, but how these two streams work and learn has remained elusive (Deco & Rolls [2]). Feedback connections are known to be widely present along these streams, but a computational understanding and analysis of them are lacking.

On the other hand, the sensory cortex alone seems to use distributed representations. Each feature neuron has a receptive field corresponding to a patch of the retina, and multiple nearby neurons whose receptive fields almost completely overlap detect different features of those patches (e.g., each tuned to a different edge orientation). However, such distributed "patch representations" must be combined somehow to give rise to behaviors that demonstrate invariant object recognition. Anne Treisman [3] and David Van Essen et al. [4] proposed the existence of a master feature map.

Following neuroanatomical data, our visuomotor model, the Where-What Network (WWN), suggests that to understand the causality of the above phenomena, it is beneficial to go beyond the posterior parietal (PP) and inferotemporal (IT) cortices to include the premotor and motor areas in the frontal cortex. We introduce two motor areas as integral parts of cortical object representation: location motor (LM) and type motor (TM). The former corresponds to the frontal eye field (FEF) and the location-relevant control areas in the premotor and motor cortices. The latter corresponds to the ventral frontal cortex (VFC) and the verbal control areas in the premotor and motor cortices. The dorsal stream plus LM learns type invariance and location specificity (e.g., for arm reaching). The ventral stream plus TM learns location invariance and type specificity (e.g., for pronouncing the object type). Bottom-up and top-down connections from LM and TM dynamically wire and shape the corresponding streams, resulting in complementary representations: what is invariance for one stream is specificity for the other.

WWNs were tested on the tightly intertwined problems of attention and recognition in the presence of complex backgrounds. Attention and recognition have each been modeled separately in previous work (e.g., visual saliency guiding covert attention shifts), but how the visual cortex deals with both conjunctively in complex natural backgrounds has remained elusive. WWN gives the first biologically plausible theory for this joint problem. With general objects in complex, new backgrounds, WWN achieved a 95% classification rate with under 2-pixel location error, even though about 75% of each image area came from unknown complex backgrounds. Each WWN epigenetically generates and adapts emergent representations using Hebbian-like neuronal learning mechanisms, as sketched below. WWN explains how top-down attention originates: from LM for location-based attention and from TM for type-based attention. This model does not need the appearance-keeping internal master feature map proposed earlier.
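To make the Hebbian-like mechanism concrete, the following is a minimal sketch of one internal (feature) area that receives both bottom-up (sensory) input and top-down (motor, e.g., LM/TM) input, competes laterally, and adapts the winner's weights. It is an illustration of the general idea only, under stated assumptions: the equal bottom-up/top-down weighting, the winner-take-all rule, the age-based learning rate, and all names and dimensions here are our assumptions, not the exact WWN algorithm.

```python
# Hypothetical sketch: Hebbian-like learning with combined bottom-up and
# top-down input, in the spirit of the WWN description above.
import numpy as np

rng = np.random.default_rng(0)

N_NEURONS = 16     # neurons in the internal (feature) area
DIM_BOTTOM = 64    # bottom-up input dimension (e.g., a retinal patch)
DIM_TOP = 8        # top-down input dimension (e.g., an LM/TM motor vector)

# Each neuron keeps one bottom-up and one top-down weight vector.
w_bottom = rng.random((N_NEURONS, DIM_BOTTOM))
w_top = rng.random((N_NEURONS, DIM_TOP))
age = np.ones(N_NEURONS)  # per-neuron firing age, used to decay learning rate


def unit(v):
    """Normalize a vector to unit length (guarding against zero vectors)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def step(x_bottom, x_top, learn=True):
    """One update: compute pre-responses, pick a winner, adapt its weights."""
    xb, xt = unit(x_bottom), unit(x_top)
    # Pre-response: match between normalized weights and normalized inputs,
    # combining bottom-up and top-down evidence (equal weighting assumed).
    r = 0.5 * (np.array([unit(w) @ xb for w in w_bottom])
               + np.array([unit(w) @ xt for w in w_top]))

    # Lateral competition: winner-take-all (a stand-in for top-k competition).
    win = int(np.argmax(r))

    if learn:
        # Hebbian-like rule: the firing (winning) neuron moves its weights
        # toward the current pre-synaptic inputs; the rate decays with age.
        lr = 1.0 / age[win]
        w_bottom[win] = (1 - lr) * w_bottom[win] + lr * xb
        w_top[win] = (1 - lr) * w_top[win] + lr * xt
        age[win] += 1
    return win, r[win]


# Toy usage: repeated pairings of a sensory patch with a motor vector
# gradually recruit one neuron tuned to that ("what", "where") combination.
patch = rng.random(DIM_BOTTOM)
motor = np.zeros(DIM_TOP)
motor[3] = 1.0  # e.g., one LM/TM firing pattern
for _ in range(20):
    winner, resp = step(patch, motor)
print("winner neuron:", winner, "response:", round(resp, 3))
```

Note how the top-down term plays the role described in the text: when LM or TM fires, it biases which internal neurons win the competition and therefore which neurons adapt, so the motor areas dynamically shape the representations that emerge in each stream.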