Selective attentional biases arising in one sensory modality manifest in others. The effects of visuospatial attention, important in visual object perception, remain unclear in the auditory domain during audiovisual (AV) scene processing. We investigated the temporal and spatial factors that underlie the neural basis of such cross-modal transfer. Auditory encoding of random tone pips in AV scenes was assessed with a temporal response function (TRF) model of participants' electroencephalograms (N = 30). The spatially uninformative pips were associated with spatially distributed visual contrast reversals ("flips") via asynchronous, probabilistic AV onset distributions. Participants deployed visuospatial selection on these AV stimuli to perform a task. A late (~300 ms) cross-modal influence on the neural representation of the pips was found in both the original and a replication study (N = 21). Transfer depended on the selected visual input being (i) presented during or shortly after a related sound, within relatively narrow temporal distributions (<165 ms), and (ii) positioned within limited (1:4) visual foreground-to-background ratios. Neural encoding of auditory input, as a function of visual input, was largest at visual foreground quadrant sectors and lowest at locations opposite the target. The results indicate that ongoing neural representations of sounds incorporate visuospatial attributes for auditory stream segregation, as cross-modal transfer conveys information that specifies the identity of multisensory signals. A potential mechanism is the enhancement or recalibration of the tuning properties of the auditory populations that represent sounds as objects. The results account for the dynamic evolution of multisensory integration under visual attention, specifying critical latencies at which the relevant cortical networks operate.
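For readers unfamiliar with the TRF approach named above, the general technique can be sketched as a ridge-regularized lagged regression that recovers a stimulus-to-EEG kernel. The sketch below is illustrative only: the sampling rate, lag window, noise level, and regularization strength are assumptions, not the study's parameters, and the simulated kernel merely mimics a late (~300 ms) response for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 64                                # assumed sampling rate (Hz)
n = fs * 120                           # two minutes of simulated data
lags = np.arange(0, int(0.4 * fs))     # 0-400 ms lag window (assumption)

# Sparse impulse train standing in for tone-pip onsets
stim = (rng.random(n) < 0.02).astype(float)

# Ground-truth kernel with a late (~300 ms) peak, echoing the reported effect
t = lags / fs
true_trf = np.exp(-((t - 0.3) ** 2) / (2 * 0.03 ** 2))

# Simulated EEG: stimulus convolved with the kernel, plus Gaussian noise
eeg = np.convolve(stim, true_trf)[:n] + 0.5 * rng.standard_normal(n)

# Lagged design matrix: column j holds the stimulus delayed by lags[j] samples
X = np.zeros((n, len(lags)))
for j, lag in enumerate(lags):
    X[lag:, j] = stim[:n - lag]

# Ridge solution: w = (X'X + lambda*I)^-1 X'y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)

# The estimated TRF should closely track the true kernel
r = np.corrcoef(w, true_trf)[0, 1]
print(f"correlation with true kernel: {r:.2f}")
```

In practice, toolboxes such as MNE-Python's receptive-field utilities or the mTRF toolbox implement this estimation with cross-validated regularization; the closed-form ridge solve above is only the core idea.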