In typical spatial orienting tasks, crossmodal (e.g., audiovisual) stimuli evoke greater pupil dilation and microsaccade inhibition than unisensory (e.g., visual) stimuli. This characteristic pupil dilation and microsaccade inhibition have been observed in response to “salient” events or stimuli. Although the “saliency” account is appealing in the spatial domain, whether it also holds in the temporal domain remains largely unknown. Here, on a brief temporal scale (within 1 s) and under conditions that engage involuntary temporal attention, we investigated how eye metrics (microsaccades and pupil size) reflect the temporal dynamics of perceptual organization, with and without multisensory integration. We adopted the crossmodal freezing paradigm using classical Ternus apparent motion. Results showed that synchronous beeps biased perceptual reports toward group motion and triggered prolonged sound-induced oculomotor inhibition (OMI), whereas sound-induced OMI was not evident in a task without crossmodal integration (visual localization without audiovisual integration). A general pupil dilation response was observed in the presence of sounds in both the visual Ternus motion categorization task and the visual localization task. This study provides the first empirical account of crossmodal integration obtained by capturing microsaccades within a brief temporal scale; OMI, but not the pupillary dilation response, characterizes task-specific audiovisual integration (as shown by the crossmodal freezing effect).