Learning in a multisensory world is challenging as the information from different sensory dimensions may be inconsistent and confusing. By adulthood, learners optimally integrate bimodal (e.g. audio-visual, AV) stimulation by both low-level (e.g. temporal synchrony) and high-level (e.g. semantic congruency) properties of the stimuli to boost learning outcomes. However, it is unclear how this capacity emerges and develops. To approach this question, we examined whether preverbal infants were capable of utilizing high-level properties with grammar-like rule acquisition. In three experiments, we habituated pre-linguistic infants with an audio-visual (AV) temporal sequence that resembled a grammar-like rule (A-A-B). We varied the cross-modal semantic congruence of the AV stimuli (Exp 1: congruent syllables/faces; Exp 2: incongruent syllables/shapes; Exp 3: incongruent beeps/faces) while all the other low-level properties (e.g. temporal synchrony, sensory energy) were constant. Eight- to ten-month-old infants only learned the grammar-like rule from AV congruent stimuli pairs (Exp 1), not from incongruent AV pairs (Exp 2, 3). Our results show that similar to adults, preverbal infants’ learning is influenced by a high-level multisensory integration gating system, pointing to a perceptual origin of bimodal learning advantage that was not previously acknowledged.