Abstract

Learning the meaning of a word is difficult because many possible referents are present in the environment. Visual cues such as gestures frequently accompany speech and have the potential to reduce referential uncertainty and promote learning, but the dynamics of how pointing cues and speech are integrated are not yet known. If word learning is influenced by when, as well as whether, a learner is directed correctly to a target, this would suggest that the temporal integration of visual and speech information can affect the strength of association of word–referent mappings. Across two pre-registered studies, we tested the conditions under which pointing cues promote learning. In a cross-situational word learning paradigm, we showed that the benefit of a pointing cue was greatest when the cue preceded the speech label rather than followed it (Study 1). An eye-tracking study (Study 2) showed that this early-cue advantage arose because participants’ attention was directed to the referent during label utterance, and that the advantage was apparent even at initial exposures to word–referent pairs. Pointing cues promote time-coupled integration of visual and auditory information that aids encoding of word–referent pairs, demonstrating the cognitive benefits of pointing cues that occur prior to speech.