Temporal structures in the environment can shape temporal expectations (TE); and previous studies demonstrated that TEs interact with multisensory interplay (MSI) when multisensory stimuli are presented synchronously. Here, we tested whether other types of MSI – evoked by asynchronous yet temporally flanking irrelevant stimuli – result in similar performance patterns. To this end, we presented sequences of 12 stimuli (10 Hz) which consisted of auditory (A), visual (V) or alternating auditory-visual stimuli (e.g. A-V-A-V-…) with either auditory or visual targets (Exp. 1). Participants discriminated target frequencies (auditory pitch or visual spatial frequency) embedded in these sequences. To test effects of TE, the proportion of early and late temporal target positions was manipulated run-wise. Performance for unisensory targets was affected by temporally flanking distractors, with auditory temporal flankers selectively improving visual target perception (Exp. 1). However, no effect of temporal expectation was observed. Control experiments (Exp. 2–3) tested whether this lack of TE effect was due to the higher presentation frequency in Exp. 1 relative to previous experiments. Importantly, even at higher stimulation frequencies redundant multisensory targets (Exp. 2–3) reliably modulated TEs. Together, our results indicate that visual target detection was enhanced by MSI. However, this cross-modal enhancement – in contrast to the redundant target effect – was still insufficient to generate TEs. We posit that unisensory target representations were either instable or insufficient for the generation of TEs while less demanding MSI still occurred; highlighting the need for robust stimulus representations when generating temporal expectations.