The inputs delivered to different sensory organs provide complementary information about the environment. Our recent study demonstrated that presenting abstract visual information derived from the speech envelope substantially improves speech perception in normal-hearing (NH) listeners [Yuan et al., J. Acoust. Soc. Am. (2020)]. The purpose of the present study was to extend this audiovisual benefit to the tactile domain. Twenty adults completed sentence recognition threshold measurements in four sensory modalities (AO: audio-only; AV: audio-visual; AT: audio-tactile; and AVT: audio-visual-tactile). The target sentence [CRM speech corpus, Bolia et al., J. Acoust. Soc. Am. (2000)] level was fixed at 60 dBA, and the masker (speech-shaped noise) level was varied adaptively to find the masked threshold. The amplitudes of the visual and vibrotactile stimuli were either temporally synchronized with the target speech envelope or desynchronized from it, for comparison. Results show that temporally coherent multimodal stimulation (AV, AT, and AVT) significantly improves speech perception relative to AO stimulation. These multisensory benefits were reduced when the cross-modal temporal coherence was eliminated. These findings suggest that multisensory interactions are fundamentally important for speech perception in NH listeners, and that the outcome of multisensory speech processing depends strongly on the temporal coherence between the multimodal sensory inputs.