Grounded in the theory of embodied cognition, which posits tight interactions among perception, action, and cognition, this study tested the hypothesis that Mandarin lexical tone perception at altered speech rates, in both quiet and noisy environments, is affected by dynamic bodily cross-modal information. Fifty-three adult listeners completed a Mandarin tone perception task with 720 tone stimuli presented in auditory-only (AO), auditory-facial (AF), and auditory-facial-plus-gestural (AFG) modalities, at fast, normal, and slow speech rates, under quiet and noisy conditions. In the AF and AFG modalities, both congruent and incongruent audiovisual stimuli were presented. Generalized linear mixed-effects models were constructed to analyze tone perception accuracy across conditions. In the adverse context of noise, the enhancement afforded by AF and AFG cues across the three speech rates was significantly greater than that of the AO cue, whereas the additional metaphoric gestures did not differ significantly from facial information alone. Furthermore, in quiet, auditory tone perception at the fast speech rate was significantly better than at the normal rate when the auditory and visual inputs were incongruent. These results provide compelling evidence that integrated audiovisual information plays a vital role not only in improving lexical tone perception in noise but also in modulating the effects of speech rate on Mandarin tone perception in quiet for native listeners. Our findings support the theory of embodied cognition and have implications for speech and hearing rehabilitation in both young and older clinical populations.
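As a purely illustrative aside, the sketch below shows one way such a trial-level generalized linear mixed-effects analysis could be specified, here using the Bayesian binomial mixed GLM available in Python's statsmodels; the data file and column names (accuracy, modality, rate, noise, subject, item) are hypothetical assumptions for illustration, not the authors' actual analysis pipeline.

```python
# Illustrative sketch only: a binomial mixed-effects model of trial-level
# tone-identification accuracy. File and column names are hypothetical.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# One row per trial; 'accuracy' coded 1 (correct) / 0 (incorrect).
data = pd.read_csv("tone_perception_trials.csv")

# Fixed effects: modality (AO/AF/AFG), speech rate (fast/normal/slow),
# and listening condition (quiet/noise), with their interactions.
# Variance components: random intercepts for listeners and tone stimuli.
model = BinomialBayesMixedGLM.from_formula(
    "accuracy ~ C(modality) * C(rate) * C(noise)",
    {"subject": "0 + C(subject)", "item": "0 + C(item)"},
    data,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())
```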