How do we perceive multisensory rhythms in a dynamic environment, such as a ballet, where we experience the rhythm of music accompanied by the rhythm of dancers’ movements? Multisensory perception has often been examined in scenarios where the observed action produces the sounds, e.g., drumming movements paired with impact sounds. However, little is known about how the brain combines coordinated audiovisual information when the two streams are not contingent upon each other, as in dancing. The two studies reported here investigated the mechanisms used to integrate audiovisual temporal information in such a rhythmic context. The visual stream consisted of a point-light figure (PLF) that moved (bounced) periodically to the beat of the auditory rhythms, resembling the dance scenario.

In the first study, participants judged synchrony between a bouncing PLF and a simple auditory rhythm in a synchrony judgment (SJ) task. The trajectory of the PLF was manipulated to follow one of two naturalistic motion profiles, one of human bouncing and the other of ball bouncing. Although both profiles were rhythmic and followed the same path, they differed in when peak velocity occurred. Participants were asked to judge audiovisual synchrony with regard to the same spatial position of the movement in both conditions. The point of subjective simultaneity (PSS) differed between the two visual conditions, reflecting the difference in the timing of peak velocity. This result shows that synchrony perception was implicitly influenced by the spatiotemporal (i.e., velocity) cue in the visual movement. Specifically, peak velocity in the trajectory was taken as the visual reference for the task, with which the auditory beat should coincide.
This parallels previous findings that velocity cues define the beat of visual biological motion, suggesting that audiovisual synchrony perception involving rhythmic, naturalistic movements relies on the perceived visual beat.

The second study investigated beat perception of concurrent auditory and visual rhythms, specifically whether a visual beat conveyed by the bouncing PLF modulated auditory rhythm perception. A rhythm reproduction task and a rhythm perception task examined the effect of auditory, visual, and audiovisual (bimodal) beat induction on the perception of metrically complex auditory rhythms. While it proved difficult to improve the perception of these rhythms with an explicit beat in either or both modalities, likely due to syncopation, a bimodal beat led to greater on-beat than off-beat sensitivity to a temporal deviant in the rhythm, consistent with entrainment theory. Moreover, the PLF movement had more influence than the concurrent auditory beat in this process, suggesting that a rhythmic humanlike movement can serve as an effective visual beat that modulates auditory rhythm perception.

Both studies demonstrate that the perceptual system extracts a visual beat from observed rhythmic movements to form a coherent percept with auditory rhythms. The critical feature of a visual beat for inter-sensory interaction may lie in the velocity profile of naturalistic motion. The present findings also suggest the possibility of cross-modal (visual to auditory) rhythm and beat perception, at least when appropriate visual movement stimuli are involved.