In real-world listening situations, individuals typically utilize head and eye movements to receive and enhance sensory information while exploring acoustic scenes. However, the specific patterns of such movements have not yet been fully characterized. Here, we studied how movement behavior is influenced by scene complexity, varied in terms of reverberation and the number of concurrent talkers. Thirteen normal-hearing participants engaged in a speech comprehension and localization task, requiring them to indicate the spatial location of a spoken story in the presence of other stories in virtual audio-visual scenes. We observed delayed initial head movements when more simultaneous talkers were present in the scene. Both reverberation and a higher number of talkers extended the search period, increased the number of fixated source locations, and resulted in more gaze jumps. The period preceding the participants’ responses was prolonged when more concurrent talkers were present, and listeners continued to move their eyes in the proximity of the target talker. In scenes with more reverberation, the final head position when making the decision was farther away from the target. These findings demonstrate that the complexity of the acoustic scene influences listener behavior during speech comprehension and localization in audio-visual scenes.