Drawing on Mental Spaces Theory and Conceptual Blending Theory in Cognitive Linguistics, we examine the interaction between speech and co-speech gestures in The Daily Show with Trevor Noah. In our self-built multimodal corpus, viewpoint shift generally follows the pattern Base Space -> News Narrative Space (-> Base Space) -> Source Viewpoint Space(+) (-> Base Space), and statistical analysis shows that it is significantly related to the use of verbal markers and gesture types. Moreover, multimodal viewpoint shift in political talk shows serves rhetorical functions such as enhancing ironic effects, solidifying and highlighting opposing positions, and simplifying political issues. Since verbal markers and gestures can mobilize the audience’s embodied experience, primarily by activating mental images and motor programs, we claim that mental simulation and perspective taking play a pivotal role in the cognitive processing of multimodal viewpoint shifts by promoting viewpoint alignment between the audience and the host.