Video-based virtual try-on has attracted unprecedented attention owing to the growth of e-commerce. However, the problem is highly challenging because of the arbitrary poses of persons and the demand for temporal consistency across frames, particularly when synthesizing high-quality virtual try-on videos from single images. Specifically, there are two key challenges: 1) existing video-based virtual try-on methods rely on generative adversarial networks (GANs), which suffer from unstable training and a lack of realism in generated details; and 2) stronger constraints must be explicitly imposed on the generated frames to increase the temporal coherence of the synthesized videos. To address these challenges, this study proposes a novel framework, the Extended Markov Chain Based Denoising Diffusion Generative Adversarial Network (EMC-DDGAN), derived from the denoising diffusion GAN, a diffusion model with efficient sampling. Moreover, we propose an extended Markov chain that employs the diffusion model to synthesize frames via sequential generation. With a carefully designed network and learning objectives, the proposed approach achieves outstanding performance on public datasets. Rigorous experiments demonstrate that EMC-DDGAN synthesizes higher-quality videos than other state-of-the-art methods and validate the effectiveness of the proposed approach.
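To make the sequential-generation idea concrete, below is a minimal Python (PyTorch) sketch of how an extended-Markov-chain sampler might work: each frame is produced by a few-step denoising-diffusion-GAN reverse process that is conditioned on the previously generated frame. The generator interface, the noise schedule, and the re-noising rule here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch

# Hypothetical denoiser-generator: predicts a clean frame x_0 from a noisy
# frame x_t at step t, conditioned on the person image, the garment image,
# and the previously generated frame (the "extended Markov chain" link).
# Its signature is an assumption for illustration only.
def sample_video(generator, person, garment, num_frames, num_steps, shape, device="cpu"):
    frames = []
    prev_frame = torch.zeros(shape, device=device)   # no predecessor for frame 0
    for _ in range(num_frames):
        x = torch.randn(shape, device=device)        # start each frame from pure noise
        for t in reversed(range(num_steps)):         # few-step DDGAN-style sampling
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
            # The generator directly proposes a denoised sample (adversarially
            # trained per step, as in denoising diffusion GANs), conditioned on
            # the previous frame to encourage temporal coherence.
            x0_pred = generator(x, t_batch, person, garment, prev_frame)
            if t > 0:
                # Re-noise the prediction to the next, less noisy step; the
                # posterior-sampling details are simplified to a placeholder.
                noise = torch.randn_like(x)
                alpha = 1.0 - t / num_steps          # placeholder schedule
                x = alpha * x0_pred + (1.0 - alpha) ** 0.5 * noise
            else:
                x = x0_pred
        frames.append(x)
        prev_frame = x   # extend the Markov chain to the next frame
    return torch.stack(frames, dim=1)                # (batch, frames, C, H, W)
```

The key design choice this sketch highlights is that the chain runs over frames as well as over diffusion steps: conditioning each frame's sampling on its predecessor is what ties consecutive outputs together, in contrast to generating every frame independently from noise.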