In emerging technology scenarios, virtual try-on aims to integrate clothing onto body images naturally, enhancing the shopping experience by simulating the true effect of wearing the clothes. As image resolution increases, we expect to improve the consistency of the result, i.e., to ensure that the various elements of the image are harmonized in color, shading, style, and texture so as to achieve a natural visual effect. Many studies based on Generative Adversarial Networks (GANs) struggle with such consistency: they have difficulty accurately depicting the fabric of target garments, as well as natural shadows and folds, and sometimes exhibit visual discontinuities or inconsistencies. We therefore propose CSD-VTON, a new approach based on a latent diffusion model. Considering that the computational primitives of a traditional UNet struggle to capture complex pixel-level transformation relationships, we address this issue by concatenating the warped cloth images generated by the warping module with the noise image. Additionally, a cascade feature extraction module is introduced to extract in-store garment features, ensuring that the texture and details of the target garments are preserved. Finally, we incorporate a skip-connection supplementary module to compensate for reconstruction error. Experiments on the DressCode and VITON-HD datasets demonstrate the effectiveness and superiority of our approach.
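As a rough illustration (not the authors' implementation), the two tensor-level operations the abstract mentions — concatenating a warped-cloth latent with the noise latent before the UNet, and a skip-connection that adds an encoder feature back into the decoder — can be sketched in NumPy. All array names, shapes, and channel counts below are hypothetical assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes: (channels, height, width) in a downsampled latent space.
noise_latent = rng.standard_normal((4, 64, 48))   # noisy latent being denoised
warped_cloth = rng.standard_normal((4, 64, 48))   # latent of the warped garment

# Concatenate along the channel axis so the denoising network sees the warped
# cloth alongside the noise at every spatial location (8 input channels total).
unet_input = np.concatenate([noise_latent, warped_cloth], axis=0)

# Skip-connection supplement: add an encoder feature map to a decoder feature
# map of the same shape, compensating for reconstruction error.
encoder_feat = rng.standard_normal((4, 64, 48))
decoder_feat = rng.standard_normal((4, 64, 48))
supplemented = decoder_feat + encoder_feat
```

In a real latent-diffusion try-on model these operations would act on learned VAE latents and intermediate UNet activations rather than random arrays; the sketch only shows the shape bookkeeping.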