Accurate modeling of respiratory motion in medical images is crucial for applications such as radiation therapy planning. However, existing registration methods often struggle to extract local features effectively, which limits their performance. In this paper, we propose CvTMorph, a hybrid registration framework that combines a Convolutional vision Transformer (CvT) with Convolutional Neural Networks (CNNs) to improve local feature extraction. Scaling and squaring layers are added to further enhance registration performance. We evaluated CvTMorph on the 4D-Lung and DIR-Lab datasets and compared it with state-of-the-art methods. The experimental results show that CvTMorph outperforms existing methods in accuracy and robustness for respiratory motion modeling in 4D images, and that incorporating the CvT significantly improves registration performance and the representation of local structures. CvTMorph therefore offers a promising solution for accurately modeling respiratory motion in 4D medical images, with potential for applications such as radiation therapy planning, and provides a basis for further research in this field.
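
To illustrate the scaling and squaring step mentioned above, the following is a minimal sketch that assumes the term refers to the standard technique of integrating a stationary velocity field into a displacement field by repeated self-composition. It is a 1D NumPy toy example, not the CvTMorph implementation; the function name `integrate_velocity_1d`, the `steps` parameter, and the use of linear interpolation with border clamping are all assumptions made for illustration.

```python
import numpy as np

def integrate_velocity_1d(velocity, steps=7):
    """Integrate a stationary 1D velocity field into a displacement field
    by scaling and squaring (illustrative sketch only)."""
    velocity = np.asarray(velocity, dtype=float)
    grid = np.arange(velocity.size, dtype=float)
    # Scaling: shrink the field so that a single step is a small deformation.
    disp = velocity / (2 ** steps)
    # Squaring: compose the deformation with itself `steps` times.
    for _ in range(steps):
        # u(x) <- u(x) + u(x + u(x)); np.interp clamps samples at the borders.
        disp = disp + np.interp(grid + disp, grid, disp)
    return disp

# Hypothetical usage: a constant velocity yields the same constant displacement.
displacement = integrate_velocity_1d(np.full(16, 0.5), steps=7)
```

In a registration network the same recurrence would typically be applied to a 3D field with a differentiable resampler, so that the predicted deformation remains smooth and approximately invertible.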