Traditional models for pose estimation in video surveillance are based on graph structures. In this paper, we propose a method that overcomes the limitation of template matching to a narrow range of pose changes and obtains robust results. We implement our swimmer pose estimation method with deep learning, using a High-Resolution Network (HRNet) to extract and fuse visual features of the target and to detect the key points of the human joints. With appropriate training, the proposed model can be applied to all kinds of swimming styles. Compared with methods that require combining and training multiple models, the proposed method directly achieves end-to-end prediction, which makes it easy to implement and deploy. In addition, a cross-fusion module is added between the parallel sub-networks, which helps the network exploit features at multiple resolutions. The proposed network achieves strong results on swimmer pose estimation, as shown by comparing the HRNet-W32 and HRNet-W48 configurations. In addition, we propose an annotated key-point dataset of swimmers captured from an underwater view. Compared with a side view, the underwater view captures the swimmer's torso more completely, making the data better suited to a broad spectrum of machine vision tasks.
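To illustrate the idea of cross-fusion between parallel branches of different resolutions, the following is a minimal sketch, not the paper's exact implementation: one high-resolution and one low-resolution branch exchange features, with the low-resolution features upsampled via bilinear interpolation and the high-resolution features downsampled via a strided convolution before being summed. The channel widths (32/64), module name, and two-branch setup are illustrative assumptions.

```python
# Minimal HRNet-style cross-resolution fusion sketch (assumed, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusion2Branch(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # 1x1 conv to match channels before upsampling low -> high
        self.low_to_high = nn.Sequential(
            nn.Conv2d(c_low, c_high, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_high),
        )
        # strided 3x3 conv to downsample high -> low
        self.high_to_low = nn.Sequential(
            nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_low),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_high, x_low):
        # Upsample low-resolution features and add them to the high-resolution branch.
        up = F.interpolate(self.low_to_high(x_low), size=x_high.shape[2:],
                           mode="bilinear", align_corners=False)
        # Downsample high-resolution features and add them to the low-resolution branch.
        down = self.high_to_low(x_high)
        return self.relu(x_high + up), self.relu(x_low + down)

# Example: fuse a 64x64 high-resolution map with a 32x32 low-resolution map.
fuse = CrossFusion2Branch()
h, l = fuse(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
```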