Abstract Visual vibration measurement has emerged in the field of structural health monitoring in recent years, but it still has shortcomings in resolution, recognition rate, and real-time performance. To address these three issues by recovering high-frequency image details, tightening the target bounding box, and reducing computation time, we use an image super-resolution reconstruction model and a target detection model to measure the vibration displacement of a bridge structural model. First, we integrate the Transformer module into the structurally simple Unet network. The resulting Swin and Global Transformer Unet (SGTU) module reduces the computational cost of reconstructing large-resolution feature maps of the target, and it sharpens the edge information of the vibrating target. Second, we use the framework of the YOLOv5 algorithm as the backbone and adopt the GhostBottleneck (GB) module to reduce the time that convolution operations spend generating similar features. In addition, the proposed DWCBottleneck (DWCB) fusion module achieves high-level semantic fusion and network depth expansion at minimal computational cost. Finally, the center point offset of the bounding box predicted by the model is used to obtain the displacement of the object across the image sequence. The position of the target in the first frame serves as the reference for calculating the offset, and the vibration displacement of the flexible structure in the image coordinate system is obtained as the deviation of the target position in each remaining frame from its position in the first frame. Using a homemade vibration image dataset, we perform qualitative and quantitative comparisons in three aspects: video super-resolution reconstruction, visual detection robustness, and vibration displacement measurement against sensor readings. 
The time-frequency domain displacement curves regressed by the visual vibration measurement algorithm are compared with the curves acquired by an accelerometer, and the comparison indicates the necessity of super-resolution reconstruction in visual vibration measurement.
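
The displacement step described above, in which the bounding-box center in each frame is compared with the center in the first frame, can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `box_center` and `displacement_from_boxes` are hypothetical, and the box format `(x_min, y_min, x_max, y_max)` in pixels is an assumption. The result is expressed in the image coordinate system, so no pixel-to-physical scaling is applied.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point (cx, cy) of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def displacement_from_boxes(boxes: Sequence[Box]) -> List[Tuple[float, float]]:
    """Return per-frame (dx, dy) offsets of the box center.

    The first frame is the reference, so its offset is (0.0, 0.0).
    Offsets are in pixels, in the image coordinate system.
    """
    if not boxes:
        return []
    ref_x, ref_y = box_center(boxes[0])
    return [(cx - ref_x, cy - ref_y) for cx, cy in map(box_center, boxes)]


if __name__ == "__main__":
    # Hypothetical detections for three consecutive frames.
    frames = [
        (10.0, 20.0, 30.0, 40.0),
        (12.0, 20.0, 32.0, 40.0),
        (9.0, 21.0, 29.0, 41.0),
    ]
    print(displacement_from_boxes(frames))
```

Because only the box center enters the calculation, a tighter bounding box translates directly into a less noisy displacement curve, which is why the compactness of the predicted box matters for this measurement.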