Traditional visual vibration measurement methods are limited in dynamic outdoor environments, which makes accurate multi-point vibration measurement of target objects difficult. This paper proposes a visual vibration measurement algorithm for large-span bridges that anthropomorphizes the bridge target and introduces key point detection. First, the algorithm uses a convolutional neural network to extract multi-scale feature information from the image sequence of the bridge model. Second, to improve target detection accuracy, a coordinate attention (CA) mechanism is incorporated, and the shape intersection over union (SIoU) loss function replaces the original loss function. Finally, the algorithm combines density-based spatial clustering of applications with noise (DBSCAN) with tracking algorithms to localize the bridge targets precisely. Vibration displacement data were extracted from a large-span suspension bridge structure in an outdoor environment and analyzed both quantitatively and qualitatively. The vibration displacement curves obtained in this study agree most closely with the standard displacement signals, with a mean absolute percentage error of 0.5170% for the cable-stayed bridge model, 1.3822% for the Humen Bridge, and 3.2263% and 1.6982% for the Longjiang Bridge.
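The agreement figures above are reported as mean absolute percentage error (MAPE). As a minimal sketch of how such a figure could be computed, the following assumes the textbook definition, MAPE = (100/N) Σ |(r_k − m_k) / r_k|, where r_k is the standard displacement and m_k the visually measured displacement at sample k. The function name `mape`, the `eps` threshold for skipping near-zero reference samples, and the synthetic signals are illustrative assumptions and are not taken from the paper, whose exact normalization may differ.

```python
import math


def mape(reference, measured, eps=1e-12):
    """Mean absolute percentage error in percent (textbook definition).

    Samples whose reference magnitude is at most `eps` are skipped to avoid
    division by zero; this handling is an assumption, not the paper's rule.
    """
    if len(reference) != len(measured):
        raise ValueError("signals must have the same length")
    terms = [
        abs((r - m) / r)
        for r, m in zip(reference, measured)
        if abs(r) > eps
    ]
    if not terms:
        raise ValueError("no reference samples with magnitude above eps")
    return 100.0 * sum(terms) / len(terms)


# Synthetic data (hypothetical): a 0.5 Hz oscillation sampled at 30 frames/s,
# offset so the reference displacement never crosses zero.
reference = [10.0 + 2.0 * math.sin(2 * math.pi * 0.5 * k / 30.0) for k in range(300)]
# Hypothetical measurement with a uniform 1% scale error.
measured = [1.01 * r for r in reference]

error_percent = mape(reference, measured)
print(f"MAPE = {error_percent:.4f}%")
```

Because MAPE divides by the reference value, it is sensitive to samples where the reference displacement is close to zero; the sketch simply skips those samples, which is one of several possible conventions.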