Computer vision-based methods for measuring the vibration displacement of civil structures have emerged as useful tools in recent years. These methods offer several benefits, including non-contact measurement, cost-effectiveness, and the ability to capture full-field displacement. Yet certain challenges remain. Measuring vibration displacement in 3D typically requires multiple cameras, which complicates the camera configuration. Moreover, existing methods rely heavily on physical markers or natural key points. Placing physical markers on structures is often impractical, and natural key points are difficult to detect on structures with few distinct features or during rapid movements. In contrast to previous approaches, this paper presents a novel technique that uses a monocular camera for 3D displacement measurement. The technique requires neither physical markers nor natural key points, which represents a significant advancement. Central to the method is a deep neural network that takes a single image together with an initial 3D cube mesh as input and predicts the 3D mesh deformation directly. A synthetic 3D dataset is generated to train the network. When testing on real structures, an advanced video segmentation method is used to remove the background and thereby improve prediction accuracy. The practical efficacy of the method is validated through a series of laboratory experiments on beam structures, which yield reliable results and demonstrate its potential for practical application.
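The abstract does not state how displacement values are read off the predicted meshes. The following is a minimal sketch under stated assumptions, not the authors' procedure. It assumes that every frame's prediction is a deformation of the same initial cube mesh, so vertex `i` refers to the same material point in all frames. It also assumes that the mesh coordinates are already expressed in physical units. Under these assumptions, the 3D displacement history of a chosen point is its vertex position in each frame minus its position in a reference frame. The function names `vertex_displacements` and `peak_displacement` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def vertex_displacements(
    meshes: Sequence[Sequence[Vec3]], vertex_ids: Sequence[int]
) -> List[List[Vec3]]:
    """Per-frame 3D displacement of selected vertices relative to frame 0.

    Assumption (hypothetical): all meshes are deformations of the same initial
    cube mesh, so vertex ordering is fixed and vertex i is the same point in
    every frame; coordinates are assumed to be in physical units.
    """
    if not meshes:
        return []
    reference = meshes[0]
    n_vertices = len(reference)
    history: List[List[Vec3]] = []
    for frame in meshes:
        # A changed vertex count would break the vertex correspondence assumption.
        if len(frame) != n_vertices:
            raise ValueError("all meshes must share the initial mesh's vertex count")
        row: List[Vec3] = []
        for i in vertex_ids:
            x, y, z = frame[i]
            x0, y0, z0 = reference[i]
            row.append((x - x0, y - y0, z - z0))
        history.append(row)
    return history


def peak_displacement(track: Sequence[Vec3], axis: int) -> float:
    """Largest absolute displacement of one vertex along axis 0 (x), 1 (y) or 2 (z)."""
    return max(abs(d[axis]) for d in track)


if __name__ == "__main__":
    # Toy two-vertex "meshes" over three frames; vertex 1 vibrates along z.
    meshes = [
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.002)],
        [(0.0, 0.0, 0.0), (1.0, 0.0, -0.003)],
    ]
    disp = vertex_displacements(meshes, [1])
    track = [row[0] for row in disp]  # history of vertex 1
    print(track)
    print(peak_displacement(track, axis=2))
```

Using a fixed-topology template mesh is what makes this per-vertex subtraction meaningful. Without a stable vertex correspondence across frames, a separate matching step would be needed before displacements could be computed.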