This paper presents a novel two-stage neural network for quantifying the full-field displacements of vibrating structures. Graphics software was used to simulate structural vibrations with random displacement fields and to generate image datasets through batch rendering. A two-stage neural network model was then constructed, comprising an object-segmentation subnetwork that extracts structural shapes and an optical-flow subnetwork that identifies displacement fields. A coordinate convolution module was embedded in the optical-flow estimation model, and the two subnetworks were integrated into a single model. An experiment was designed to monitor the vibration displacements of a cantilever column. The results showed that the proposed model achieved optimal recognition accuracy, with a mean absolute error of 0.1477 pixels and a root-mean-squared error of 0.2335 pixels. It was also robust to various lighting conditions. Furthermore, case studies on real-world structural vibration videos substantiated the practical engineering value of the proposed model.
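As a rough illustration of two ideas named above, the following NumPy sketch shows (i) how a coordinate convolution layer typically prepares its input, by appending normalized x and y coordinate channels to a feature map, and (ii) how a mean absolute error and root-mean-squared error in pixels could be computed for a displacement field restricted to a segmentation mask. The function names `add_coord_channels` and `masked_displacement_errors`, the array layouts, the [-1, 1] coordinate normalization, and the choice to average errors over both displacement components are assumptions made for this example; they are not taken from the proposed model.

```python
import numpy as np


def add_coord_channels(features):
    """Append normalized x and y coordinate channels to a (C, H, W) array.

    Assumption: coordinates are scaled to [-1, 1]; the paper's module may
    use a different normalization or additional channels.
    """
    _, h, w = features.shape
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    # yy varies along rows, xx varies along columns; both have shape (H, W).
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.concatenate([features, xx[None], yy[None]], axis=0)


def masked_displacement_errors(pred, truth, mask):
    """Return (MAE, RMSE) in pixels over the pixels where mask is True.

    pred, truth: (2, H, W) displacement fields (horizontal, vertical).
    mask: (H, W) boolean array, e.g. the segmented structure.
    Assumption: errors are averaged over both displacement components.
    """
    diff = (pred - truth)[:, mask]  # shape (2, N) for N masked pixels
    mae = np.abs(diff).mean()
    rmse = np.sqrt((diff ** 2).mean())
    return mae, rmse
```

In this sketch, the mask plays the role of the segmentation subnetwork's output: only pixels belonging to the structure contribute to the error statistics, so background pixels do not dilute the reported accuracy.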