Ultrasonic wave propagation imaging enables the detection of anomalies in various structures and has therefore emerged as a promising technique for damage identification in structural health monitoring (SHM). The interpretation of imaging data is vital to SHM; however, it relies heavily on subjective expert judgment, leaving the results vulnerable to human error. Recent advances in computer vision driven by the adoption of deep neural networks have opened new possibilities for replacing humans in laborious data interpretation tasks. This paper presents an effective learning architecture that characterizes ultrasonic wave propagation videos for automatic non-destructive inspection. Our main contributions are threefold: (1) To the best of our knowledge, this is the first study to leverage video content analysis techniques to exploit ultrasonic wave propagation image series; previous approaches that focus on still wavefield images are likely to lose critical temporal information, resulting in inferior performance. (2) We devise a model that progressively aggregates both the temporal and spatial information encoded in multiple adjacent snapshots of ultrasonic wave propagation for efficient data analysis, and we present the details of the system implementation and critical parameter settings. (3) We validate the proposed approach through extensive experimental comparisons with other state-of-the-art computer vision techniques on a publicly available real-world dataset. We hope that this study will encourage further investigations into video-based non-destructive data interpretation, beyond ultrasonic signals alone.
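
The exact network is specified later in the paper; purely as an illustration of the spatio-temporal aggregation idea in contribution (2), a minimal PyTorch sketch that mixes information across adjacent wavefield snapshots might look as follows. The `WavefieldVideoNet` name, all layer sizes, the frame count, and the two-class output are assumptions for illustration, not the authors' actual model.

```python
import torch
import torch.nn as nn


class WavefieldVideoNet(nn.Module):
    """Illustrative sketch (not the paper's model): progressively aggregates
    spatial and temporal information from a stack of adjacent wavefield snapshots."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # 3D convolutions mix information across time (adjacent snapshots)
        # and space (pixel neighbourhoods) simultaneously.
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # downsample space only
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),   # downsample time and space
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),               # global spatio-temporal pooling
            nn.Flatten(),
            nn.Linear(32, num_classes),            # e.g. damaged vs. intact
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 1, frames, height, width) stack of wavefield snapshots
        return self.head(self.features(clip))


# Example: classify a batch of 4 clips, each with 8 snapshots of 64x64 pixels.
logits = WavefieldVideoNet()(torch.randn(4, 1, 8, 64, 64))
```

A clip-level model of this kind sees several consecutive snapshots at once, which is what allows it to retain the temporal information that single still-image classifiers discard.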