Continuous sign language recognition (CSLR) is essential for the social participation of deaf individuals. The structural information of sign language motion units plays a crucial role in semantic representation. However, most existing CSLR methods treat motion units as an undifferentiated whole appearance in the video sequence, leaving the structural information unexploited and unexplained within the models. This paper proposes a Structure-Aware Graph Convolutional Neural Network (SA-GNN) model for CSLR. The model constructs a spatial–temporal scene graph that explicitly captures the spatial structure and temporal variation of motion units. Furthermore, to effectively train the SA-GNN, we propose an adaptive bootstrap strategy that enhances weak supervision with dense pseudo labels. This strategy incorporates a confidence cross-entropy loss to adaptively adjust the distribution of pseudo labels. Extensive experiments validate the effectiveness of our proposed method, which achieves competitive results on popular CSLR datasets.