The advancement of deep learning techniques has significantly propelled the development of the continuous sign language recognition (cSLR) task. However, the spatial feature extraction of sign language videos in the RGB space tends to focus on the overall image information while neglecting the perception of traits at different granularities, such as eye gaze and lip shape, which are more detailed, or posture and gestures, which are more macroscopic. Exploring the efficient fusion of visual information of different granularities is crucial for accurate sign language recognition. In addition, applying a vanilla Transformer to sequence modeling in cSLR exhibits weak performance because specific video frames could interfere with the attention mechanism. These limitations constrain the capability to understand potential semantic characteristics. We introduce a feature fusion method for integrating visual features of disparate granularities and refine the metric of attention to enhance the Transformer’s comprehension of video content. Specifically, we extract CNN feature maps with varying receptive fields and employ a self-attention mechanism to fuse feature maps of different granularities, thereby obtaining multi-scale spatial features of the sign language framework. As for video modeling, we first analyze why the vanilla Transformer failed in cSLR and observe that the magnitude of the feature vectors of video frames could interfere with the distribution of attention weights. Therefore, we utilize the Euclidean distance among vectors to measure the attention weights instead of scaled-dot to enhance dynamic temporal modeling capabilities. Finally, we integrate the two components to construct the model MSF-ET (Multi-Scaled feature Fusion–Euclidean Transformer) for cSLR and train the model end-to-end. We perform experiments on large-scale cSLR benchmarks—PHOENIX-2014 and Chinese Sign Language (CSL)—to validate the effectiveness.