Sign Language Recognition (SLR) systems are crucial bridges facilitating communication between deaf or hard-of-hearing individuals and the hearing world. Existing SLR technologies, while advancing, often struggle to capture the dynamic and complex nature of sign language, which comprises both manual and non-manual elements such as facial expressions and body movements. These systems also tend to fall short in environments with varied backgrounds or lighting conditions, limiting their practical applicability and robustness. This study introduces an approach to isolated sign language word recognition using a novel deep learning model that combines the strengths of residual three-dimensional (R3D) and spatiotemporally factorized (R(2+1)D) convolutional blocks. The R3(2+1)D-SLR network model demonstrates a superior ability to capture the intricate spatial and temporal features crucial for accurate sign recognition. Our system fuses features from the signer's body, hands, and face, extracted with the R3(2+1)D-SLR model, and employs a Support Vector Machine (SVM) for classification. By operating on pose data rather than RGB data, it achieves marked improvements in accuracy and robustness across varied backgrounds. With this pose-based approach, the proposed system achieved 94.52% and 98.53% test accuracy in signer-independent evaluations on the BosphorusSign22k-general and LSA64 datasets, respectively.
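
For readers unfamiliar with the (2+1)D factorization referenced above, the following is a minimal PyTorch sketch of one residual (2+1)D block, in which a 3D convolution (t x h x w) is decomposed into a spatial (1 x h x w) convolution followed by a temporal (t x 1 x 1) convolution with a nonlinearity in between. The class name `R2Plus1DBlock`, the channel widths, and the 1x1x1 residual projection are illustrative assumptions, not the exact R3(2+1)D-SLR layer configuration.

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """Illustrative residual block with a (2+1)D factorized convolution:
    a spatial (1x3x3) conv followed by a temporal (3x1x1) conv, with a
    nonlinearity in between and a shortcut connection around both."""
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        # Intermediate channel count; illustrative default. In the standard
        # R(2+1)D design it is chosen so the factorized block has roughly
        # the same parameter count as a full 3x3x3 convolution.
        mid_ch = mid_ch or out_ch
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.bn1 = nn.BatchNorm3d(mid_ch)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1x1 projection on the shortcut path when channel counts differ.
        self.proj = (nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        out = self.relu(self.bn1(self.spatial(x)))
        out = self.bn2(self.temporal(out))
        return self.relu(out + self.proj(x))

# Example: a batch of two 16-frame clips at 112x112 resolution.
clip = torch.randn(2, 3, 16, 112, 112)
print(R2Plus1DBlock(3, 64)(clip).shape)  # torch.Size([2, 64, 16, 112, 112])
```

The factorization doubles the number of nonlinearities per block relative to a single 3D convolution and eases optimization, which is one motivation for combining R(2+1)D blocks with plain R3D blocks as the abstract describes.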