According to the World Health Organization (WHO), 5% of people around the world have hearing disabilities, which limits their capacity to communicate with others. Recently, scientists have proposed deep learning-based systems for sign language-to-text translation in the hope of helping deaf people communicate; however, the performance of such systems is still too low for practical scenarios. Furthermore, the proposed systems are language-specific, so the signs of each language pose their own recognition challenges. To address this problem, in this paper we propose a system based on a Recurrent Neural Network (RNN), focused on Mexican Sign Language (MSL), that uses the spatial tracking of hands and facial expressions to predict the word that a person intends to communicate. To achieve this, we trained four RNN-based models on a dataset of 600 clips, each 30 s long, with 30 clips per word. We conducted two experiments: the first was tailored to determine the model best suited to the target application and to measure the accuracy of the resulting system in offline mode; the second measured the accuracy of the system in online mode. We assessed the system’s performance using precision, recall, the F1-score, and the number of errors in online scenarios. The results indicate an accuracy of 0.93 in offline mode and higher performance in online mode compared with previously proposed approaches. These results underscore the potential of the proposed scheme in scenarios such as teaching, learning, commercial transactions, and daily communication between deaf and non-deaf people.
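
To illustrate the kind of pipeline the abstract describes, the following is a minimal sketch of a recurrent word classifier over per-frame hand/face keypoint vectors. The sequence length, feature dimension, layer sizes, and 20-word vocabulary (600 clips / 30 clips per word) are assumptions for illustration only; the paper's actual architecture and preprocessing are not specified here.

```python
# Illustrative sketch only: all shapes and hyperparameters below are assumed,
# not taken from the paper.
import numpy as np
import tensorflow as tf

SEQ_LEN = 30        # assumed number of sampled frames per clip
NUM_FEATURES = 126  # assumed flattened hand keypoints (2 hands x 21 points x 3 coords)
NUM_WORDS = 20      # 600 clips with 30 clips per word implies 20 word classes

# A simple recurrent classifier: per-frame keypoint vectors in, word class out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data with the assumed shapes, just to show the training call.
x = np.random.rand(8, SEQ_LEN, NUM_FEATURES).astype("float32")
y = np.random.randint(0, NUM_WORDS, size=(8,))
model.fit(x, y, epochs=1, verbose=0)
```

In an online setting, the same model would be applied to a sliding window of the most recent keypoint frames, with precision, recall, and the F1-score computed per word class from the resulting predictions.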