Gait disabilities are among the most frequent impairments worldwide. Their treatment increasingly relies on rehabilitation therapies in which smart walkers are introduced to support the user's recovery and autonomy while reducing the clinicians' effort. To do so, these devices must decode human motion and needs as early as possible. Current walkers decode motion intention from wearable or embedded sensors, namely inertial measurement units, force sensors, Hall sensors, and lasers, whose main limitations either make the solution expensive or hinder the perception of human movement. Smart walkers also commonly lack an advanced, seamless human–robot interaction that understands human motion intuitively and promptly. This work proposes a contactless approach that frames human motion decoding as an early action recognition/detection problem using RGB-D cameras. We studied different deep learning-based algorithms, organised into three approaches, to process lower-body RGB-D video sequences recorded by a camera embedded in a smart walker and classify them into four classes (stop, walk, turn right, and turn left). A custom dataset of 15 healthy participants walking with the device was acquired and prepared, yielding 28,800 class-balanced RGB-D frames for training and evaluating the deep learning networks. The best results were attained by a convolutional neural network with a channel-wise attention mechanism, reaching accuracies of 99.61% for offline early detection/recognition and above 93% in trial simulations. Following the hypothesis that lower-body features encode prominent information and foster more robust predictions for real-time applications, the focus of the algorithms was also evaluated quantitatively with the Dice metric, yielding values slightly above 30%. Overall, early action detection proved a promising human motion decoding strategy, with enhancements in the focus of the proposed architectures.
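The abstract names a channel-wise attention mechanism but not its implementation. As a point of reference, the sketch below shows one common instance of channel-wise attention, a squeeze-and-excitation-style block in PyTorch; the class name, reduction ratio, and tensor sizes are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of channel-wise attention (squeeze-and-excitation style).
# Hyperparameters (reduction ratio, feature sizes) are assumptions for
# illustration only, not the paper's reported configuration.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average
        self.fc = nn.Sequential(                 # excitation: per-channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)              # (B, C) channel descriptors
        w = self.fc(w).view(b, c, 1, 1)          # channel weights in [0, 1]
        return x * w                             # reweight the feature maps

# Example: a 4-class head (stop, walk, turn right, turn left) over
# attended feature maps, e.g. from RGB-D inputs.
features = torch.randn(8, 64, 56, 56)
attended = ChannelAttention(64)(features)
logits = nn.Linear(64, 4)(attended.mean(dim=(2, 3)))   # (8, 4)
```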
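The Dice-based focus evaluation is likewise described only at a high level. One plausible reading is an overlap score between a binarised model saliency map (e.g. from Grad-CAM) and a lower-body ground-truth mask; the threshold and random inputs below are assumptions for demonstration.

```python
# Hedged sketch: Dice overlap between a binarised saliency map and a
# lower-body mask. The 0.5 threshold and toy inputs are assumptions.
import numpy as np

def dice(saliency: np.ndarray, mask: np.ndarray, thr: float = 0.5) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary maps A and B."""
    a = saliency >= thr
    b = mask.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Toy usage; real inputs would be a network saliency map and a
# lower-body segmentation mask of the same spatial shape.
rng = np.random.default_rng(0)
print(dice(rng.random((224, 224)), rng.random((224, 224)) > 0.5))
```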