Human action recognition is a key component of modern artificial intelligence systems and greatly enhances the control of various robots, such as rehabilitation robots and industrial robots. Existing action recognition algorithms mainly depend on a predefined spatial sequence codebook, which may fail to discover the discriminative spatial–temporal features that robots need to mimic human actions. In this paper, we propose to engineer spatial–temporal action features that deeply encode the similarity of within-class human actions and the dissimilarity of between-class human actions. Specifically, given a set of training action video samples, we first segment each video into multiple key sections based on the human contour; these sections span both time and space. Then, local action and appearance information are combined using a cerebellar model articulation controller (CMAC) to represent each video section. We quantize these extracted features into a feature vector that represents category-specific human actions. Subsequently, we develop an improved linear discriminant analysis to project the data points into a subspace in which points with the same label lie close together while points with different labels lie far apart. Experimental results on the well-known HMDB51 and KTH datasets demonstrate the effectiveness and robustness of our method. Moreover, our action recognition method can serve as a key module in real-world rehabilitation robots.
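
As a point of reference only (the abstract does not give the exact form of the improved objective), the discriminative projection step can be sketched with the classical Fisher criterion; the scatter matrices $S_w$, $S_b$, the class means $\mu_c$, and the projection $W$ below are assumed notation, not the paper's:

$$
S_w = \sum_{c=1}^{C}\sum_{x_i \in \mathcal{X}_c} (x_i - \mu_c)(x_i - \mu_c)^{\top},
\qquad
S_b = \sum_{c=1}^{C} n_c\,(\mu_c - \mu)(\mu_c - \mu)^{\top},
$$
$$
W^{*} = \arg\max_{W}\ \operatorname{tr}\!\left(\left(W^{\top} S_w W\right)^{-1} W^{\top} S_b W\right),
$$

where $\mu_c$ and $n_c$ are the mean and size of class $c$ and $\mu$ is the global mean. Maximizing this trace ratio pulls same-label feature vectors together while pushing different-label vectors apart, which is the behaviour the improved linear discriminant analysis is designed to achieve.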