3D human action recognition has received increasing attention due to its potential applications in video surveillance. To guarantee satisfactory performance, previous studies are mainly based on supervised methods, which incur substantial manual annotation costs. In addition, general deep networks for video sequences suffer from heavy computational costs and thus cannot satisfy the basic requirements of embedded systems. In this paper, a novel Motion Guided Attention Learning (MG-AL) framework is proposed, which formulates action representation learning as a self-supervised motion-attention prediction problem. Specifically, MG-AL is a lightweight network. A set of simple motion priors (e.g., intra-joint variance, inter-frame deviation, and cross-joint covariance), which minimizes additional parameters and computational overhead, serves as the supervisory signal to guide attention generation. The encoder is trained by predicting multiple self-attention tasks to capture action-specific feature representations. Extensive evaluations are performed on three challenging benchmark datasets (NTU-RGB+D 60, NTU-RGB+D 120, and NW-UCLA). The proposed method achieves superior performance compared to state-of-the-art methods while incurring very low computational cost.
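To make the motion priors concrete, below is a minimal NumPy sketch of how such statistics could be computed from a skeleton sequence. The function name `motion_priors`, the `(T, J, C)` tensor layout, and the exact formulas are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def motion_priors(seq):
    """Compute simple motion statistics from a skeleton sequence.

    seq: array of shape (T, J, C) -- T frames, J joints, C coordinates.
    Returns per-joint statistics that could serve as self-supervision
    targets. The formulas are illustrative, not the paper's exact ones.
    """
    T, J, C = seq.shape
    # Intra-joint variance: variance of each joint's trajectory over time.
    intra_joint_var = seq.var(axis=0)                            # (J, C)
    # Inter-frame deviation: mean magnitude of frame-to-frame displacement.
    inter_frame_dev = np.abs(np.diff(seq, axis=0)).mean(axis=0)  # (J, C)
    # Cross-joint covariance: covariance between joint coordinate
    # trajectories, flattened to one variable per joint-coordinate pair.
    flat = seq.reshape(T, J * C)                                 # (T, J*C)
    cross_joint_cov = np.cov(flat, rowvar=False)                 # (J*C, J*C)
    return {"intra_joint_var": intra_joint_var,
            "inter_frame_dev": inter_frame_dev,
            "cross_joint_cov": cross_joint_cov}

# Example: a random 30-frame clip with 25 joints in 3D (NTU-style layout).
priors = motion_priors(np.random.rand(30, 25, 3))
```

Because these statistics are closed-form reductions over the input sequence, generating the supervisory signal adds no learnable parameters and negligible computation, which is consistent with the lightweight design described above.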