The recently released low-cost Kinect opens up new opportunities for research in human action recognition by providing both color images and depth maps. However, how to exploit and fuse useful features from these heterogeneous sources remains a challenging problem. In this paper, we propose a novel and effective framework that substantially improves the performance of human action recognition using both RGB videos and depth maps. The key contribution is a sparse coding-based temporal pyramid matching approach (ScTPM) for feature representation. Thanks to the pyramid structure and the sparse representation of extracted features, temporal information is well preserved and approximation error is reduced. In addition, a novel Center-Symmetric Motion Local Ternary Pattern (CS-Mltp) descriptor is proposed to capture spatial-temporal features from RGB videos at low computational cost. Using the ScTPM-represented 3D joint features and the CS-Mltp features, we explore both feature-level and classifier-level fusion, which further improves recognition accuracy. We evaluate the proposed feature extraction, representation, classification, and fusion framework on two challenging human action datasets, MSR-Action3D and MSR-DailyActivity3D. Experimental results show that our approaches consistently outperform state-of-the-art methods, by 6% and 7% on the two datasets, respectively.