The emergence of cost-effective depth sensors opens up a new dimension for RGB-D based human action recognition. In this paper, we propose a collaborative multimodal feature learning (CMFL) model for human action recognition from RGB-D sequences. Specifically, we propose a robust spatio-temporal pyramid feature (RSTPF) to capture dynamic local patterns around each human joint. The proposed CMFL model fuses multimodal data (skeleton, depth, and RGB) and learns action classifiers from the fused features. The original low-level feature matrices are factorized into shared features and modality-specific features in a supervised fashion. The shared features describe the common structures among the three modalities, while the modality-specific features capture the intrinsic information of each modality. We formulate shared-specific feature mining and action classifier learning in a unified max-margin framework, and solve the formulation with an iterative optimization algorithm. Experimental results on four action datasets demonstrate the efficacy of the proposed method.
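
To make the shared-specific factorization concrete, the sketch below illustrates one plausible reading of the decomposition: each modality's feature matrix is approximated by a shared code matrix (common to skeleton, depth, and RGB) plus a modality-specific term, estimated by alternating ridge-regularized least-squares updates. All names (D_shared, D_spec, Z, S) and the plain least-squares updates are illustrative assumptions; the paper's actual model couples this factorization with max-margin classifier learning in a single objective, which is not reproduced here.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation) of a shared/specific
# factorization for multimodal features. Each modality m provides a feature
# matrix X[m] of shape (d_m, n) over the same n training samples, and we fit
#     X[m] ~= D_shared[m] @ Z + D_spec[m] @ S[m],
# where Z (shared codes) is common to all modalities and S[m] is specific to m.

def factorize(X, k_shared=20, k_spec=10, n_iter=50, lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = next(iter(X.values())).shape[1]
    Z = rng.standard_normal((k_shared, n))
    D_shared = {m: rng.standard_normal((Xm.shape[0], k_shared)) for m, Xm in X.items()}
    D_spec = {m: rng.standard_normal((Xm.shape[0], k_spec)) for m, Xm in X.items()}
    S = {m: rng.standard_normal((k_spec, n)) for m in X}

    def ridge(A, B, lam):
        # Solve min_W ||A - W B||_F^2 + lam ||W||_F^2, i.e.
        # W = A B^T (B B^T + lam I)^{-1}.
        k = B.shape[0]
        return A @ B.T @ np.linalg.inv(B @ B.T + lam * np.eye(k))

    for _ in range(n_iter):
        # Update modality-specific dictionaries and codes on the residual
        # left after removing the shared part.
        for m, Xm in X.items():
            R = Xm - D_shared[m] @ Z
            D_spec[m] = ridge(R, S[m], lam)
            S[m] = ridge(R.T, D_spec[m].T, lam).T
        # Update the shared dictionaries per modality, then the common codes Z
        # by stacking all modalities; the stacking is what ties them together.
        R = {m: Xm - D_spec[m] @ S[m] for m, Xm in X.items()}
        for m in X:
            D_shared[m] = ridge(R[m], Z, lam)
        D_stack = np.vstack([D_shared[m] for m in X])
        R_stack = np.vstack([R[m] for m in X])
        Z = ridge(R_stack.T, D_stack.T, lam).T

    return D_shared, D_spec, Z, S
```

In this reading, the concatenation of the shared codes Z with the modality-specific codes S[m] would serve as the fused representation fed to the action classifiers; in the paper itself, the factorization and the max-margin classifiers are learned jointly rather than in this two-stage fashion.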