Micro-expressions (MEs) are spontaneous, involuntary, and subtle facial reactions that often reveal a person's genuine emotions. Automatic ME recognition is becoming increasingly important in areas such as clinical diagnosis and security. However, the short duration and low spatial intensity of MEs make accurate recognition challenging. In addition, the scarcity and imbalance of spontaneous ME data make the problem even harder, so adaptive modeling strategies are urgently needed. To this end, this paper draws inspiration from few-shot learning and proposes a novel two-stage learning method (i.e., prior learning and target learning) based on a Siamese 3D convolutional neural network for ME recognition (MERSiamC3D). Specifically, in the prior-learning stage, the proposed MERSiamC3D extracts generic ME features. In the target-learning stage, the structure and parameters of MERSiamC3D are carefully adjusted, and the Focal Loss is adopted for high-level feature learning. Furthermore, to effectively retain the spatiotemporal information of the original ME video, an adaptive construction method based on an adaptive convolutional neural network is proposed to build a key-frame sequence that summarizes the original video, dropping redundant frames and highlighting the movement around the apex frame. These key frames then serve as the input to the two-stage learning method. Finally, extensive experiments on three publicly available ME datasets show that the proposed method outperforms traditional methods and other deep-learning baselines, providing novel insight into how to leverage scarce data for ME recognition.
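As a minimal illustration of the Focal Loss mentioned above, the sketch below implements its standard per-sample form, -α(1 - p)^γ · log(p), where p is the predicted probability of the true class. The values of α and γ are illustrative defaults, not taken from the paper, and this is an assumed plain-Python sketch rather than the authors' implementation.

```python
import math

def focal_loss(p_true: float, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal Loss for one sample: -alpha * (1 - p)^gamma * log(p),
    where p is the model's predicted probability of the true class.
    alpha and gamma here are common illustrative defaults."""
    eps = 1e-12  # guard against log(0)
    p = min(max(p_true, eps), 1.0 - eps)
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# An easy, well-classified sample (p = 0.9) is down-weighted far more
# than a hard sample (p = 0.1), which helps counter the class imbalance
# typical of scarce spontaneous ME data.
easy = focal_loss(0.9)
hard = focal_loss(0.1)
```

The (1 - p)^γ modulating factor is what distinguishes Focal Loss from plain cross-entropy: as γ grows, confidently classified samples contribute ever less to the gradient, so training focuses on the rare, hard examples.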