Heart rate (HR) is an important indicator of overall physical and mental health and plays a crucial role in diagnosing cardiovascular and neurological diseases. Recent research has shown that variations in the light absorbed by human skin over the cardiac cycle, caused by changes in blood volume and captured in facial video, can be used for non-contact HR estimation. However, most existing methods rely on a single video modality (such as RGB or near-infrared (NIR)), which often yields suboptimal results due to noise and the limitations of a single information source. To overcome these challenges, this paper proposes a multimodal information fusion architecture, the spatiotemporal sensitive network (SS-Net), for non-contact HR estimation. First, spatiotemporal feature maps are used to extract physiological signals from RGB and NIR videos effectively. Next, a spatiotemporal sensitive (SS) module is introduced to extract useful physiological signal information from both the RGB and NIR spatiotemporal maps. Finally, a multi-level spatiotemporal context fusion (MLSC) module is designed to fuse and complement information between the visible-light and infrared modalities, and the fused features at different levels are refined in task-specific branches to predict both the remote photoplethysmography (rPPG) signal and HR. Experiments on three datasets demonstrate that the proposed SS-Net achieves superior performance compared with existing methods.
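The abstract describes the pipeline only at a high level; as a concrete illustration, a minimal PyTorch sketch of such a two-stream fusion design is given below. All class names (SSBlock, MLSCFusion, SSNetSketch), layer sizes, the sigmoid gating used to model spatiotemporal sensitivity, and the concatenation-plus-1x1-convolution fusion are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SSBlock(nn.Module):
    """Hypothetical spatiotemporal-sensitive block: 2D convolutions over a
    spatiotemporal map (ROI rows x time columns), followed by a sigmoid gate
    that re-weights temporally informative regions. The gating design is an
    assumption, not the paper's SS module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.conv(x)
        return f * self.gate(f)  # attention-weighted features

class MLSCFusion(nn.Module):
    """Hypothetical multi-level fusion step: concatenate RGB and NIR features
    at one level and mix them with a 1x1 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, rgb_feat, nir_feat):
        return self.mix(torch.cat([rgb_feat, nir_feat], dim=1))

class SSNetSketch(nn.Module):
    """Two-stream sketch: SS blocks per modality, fusion at two levels,
    then task-specific heads for the rPPG waveform and the HR value."""
    def __init__(self, T=300):
        super().__init__()
        self.rgb1, self.nir1 = SSBlock(3, 32), SSBlock(1, 32)
        self.rgb2, self.nir2 = SSBlock(32, 64), SSBlock(32, 64)
        self.fuse1, self.fuse2 = MLSCFusion(32), MLSCFusion(64)
        self.pool = nn.AdaptiveAvgPool2d((1, T))  # collapse ROI axis, keep time
        self.rppg_head = nn.Conv1d(32 + 64, 1, kernel_size=1)  # waveform head
        self.hr_head = nn.Sequential(               # scalar HR head
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32 + 64, 1))

    def forward(self, rgb_map, nir_map):
        # rgb_map: (B, 3, R, T) and nir_map: (B, 1, R, T) spatiotemporal maps
        r1, n1 = self.rgb1(rgb_map), self.nir1(nir_map)
        f1 = self.fuse1(r1, n1)                  # level-1 fused features
        r2, n2 = self.rgb2(r1), self.nir2(n1)
        f2 = self.fuse2(r2, n2)                  # level-2 fused features
        t1 = self.pool(f1).squeeze(2)            # (B, 32, T)
        t2 = self.pool(f2).squeeze(2)            # (B, 64, T)
        feats = torch.cat([t1, t2], dim=1)       # multi-level temporal features
        rppg = self.rppg_head(feats).squeeze(1)  # (B, T) rPPG waveform
        hr = self.hr_head(feats).squeeze(1)      # (B,) HR estimate
        return rppg, hr

# Example: 300-frame spatiotemporal maps built from 25 facial ROIs
rgb = torch.randn(2, 3, 25, 300)
nir = torch.randn(2, 1, 25, 300)
rppg, hr = SSNetSketch(T=300)(rgb, nir)
print(rppg.shape, hr.shape)  # torch.Size([2, 300]) torch.Size([2])
```

In this sketch, the two output heads share multi-level fused features, which mirrors the abstract's description of refining different levels of fused features in task-specific branches for the rPPG and HR predictions.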