Heart rate (HR), heart rate variability (HRV), and respiratory rate (RR) are vital physiological signals that reveal the physiological and psychological states of human beings. Recent studies have demonstrated that these signals can be estimated in a non-contact way from visible or infrared facial videos based on remote photoplethysmography (rPPG). However, most existing methods use only one data modality and output a single type of physiological signal at a time. To overcome these restrictions, we propose a multi-task framework named SMP-Net that performs multimodal fusion to realize non-contact estimation of multiple physiological signals. First, a joint attention feature fusion (JAFF) module is designed to encode and fuse features from visible and infrared videos; it comprehensively exploits modality-wise, spatial, and channel-wise information. Then, a task-oriented feature refinement (TOFR) module is developed to extract task-specific features by refining the shared features, improving estimation performance. Finally, the proposed SMP-Net is validated on the MMVS, VIPL-HR, and UBFC-rPPG datasets and outperforms state-of-the-art methods. On the MMVS dataset, SMP-Net achieves a mean absolute error (MAE) of 1.12 bpm for HR estimation, a correlation coefficient (ρ) of 0.58 for rPPG estimation, and an MAE of 2.08 for RR estimation. On the VIPL-HR and UBFC-rPPG datasets, the MAE of HR estimation is 2.03 bpm and 0.59 bpm, respectively. The proposed SMP-Net thus holds great promise for continuous non-contact estimation of multiple physiological signals.
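To make the fusion idea concrete, the sketch below shows one plausible reading of a JAFF-style block in PyTorch: channel-wise and spatial attention are applied to each modality's feature map, and a softmax-normalized modality weight then blends the visible and infrared streams. This is a minimal illustration under our own assumptions; the class name, layer sizes, and gating choices are hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn


class JointAttentionFusion(nn.Module):
    """Illustrative JAFF-style block: fuses visible and infrared feature
    maps via channel-wise, spatial, and modality-wise attention.
    All layer sizes here are assumptions, not the paper's configuration."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze-and-excitation-style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over mean/max channel descriptors.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Modality attention: two softmax weights predicted from the
        # concatenated global descriptors of both modalities.
        self.modality_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(2 * channels, 2),
            nn.Softmax(dim=1),
        )

    def _attend(self, x: torch.Tensor) -> torch.Tensor:
        # Apply channel gating, then spatial gating.
        x = x * self.channel_gate(x)
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial_gate(pooled)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, C, H, W) feature maps from the two video encoders.
        vis, ir = self._attend(vis), self._attend(ir)
        w = self.modality_gate(torch.cat([vis, ir], dim=1))  # (B, 2)
        w = w.view(-1, 2, 1, 1, 1)
        # Weighted modality fusion into a shared feature map.
        return w[:, 0] * vis + w[:, 1] * ir


if __name__ == "__main__":
    fuse = JointAttentionFusion(channels=64)
    out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In this reading, the fused map would then feed the TOFR stage, where per-task branches (HR, rPPG, RR) refine the shared features; the design choice of gating each modality before blending keeps the modality weights interpretable as a soft trust score between the visible and infrared streams.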