Monitoring oxygen saturation (SpO2) is important in healthcare, especially for diagnosing and managing pulmonary diseases. Non-contact approaches broaden the potential applications of SpO2 measurement by offering better hygiene, greater comfort, and the capability for long-term monitoring. However, existing studies often face challenges such as low signal-to-noise ratios and stringent environmental requirements. We aim to develop and validate a contactless SpO2 measurement approach using 3D convolutional neural networks (3D CNN) and 3D visible-near-infrared (VIS-NIR) multimodal imaging, to offer a convenient, accurate, and robust alternative for SpO2 monitoring. We propose an approach that uses a 3D VIS-NIR multimodal camera system to capture facial videos, from which SpO2 is estimated by a 3D CNN that simultaneously extracts spatial and temporal features. The approach comprises registration of the multimodal images, tracking of a 3D region of interest, spatial and temporal preprocessing, and 3D CNN-based feature extraction and regression. In a breath-holding experiment involving 23 healthy participants, we obtained multimodal video data with reference SpO2 values ranging from 80% to 99%, measured by a fingertip pulse oximeter. The approach achieved a mean absolute error (MAE) of 2.31% and a Pearson correlation coefficient of 0.64 in this experiment, demonstrating good agreement with traditional pulse oximetry. The discrepancy of the estimated SpO2 values was within 3% of the reference for of all 1-s time points. Moreover, in clinical trials involving patients with sleep apnea syndrome, our approach demonstrated robust performance, with an MAE of less than 2% in SpO2 estimation compared with gold-standard polysomnography. The proposed approach offers a promising alternative for non-contact oxygen saturation measurement with good sensitivity to desaturation, showing potential for applications in clinical settings.
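
To make the regression stage described above concrete, the sketch below shows a minimal 3D CNN that maps a registered, preprocessed VIS-NIR facial video clip to a single SpO2 estimate, together with the MAE and Pearson-correlation metrics used for evaluation. It is written in PyTorch; the clip shape, channel count, layer sizes, and the names SpO2Regressor3D and evaluate are illustrative assumptions, not the architecture or code used in the study.

```python
# A minimal sketch, assuming PyTorch: a small 3D CNN that regresses one SpO2 value
# from a registered, preprocessed multimodal facial video clip, plus the MAE and
# Pearson-correlation metrics used for evaluation. Layer sizes, clip shape, channel
# count, and all names here are illustrative assumptions, not the study's exact model.
import torch
import torch.nn as nn


class SpO2Regressor3D(nn.Module):
    """Maps a clip of shape (channels, frames, height, width) to a single SpO2 value."""

    def __init__(self, in_channels: int = 4):
        # in_channels = 4 assumes, e.g., three VIS channels plus one registered NIR channel.
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # downsample spatially, keep temporal resolution
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),          # downsample jointly in space and time
            nn.AdaptiveAvgPool3d(1),              # global spatio-temporal average pooling
        )
        self.regressor = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x).flatten(1)
        return self.regressor(x).squeeze(-1)      # one SpO2 estimate per clip


def evaluate(pred: torch.Tensor, ref: torch.Tensor) -> tuple[float, float]:
    """Mean absolute error (in %SpO2) and Pearson correlation against reference values."""
    mae = (pred - ref).abs().mean().item()
    pearson_r = torch.corrcoef(torch.stack([pred, ref]))[0, 1].item()
    return mae, pearson_r


if __name__ == "__main__":
    # Dummy batch: 8 clips of 64 frames at 64x64 pixels with 4 registered channels.
    model = SpO2Regressor3D(in_channels=4).eval()
    clips = torch.randn(8, 4, 64, 64, 64)
    with torch.no_grad():
        estimates = model(clips)
    reference = torch.linspace(80.0, 99.0, 8)     # stand-in for fingertip-oximeter readings
    mae, r = evaluate(estimates, reference)
    print(f"MAE = {mae:.2f}%, Pearson r = {r:.2f}")
```

The global average pooling collapses the spatio-temporal feature maps to a fixed-length vector, so the same regressor can be applied to clips of varying length; whether the original system does this or uses fixed-length windows is not specified in the abstract.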