Deep learning based on convolutional neural networks (CNNs) has shown promising results in various vision-based applications and, recently, in camera-based vital signs monitoring. So far, CNN-based photoplethysmography (PPG) extraction has focused on performance rather than understanding. In this paper, we address four questions through experiments aimed at improving our understanding of this methodology as it gains popularity. We conclude that the network exploits the blood absorption variation to extract the physiological signals, and that the choice and parameters (phase, spectral content, etc.) of the reference signal may be more critical than anticipated. The availability of multiple convolutional kernels is necessary for the CNN to arrive at a flexible channel combination through the spatial operation, but may not provide the same motion robustness as a multi-site measurement using knowledge-based PPG extraction. We also find that PPG-related prior knowledge may still be helpful for CNN-based PPG extraction, and recommend further investigation of hybrid CNN-based methods that include prior knowledge in their design.