Video foreground detection (VFD), as one of the basic pre-processing tasks, is very essential for subsequent target tracking and recognition. However, due to the interference of shadow, dynamic background, and camera jitter, constructing a suitable detection network is still challenging. Recently, convolution neural networks have proved its reliability in many fields with their powerful feature extraction ability. Therefore, an interactive spatio-temporal feature learning network (ISFLN) for VFD is proposed in this paper. First, we obtain the deep and shallow spatio-temporal information of two paths with multi-level and multi-scale. The deep feature is conducive to enhancing feature identification capabilities, while the shallow feature is dedicated to fine boundary segmentation. Specifically, an interactive multi-scale feature extraction module (IMFEM) is designed to facilitate the information transmission between different types of features. Then, a multi-level feature enhancement module (MFEM), which provides precise object knowledge for decoder, is proposed to guide the coding information of each layer by the fusion spatio-temporal difference characteristic. Experimental results on LASIESTA, CDnet2014, INO, and AICD datasets demonstrate that the proposed ISFLN is more effective than the existing advanced methods.