Video saliency prediction aims to model human visual attention by identifying the most relevant and significant elements in a video frame or sequence. The task becomes notably intricate in scenarios characterized by dynamic elements such as rapid motion, occlusion, blur, background variation, and nonrigid deformation. The inherent complexity of human visual attention in dynamic scenes therefore requires that both temporal and spatial information be taken into account. Existing video saliency frameworks often falter under such conditions, and relying solely on image saliency models neglects the crucial temporal information in videos. This study presents a new model, MSB-Net (Multi-level Spatiotemporal Bidirectional Network using Multi-scale Transfer Learning), for video salient object detection, i.e., identifying the most significant objects in videos. For a given sequence of frames, MSB-Net employs multi-scale transfer learning within an encoder-decoder architecture to learn spatial and temporal saliency-map features. The model combines bidirectional LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) components, with VGG16 and VGG19 (Visual Geometry Group) backbones extracting multi-scale features from the input video frames. Evaluation on diverse datasets, namely DAVIS-T, SegTrack-V2, ViSal, VOS-T, and DAVSOD-T, demonstrates the model's effectiveness, outperforming other competitive models on metrics such as MAE, F-measure, and S-measure.
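To make the general pipeline concrete, the following is a minimal, hypothetical sketch of the kind of architecture the abstract describes: a VGG16 spatial encoder applied frame by frame, a bidirectional LSTM over the resulting per-frame features, and a simple decoder that upsamples to a per-frame saliency map. All layer sizes, the single-scale pooling, and the decoder design are illustrative assumptions, not the authors' exact MSB-Net configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SpatioTemporalSaliencySketch(nn.Module):
    """Illustrative encoder-BiLSTM-decoder saliency model (not the authors' MSB-Net)."""

    def __init__(self, hidden=256):
        super().__init__()
        # Spatial encoder: VGG16 convolutional backbone (pretrained weights optional).
        self.encoder = vgg16(weights=None).features      # (B*T, 512, H/32, W/32)
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.flat_dim = 512 * 7 * 7
        # Temporal model: bidirectional LSTM over per-frame feature vectors.
        self.bilstm = nn.LSTM(self.flat_dim, hidden, batch_first=True,
                              bidirectional=True)
        # Decoder: project temporal features to a coarse 7x7 saliency grid.
        self.decoder = nn.Linear(2 * hidden, 7 * 7)

    def forward(self, frames):                            # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w))
        feats = self.pool(feats).reshape(b, t, -1)        # (B, T, 512*7*7)
        temporal, _ = self.bilstm(feats)                  # (B, T, 2*hidden)
        coarse = self.decoder(temporal).reshape(b * t, 1, 7, 7)
        # Upsample the coarse prediction back to the input resolution.
        saliency = torch.sigmoid(
            nn.functional.interpolate(coarse, size=(h, w),
                                      mode="bilinear", align_corners=False))
        return saliency.reshape(b, t, 1, h, w)            # per-frame saliency maps

# Example: 2 clips of 4 frames at 224x224 yield 4 saliency maps per clip.
maps = SpatioTemporalSaliencySketch()(torch.rand(2, 4, 3, 224, 224))
print(maps.shape)  # torch.Size([2, 4, 1, 224, 224])
```

A full implementation in the spirit of the paper would additionally fuse features from multiple VGG16/VGG19 stages (multi-scale) and use a convolutional decoder; the sketch above only shows the spatial-then-temporal processing order.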