Spatiotemporal saliency detection is central to multimedia content analysis, yet existing methods often struggle to capture the complex spatial and temporal patterns in video efficiently. To address this, we propose a multi-modal GraphNet learning-based feature extraction approach that fuses cues from the spatial and temporal domains to improve saliency detection accuracy; the graph network models the relationships among video frames. Evaluated on a diverse set of multimedia videos, our method achieves an average precision of 0.85 and a recall of 0.78, outperforming state-of-the-art techniques, and it remains robust across varied video types and scenarios. This work advances multimedia analysis by offering a practical approach to understanding visual content in videos.
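As a rough illustration of the graph-based modeling step, the sketch below treats frames as graph nodes, builds a soft adjacency from feature similarity, and performs one round of message passing over fused spatial and temporal features. This is a minimal sketch under stated assumptions: the additive fusion, cosine-similarity adjacency, and layer sizes are illustrative choices, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGraphLayer(nn.Module):
    """One message-passing step over a graph of video frames.

    Illustrative sketch only: frames are nodes, edges are soft
    cosine-similarity weights, and a single linear transform
    aggregates neighbor features. The actual method may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, dim) fused per-frame features
        sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
        adj = F.softmax(sim, dim=-1)   # soft adjacency over frames
        agg = adj @ self.proj(x)       # aggregate neighbor features
        return F.relu(x + agg)         # residual node update

# Hypothetical usage: fuse per-frame spatial and temporal descriptors,
# then refine them with the graph layer before saliency prediction.
num_frames, dim = 16, 128
spatial = torch.randn(num_frames, dim)   # e.g. appearance features
temporal = torch.randn(num_frames, dim)  # e.g. motion/flow features
fused = spatial + temporal               # simple additive fusion (assumption)
refined = FrameGraphLayer(dim)(fused)    # (16, 128) graph-refined features
```

The residual update keeps each frame's own features while mixing in context from similar frames, which is one common way graph layers propagate information across a video sequence.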