Time series forecasting is closely tied to industrial production and daily life and has attracted sustained attention. Improving long-term multivariate time series forecasting (MTSF) remains highly challenging, because it requires mining complicated and obscure temporal patterns from multiple perspectives. To this end, this paper proposes VTNet, a long-term forecasting model based on multi-domain fusion that adaptively captures and refines multi-scale intra- and inter-variate dependencies. In contrast to previous techniques, we devise a dual-stream learning architecture. First, the fast Fourier transform (FFT) is adopted to extract frequency-domain information. The original sequences are then transformed into 2D visual features in the time-frequency domain, and a 2D-TBlock is designed for multi-scale dynamic learning. Second, a combination of convolutional and recurrent networks further explores local temporal features while preserving the global trend. Finally, multi-modal circulant fusion is applied to obtain a more comprehensive and enriched fused feature representation, further improving overall performance. Extensive experiments on nine public benchmark datasets and a real-world irrigation water-level dataset demonstrate VTNet's improved performance and generalization. Moreover, VTNet yields 46.93% and 25.36% relative improvements for water-level forecasting, revealing its potential application value in water-saving planning and early warning of extreme events.
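As a rough illustration of the FFT step and the 1D-to-2D transformation described above, the sketch below picks the dominant periods of a series from its FFT amplitude spectrum and folds the series into one 2D array per period. This is a common period-folding scheme and is only an assumption about how such a transformation might look; it is not VTNet's actual implementation, and the names `dominant_periods` and `fold_to_2d` as well as the synthetic series are hypothetical.

```python
import numpy as np

def dominant_periods(x, k=2):
    """Return the k dominant periods (in time steps) of a 1D series.

    Assumption: periods are ranked by FFT amplitude; VTNet may select them differently.
    """
    n = len(x)
    amp = np.abs(np.fft.rfft(x - x.mean()))  # amplitude spectrum of the centered series
    amp[0] = 0.0                             # ignore the zero-frequency (DC) bin
    top = np.argsort(amp)[::-1][:k]          # frequency indices with the largest amplitude
    return [int(n // f) for f in top if f > 0]

def fold_to_2d(x, period):
    """Zero-pad the series to a multiple of `period` and reshape to (num_cycles, period)."""
    n = len(x)
    num_cycles = -(-n // period)             # ceiling division
    padded = np.zeros(num_cycles * period, dtype=float)
    padded[:n] = x
    return padded.reshape(num_cycles, period)

# Hypothetical input: 96 steps with a strong 24-step cycle and a weaker 8-step cycle.
t = np.arange(96)
series = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 8)

periods = dominant_periods(series, k=2)
images = [fold_to_2d(series, p) for p in periods]
```

In each folded array, a row holds one cycle and a column holds the same phase across cycles, so a 2D convolutional block can see within-period and across-period variation at the same time.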