Enhancing the efficiency and safety of intelligent transportation systems requires effective modeling and prediction of citywide traffic dynamics. Most studies employ convolutional neural networks (CNNs) with a 3D convolutional structure or spatio-temporal models with self-attention mechanisms to capture the spatio-temporal information of traffic distribution. Although 3D CNNs excel at capturing local contextual information, they are computationally expensive due to their large number of parameters and cannot capture long-range dependencies. By contrast, self-attention mechanisms, originally designed for natural language processing, can capture long-range dependencies, but applying them to 2D image structures requires flattening the inherent 2D context into a 1D sequence, which increases computational complexity and neglects the adaptivity between local contextual information and channels. Accordingly, we propose the spatio-temporal visual attention neural network (STVANet), a novel spatio-temporal 2D CNN that integrates a unique visual attention module combining a large kernel attention (LKA) mechanism, a squeeze-and-excitation (SE) mechanism, and a feedforward component to capture long-range dependencies and channel information in urban traffic data while preserving the 2D image structure. LKA-based spatio-temporal attention networks extract spatial and temporal features from weekly, daily, and recent hourly periods and aggregate them, with weighted consideration of external features, to make predictions. Evaluation on real-world datasets demonstrates STVANet’s superiority over baseline models, showcasing its potential for citywide traffic prediction.
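
The abstract does not specify the exact module configuration, so the following is a minimal sketch of how a visual attention block of this kind could be assembled, assuming the standard LKA decomposition (depthwise, dilated depthwise, and pointwise convolutions, as in Guo et al.'s Visual Attention Network) and the standard SE block (Hu et al.). The composition order, the feedforward design, and the parametric fusion of the weekly/daily/recent branches are illustrative assumptions, not STVANet's confirmed architecture.

```python
import torch
import torch.nn as nn


class LargeKernelAttention(nn.Module):
    """LKA sketch: a 5x5 depthwise conv, a 7x7 dilated depthwise conv
    (dilation 3), and a 1x1 pointwise conv approximate a large receptive
    field; the result gates the input elementwise, yielding long-range
    spatial attention on the intact 2D grid."""

    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # elementwise gating preserves the 2D structure


class SqueezeExcitation(nn.Module):
    """SE sketch: global average pooling squeezes each channel to a scalar;
    a small bottleneck MLP produces per-channel weights that rescale the
    feature map, supplying the channel adaptivity the abstract highlights."""

    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * w


class VisualAttentionBlock(nn.Module):
    """Hypothetical composition: LKA (long-range spatial attention) ->
    SE (channel attention) -> pointwise feedforward, each stage wrapped
    in a residual connection."""

    def __init__(self, dim: int, ffn_expansion: int = 4):
        super().__init__()
        self.lka = LargeKernelAttention(dim)
        self.se = SqueezeExcitation(dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * ffn_expansion, 1),
            nn.GELU(),
            nn.Conv2d(dim * ffn_expansion, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.se(self.lka(x))
        return x + self.ffn(x)


class WeightedFusion(nn.Module):
    """Illustrative fusion of the three temporal branches: learnable
    elementwise weights aggregate weekly, daily, and recent features, with
    an external-feature embedding added before the prediction head (an
    assumption based on the abstract's description)."""

    def __init__(self, shape):
        super().__init__()
        self.w_week = nn.Parameter(torch.ones(shape))
        self.w_day = nn.Parameter(torch.ones(shape))
        self.w_recent = nn.Parameter(torch.ones(shape))

    def forward(self, week, day, recent, external):
        return (self.w_week * week + self.w_day * day
                + self.w_recent * recent + external)
```

Gating the feature map with a convolutional attention tensor, rather than flattening it into a 1D token sequence, is what lets this style of block capture long-range dependencies while keeping the 2D image structure and quadratic-attention cost out of the picture; the SE branch then handles the channel adaptivity that pure spatial attention ignores.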