Multi-site PM2.5 prediction has emerged as an important approach because models that rely solely on data from a single monitoring station are limited in accuracy. However, existing multi-site PM2.5 prediction methods predominantly rely on recurrent networks to extract temporal dependencies and overlook domain knowledge about pollutant dispersion. This study explores whether a better prediction architecture exists that both approximates the prediction performance of recurrent networks using feedforward networks and integrates domain knowledge of PM2.5. To this end, we propose a novel spatio-temporal attention causal convolutional neural network (Causal-STAN) for predicting PM2.5 concentrations at multiple sites in the Yangtze River Delta region of China. Causal-STAN comprises two components: (1) a multi-site spatio-temporal feature integration module, which identifies local temporal correlation trends and spatial correlations in the spatio-temporal data and uses a directional residual block to extract inter-site PM2.5 concentrations that characterize the directional features of PM2.5 dispersion between sites; and (2) a temporal causal attention convolutional network, which captures internal correlation information and long-term dependencies in the time series. Causal-STAN was evaluated on one year of data from 247 sites in mainland China. Compared with six state-of-the-art baseline models, Causal-STAN achieves the best performance in 6-hour-ahead predictions, surpassing the recurrent network model and reducing the prediction error by 8%–10%.
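
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea behind a causal convolution: a feedforward operation whose output at time t depends only on inputs at time t and earlier. It is not the Causal-STAN layer itself. The function name `causal_conv1d`, the single-channel input, and the `kernel` and `dilation` parameters are assumptions made for illustration.

```python
import numpy as np


def causal_conv1d(x, kernel, dilation=1):
    """Hypothetical helper: y[t] = sum_k kernel[k] * x[t - k * dilation].

    Values before the start of the series are treated as zero, so the
    output has the same length as the input and never looks ahead.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)

    # Left-pad only: this is what makes the convolution causal.
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])

    y = np.zeros_like(x)
    for k, w in enumerate(kernel):
        # Slice of the padded series shifted back by k * dilation steps.
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y


# Illustrative hourly readings (made-up numbers, not data from the study).
series = [1.0, 2.0, 3.0, 4.0]
same_step = causal_conv1d(series, kernel=[1.0, 1.0], dilation=1)
two_back = causal_conv1d(series, kernel=[1.0, 1.0], dilation=2)
```

Stacking such layers with kernel size 2 and dilations 1, 2, and 4 gives each output a receptive field of 8 time steps, which is how convolutional models of this general kind can reach long-range dependencies without recurrence.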