Predicting the distribution of people in the time window approaching a disaster is crucial for post-disaster assistance and can support evacuation route selection and shelter planning. However, two major limitations have not yet been addressed: (1) Most spatiotemporal prediction models incorporate spatiotemporal features either directly or indirectly, which leads to high information redundancy in the model parameters and low computational efficiency. (2) These models usually embed a fixed set of basic and external features and cannot adapt the embedded features to the spatiotemporal context or update them in real time; their spatiotemporal feature embedding methods are therefore inflexible and difficult to interpret. To overcome these problems, a lightweight population density distribution prediction framework that considers both basic and external spatiotemporal features is proposed. In this study, an autoencoder is used to extract spatiotemporal coded information and form a spatiotemporal attention mechanism, and the basic and external spatiotemporal feature attention maps are combined by a fusion framework with learnable weights. The fused spatiotemporal attention is then integrated with a ResNet prediction backbone to predict the population distribution. Comparison and ablation experiments show that, relative to classical and widely used spatiotemporal prediction frameworks, the proposed framework improves computational efficiency by fully exploiting the scalability of the model's spatiotemporal features while making the spatiotemporal information more interpretable. This study provides a reference solution, with a multiplier effect, for predicting population distributions in similar regions around the globe.
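To make the described architecture concrete, the following is a minimal PyTorch sketch of the pipeline: an autoencoder-derived spatiotemporal attention map for the basic and external feature streams, a learnable-weight fusion of the two attention maps, and a ResNet backbone for the density prediction. All module names (STAutoencoder, FusedAttentionPredictor), tensor shapes, channel counts, and the choice of ResNet-18 are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class STAutoencoder(nn.Module):
    """Encodes a spatiotemporal feature map and turns the code into an attention map."""
    def __init__(self, in_ch, code_ch=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, code_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(code_ch, code_ch, 3, padding=1), nn.ReLU(),
        )
        # Reconstruction branch, used when training the autoencoder itself.
        self.decoder = nn.Conv2d(code_ch, in_ch, 3, padding=1)
        self.to_attention = nn.Sequential(nn.Conv2d(code_ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        recon = self.decoder(code)
        attn = self.to_attention(code)  # (B, 1, H, W) attention map
        return attn, recon


class FusedAttentionPredictor(nn.Module):
    """Fuses basic and external spatiotemporal attention with learnable weights,
    then predicts a population density grid with a ResNet backbone."""
    def __init__(self, basic_ch, ext_ch, grid_hw=(32, 32)):
        super().__init__()
        self.basic_ae = STAutoencoder(basic_ch)
        self.ext_ae = STAutoencoder(ext_ch)
        self.fusion_w = nn.Parameter(torch.tensor([0.5, 0.5]))  # learnable fusion weights
        self.backbone = resnet18(weights=None)
        # Adapt the backbone to the number of input channels and to a gridded output.
        self.backbone.conv1 = nn.Conv2d(basic_ch, 64, 7, stride=2, padding=3, bias=False)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, grid_hw[0] * grid_hw[1])
        self.grid_hw = grid_hw

    def forward(self, basic_x, ext_x):
        attn_b, _ = self.basic_ae(basic_x)
        attn_e, _ = self.ext_ae(ext_x)
        w = torch.softmax(self.fusion_w, dim=0)      # normalize the learnable weights
        fused_attn = w[0] * attn_b + w[1] * attn_e   # fused spatiotemporal attention
        out = self.backbone(basic_x * fused_attn)    # attention-modulated backbone input
        return out.view(-1, 1, *self.grid_hw)        # predicted density grid


# Hypothetical usage: 6 basic channels, 4 external channels, a 32x32 spatial grid.
model = FusedAttentionPredictor(basic_ch=6, ext_ch=4)
pred = model(torch.randn(2, 6, 32, 32), torch.randn(2, 4, 32, 32))
print(pred.shape)  # torch.Size([2, 1, 32, 32])
```

The learnable weights in `fusion_w` correspond to the fusion framework described above: because they are parameters of the model, how much the basic versus external attention contributes is learned from data rather than fixed by hand, which is one way the framework keeps the spatiotemporal embedding flexible and inspectable.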