This letter addresses the problem of environmental clutter when distinguishing stationary humans from animals in through-wall scenarios. To tackle the difficulty of identifying targets in time-frequency maps, we propose a cross-scale feature aggregation (CSFA) network based on channel-spatial attention, which improves the identification accuracy for stationary humans and animals. Specifically, a life-detection radar is used to collect data, and the synchrosqueezing transform (SST), a time-frequency analysis method, is used to suppress signal noise and generate higher-resolution time-frequency maps. To make full use of the target information, a feature pyramid network (FPN) extracts multilevel feature maps from the time-frequency maps. The CSFA module then extracts detailed micro-Doppler features from these feature maps, and a deep convolutional neural network (CNN) classifies the targets as human or animal. Experimental results show that the proposed model achieves higher accuracy than existing methods.
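
As an illustration of the general idea behind cross-scale aggregation with channel-spatial attention, the following is a minimal NumPy sketch and not the CSFA module itself, whose exact design is not given here. It assumes two pyramid levels with equal channel counts, nearest-neighbor upsampling, additive fusion, and parameter-free attention weights (a sigmoid over pooled activations); all function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def channel_spatial_attention(feat):
    """Reweight a (C, H, W) feature map by channel, then by location.

    Assumption: the weights come from plain average pooling followed by a
    sigmoid. A trained network would use learned layers here instead.
    """
    # Channel attention: one weight per channel from global average pooling.
    channel_w = sigmoid(feat.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1)
    feat = feat * channel_w
    # Spatial attention: one weight per location from the cross-channel mean.
    spatial_w = sigmoid(feat.mean(axis=0, keepdims=True))  # (1, H, W)
    return feat * spatial_w


def upsample_nearest(feat, factor):
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def aggregate_cross_scale(fine, coarse):
    """Fuse a coarse pyramid level into a fine one, then apply attention.

    Assumption: both maps have the same channel count, and the fine
    resolution is an integer multiple of the coarse resolution.
    """
    factor = fine.shape[1] // coarse.shape[1]
    merged = fine + upsample_nearest(coarse, factor)
    return channel_spatial_attention(merged)


# Synthetic stand-ins for two feature pyramid levels (not real radar data).
rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 16, 16))
coarse = rng.standard_normal((8, 8, 8))
fused = aggregate_cross_scale(fine, coarse)
```

Because both attention weights lie strictly between 0 and 1, the fused map keeps the fine-level resolution while attenuating, never amplifying, each merged activation.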