In complex real-world situations, problems such as illumination changes, facial occlusion, and pose variation make facial expression recognition (FER) a challenging task. To address this robustness problem, this paper proposes an adaptive multilayer perceptual attention network (AMP-Net), inspired by facial attributes and the facial perception mechanism of the human visual system. AMP-Net extracts global, local, and salient facial emotional features at different levels of granularity to learn the underlying diversity and key information of facial expressions. Unlike existing methods, AMP-Net adaptively guides the network to focus on multiple finer and more distinguishable local patches, providing robustness to occlusion and pose variation and improving the learning of potential facial diversity information. In addition, the proposed global perception module learns features under different receptive fields in the global perception domain, and AMP-Net supplements salient facial region features with high emotion correlation based on prior knowledge to capture key texture details and avoid the loss of important information. Extensive experiments show that AMP-Net achieves good generalizability and state-of-the-art results on several real-world datasets, including RAF-DB, AffectNet-7, AffectNet-8, SFEW 2.0, FER-2013, and FED-RO, with accuracies of 89.25%, 64.54%, 61.74%, 61.17%, 74.48%, and 71.75%, respectively. All code and training logs are publicly available at https://github.com/liuhw01/AMP-Net.
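
To make the three-branch design described above concrete, the following is a minimal, hypothetical PyTorch sketch of how global multi-receptive-field features, attention-weighted local patches, and salient-region features could be fused for classification. The module names, layer sizes, patch size, and fusion-by-concatenation scheme are illustrative assumptions, not the authors' implementation; the official code is available at the repository linked above.

# Illustrative sketch only (not the authors' code); sizes and fusion are assumptions.
import torch
import torch.nn as nn


class GlobalPerception(nn.Module):
    """Captures global features at several receptive fields (assumed kernel sizes 3/5/7)."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        # One pooled feature vector per receptive-field branch, concatenated.
        feats = [self.pool(torch.relu(b(x))).flatten(1) for b in self.branches]
        return torch.cat(feats, dim=1)                       # (B, 3*out_ch)


class LocalPatchAttention(nn.Module):
    """Scores local patches so informative, unoccluded regions receive higher weight."""
    def __init__(self, in_ch=3, patch=28, feat=64):
        super().__init__()
        self.patch = patch
        self.embed = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.score = nn.Linear(feat, 1)

    def forward(self, x):
        B, C, _, _ = x.shape
        p = self.patch
        # Split the face into a fixed grid of patches (a simplification of adaptive patching).
        patches = x.unfold(2, p, p).unfold(3, p, p)          # (B, C, nH, nW, p, p)
        patches = patches.contiguous().view(B, C, -1, p, p).permute(0, 2, 1, 3, 4)
        emb = torch.stack([self.embed(patches[:, i]) for i in range(patches.size(1))], dim=1)
        attn = torch.softmax(self.score(emb), dim=1)         # attention over patches
        return (attn * emb).sum(dim=1)                       # (B, feat)


class AMPNetSketch(nn.Module):
    """Fuses global, local, and salient-region features for expression classification."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.global_branch = GlobalPerception()
        self.local_branch = LocalPatchAttention()
        self.salient_branch = GlobalPerception()             # applied to a salient crop
        self.classifier = nn.Linear(3 * 64 + 64 + 3 * 64, num_classes)

    def forward(self, face, salient_crop):
        # `salient_crop` stands in for prior-knowledge regions (e.g., a landmark-based crop).
        g = self.global_branch(face)
        l = self.local_branch(face)
        s = self.salient_branch(salient_crop)
        return self.classifier(torch.cat([g, l, s], dim=1))


if __name__ == "__main__":
    model = AMPNetSketch(num_classes=7)
    face = torch.randn(2, 3, 224, 224)
    salient = torch.randn(2, 3, 112, 112)
    print(model(face, salient).shape)                        # torch.Size([2, 7])

In the actual AMP-Net, patch selection is guided adaptively and the salient regions are chosen from prior knowledge of facial attributes; the sketch replaces these with a fixed grid and a precomputed crop purely for brevity.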