In facial landmark localization, facial region initialization plays an important role in guiding the model to learn critical face features. Most facial landmark detectors assume a well-cropped face as input and may underperform in real applications when the input is not well cropped. To alleviate this problem, we present a region-aware deep feature-fused network (RDFN). The RDFN consists of a region detection subnetwork, which explicitly solves the input initialization problem, and a region-wise landmark localization subnetwork, which derives the landmark score maps. To exploit the association between the two tasks, we develop a cross-task feature fusion scheme that extracts multi-semantic region features while trading off their importance in different dimensions via global channel attention and global spatial attention. Furthermore, we design a within-task feature fusion scheme to capture multi-scale context and improve the gradient flow for the landmark localization subnetwork. At the inference stage, a location reweighting strategy is employed to transform the score maps into 2D landmark coordinates. Extensive experimental results demonstrate that our method performs competitively with recent state-of-the-art methods, achieving normalized mean errors (NMEs) of 3.28%, 1.48%, and 3.43% on the 300W, AFLW, and COFW datasets, respectively.
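To make the inference step and the reported metric concrete, the following is a minimal sketch rather than the method itself. The abstract does not define the location reweighting strategy, so the sketch assumes a softmax-weighted average of pixel locations (a soft-argmax); the function names, the `temperature` parameter, and the choice of normalizing distance `norm` are illustrative assumptions. In practice, NME is normalized by a dataset-dependent distance (for example an inter-ocular distance or the face box size), which the abstract does not specify.

```python
import math


def reweighted_location(score_map, temperature=1.0):
    """Turn one 2D score map (a list of rows) into an (x, y) coordinate.

    Assumption: each pixel location is weighted by a softmax over all
    scores; the reweighting actually used in the paper may differ.
    """
    peak = max(max(row) for row in score_map)  # subtracted for numerical stability
    total = x_sum = y_sum = 0.0
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            weight = math.exp((score - peak) / temperature)
            total += weight
            x_sum += weight * x
            y_sum += weight * y
    return x_sum / total, y_sum / total


def nme(pred, truth, norm):
    """Mean Euclidean landmark error divided by a normalizing distance.

    `norm` is a hypothetical placeholder for the dataset-specific
    normalizer, which is not stated in the abstract.
    """
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(dists) / (len(dists) * norm)
```

Unlike a hard argmax, a weighted average of this kind yields sub-pixel coordinates, which matters when score maps have lower resolution than the input image.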