In facial expression recognition (FER), global and local features extracted from the same face often yield different recognition accuracies, indicating that each offers distinct advantages for recognition. Existing FER works usually focus on extracting and fusing global and local features to obtain better recognition results. However, rather than assessing the relative strengths of global and local features before fusion, these methods fuse them in fixed, equal proportions, which can cause the two feature types to suppress each other's representations and thereby degrade recognition ability and scene adaptability. To overcome this weakness, this paper proposes a multi-task joint learning network with constraint fusion (CFNet). To leverage the key features extracted from different tasks, CFNet adopts a multi-loss mechanism and a constraint fusion method that automatically assigns fusion weights according to the importance of global and local facial information. Compared with existing models that employ a direct fusion strategy, CFNet adapts better to FER in complex scenes. Extensive evaluations show the superior effectiveness of CFNet over state-of-the-art methods on real-world emotion datasets. Specifically, the accuracy scores of CFNet on the CK+, MMI, and RAF-DB datasets are 99.07%, 84.62%, and 87.52%, respectively. The robustness of CFNet is also verified in noisy and blurred scenes.
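To make the constraint-fusion idea concrete, below is a minimal sketch of input-dependent weighted fusion under constraints: a small gating network scores the global and local features, and a softmax forces the two weights to be non-negative and sum to 1 so neither branch is fused at a fixed proportion. The module name `ConstraintFusion`, the gating design, and all dimensions are illustrative assumptions, not the authors' actual CFNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstraintFusion(nn.Module):
    """Hypothetical sketch: fuse global and local feature vectors with
    weights produced by a gating network and constrained via softmax
    (w_g, w_l >= 0 and w_g + w_l = 1), so the fusion proportion adapts
    to each input instead of being fixed and equal."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate maps the concatenated global/local features to two scores.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        scores = self.gate(torch.cat([f_global, f_local], dim=-1))
        w = F.softmax(scores, dim=-1)           # constraint: weights sum to 1
        w_g, w_l = w[..., 0:1], w[..., 1:2]
        return w_g * f_global + w_l * f_local   # input-dependent weighted fusion

# Usage: fuse 512-d global and local embeddings for a batch of 8 faces.
fusion = ConstraintFusion(dim=512)
fused = fusion(torch.randn(8, 512), torch.randn(8, 512))
print(fused.shape)  # torch.Size([8, 512])
```

In a multi-loss setup of the kind the abstract describes, such a fused representation would typically be supervised jointly with per-branch losses (e.g., a classification loss on the global branch, the local branch, and the fused output), though the specific loss composition here is an assumption.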