To address the facial expression recognition (FER) problem, we introduce a novel feature extractor called the coordinate-based neighborhood attention mechanism (CNAM). CNAM takes the input features from a preprocessing unit and applies the coordinate attention (CA) method to capture directional relationships separately along the horizontal and vertical directions. The result is then passed to two residual blocks: one built on the neighborhood attention (NA) mechanism, which captures local interactions among features within the neighborhood of a feature vector, and another containing channel attention implemented by a multilayer perceptron (MLP). We apply the CNAM module to four FER benchmark datasets, namely RAF-DB, AffectNet(7cls), AffectNet(8cls), and CK+. Through qualitative and quantitative analysis, we conclude that inserting the CNAM module can decrease the intra-cluster distances and increase the inter-cluster distances among the high-dimensional feature vectors. CNAM compares well with other state-of-the-art (SOTA) methods: it is the best-performing method on the AffectNet(7cls) and CK+ datasets, and it ranks among the top-performing SOTA methods on the RAF-DB and AffectNet(8cls) datasets.
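To make the intra-cluster and inter-cluster distance claim concrete, the following is a minimal sketch of one way such quantities could be measured on labeled feature vectors. The abstract does not state which definitions the analysis uses, so the choices here are assumptions: intra-cluster distance is taken as the mean Euclidean distance from each vector to its class centroid, and inter-cluster distance as the mean Euclidean distance between pairs of class centroids. The function names `centroid` and `cluster_distances` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_distances(features, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Assumed definitions (not taken from the paper):
      intra = mean Euclidean distance from each vector to its class centroid
      inter = mean Euclidean distance over all pairs of class centroids
    """
    # Group feature vectors by their expression label.
    groups = defaultdict(list)
    for vector, label in zip(features, labels):
        groups[label].append(vector)
    if len(groups) < 2:
        raise ValueError("at least two classes are needed for inter-cluster distance")

    centroids = {label: centroid(vectors) for label, vectors in groups.items()}

    # Average distance of every sample to the centroid of its own class.
    intra = sum(
        math.dist(vector, centroids[label])
        for vector, label in zip(features, labels)
    ) / len(features)

    # Average distance between every unordered pair of class centroids.
    pairs = list(combinations(centroids.values(), 2))
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return intra, inter


# Toy 2-D example with two classes; real FER features are high-dimensional.
toy_features = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
toy_labels = [0, 0, 1, 1]
intra, inter = cluster_distances(toy_features, toy_labels)
```

Under these assumed definitions, a feature extractor that separates expression classes better would yield a smaller `intra` value and a larger `inter` value when the same measurement is run on features extracted with and without the module.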