Facial expression recognition (FER) is a promising but challenging area of computer vision (CV). Researchers have devoted significant resources to FER in recent years, but an impediment remains: classifiers perform well on high-resolution images yet have difficulty recognizing human emotional states in the wild. To address this issue, we introduced three novel designs and implemented them in neural networks. More specifically, we utilized an asymmetric pyramidal network (APNet) and employed multi-scale kernels instead of kernels of identical size. In addition, square kernels were replaced by a sequence of square, horizontal, and vertical convolutions. This structure increases the descriptive ability of convolutional neural networks (CNNs) and transfers multi-scale features between layers. Additionally, when training the CNN, we adopted stochastic gradient descent with gradient centralization (SGDGC), which centralizes gradients to have zero mean and makes the training process more efficient and stable. To verify the effectiveness of APNet with SGDGC, we ran experiments on three of the most popular in-the-wild emotion datasets: FER-2013, CK+, and JAFFE. Our experimental results and comparisons with state-of-the-art designs demonstrate that our method outperforms all single-model methods and performs comparably to model-fusion methods.
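The gradient centralization step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `centralize_gradient` and `sgdgc_step` are hypothetical, and the sketch assumes that axis 0 of a weight gradient indexes output channels, that the mean is taken over the remaining axes, and that one-dimensional parameters such as biases are left unchanged. The abstract does not specify these details.

```python
import numpy as np


def centralize_gradient(grad):
    """Subtract the per-output-channel mean from a weight gradient.

    Assumption: axis 0 indexes output channels, so the mean is taken
    over all remaining axes. 1-D gradients (e.g. biases) are returned
    unchanged.
    """
    if grad.ndim <= 1:
        return grad
    axes = tuple(range(1, grad.ndim))
    return grad - grad.mean(axis=axes, keepdims=True)


def sgdgc_step(weight, grad, lr=0.01):
    """One plain SGD update applied to the centralized gradient.

    Momentum and weight decay are omitted to keep the sketch short.
    """
    return weight - lr * centralize_gradient(grad)


# Example: a gradient for a conv layer with 8 output channels,
# 3 input channels, and 3x3 kernels (shapes chosen for illustration).
rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 3, 3, 3))
centered = centralize_gradient(grad)
```

After centralization, each output channel's slice of the gradient has zero mean, which is the property the abstract attributes to SGDGC.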