Abstract

Facial expression recognition (FER) is a promising but challenging area of computer vision (CV). Many researchers have devoted significant resources to FER in recent years, but an impediment remains: classifiers perform well on fine-resolution images yet struggle to recognize in-the-wild human emotional states. To address this issue, we introduce three novel designs and implement them in neural networks. Specifically, we build an asymmetric pyramidal network (APNet) that employs multi-scale kernels instead of identically sized kernels, and we replace each square kernel with a sequence of square, horizontal, and vertical convolutions. This structure increases the descriptive ability of convolutional neural networks (CNNs) and transfers multi-scale features between layers. Additionally, we train the CNN with stochastic gradient descent with gradient centralization (SGDGC), which centralizes gradients to have zero mean, making training more efficient and stable. To verify the effectiveness of APNet with SGDGC, we experiment on three of the most popular emotion datasets: FER-2013, CK+, and JAFFE. The results and comparisons with state-of-the-art designs demonstrate that our method outperforms all single-model methods and performs comparably to model-fusion methods.
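
As a concrete illustration of the kernel replacement described above, the following PyTorch sketch stacks a square, a vertical, and a horizontal convolution in sequence. It is a minimal sketch, not the exact APNet definition: the module name, channel counts, and padding choices are our assumptions.

```python
import torch
import torch.nn as nn

class AsymmetricSequence(nn.Module):
    """A k x k convolution followed by k x 1 (vertical) and 1 x k
    (horizontal) convolutions; padding keeps the spatial size unchanged.
    Illustrative sketch only, not the paper's exact layer definition."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        p = k // 2  # "same" padding for odd kernel sizes
        self.square = nn.Conv2d(in_ch, out_ch, (k, k), padding=(p, p))
        self.vertical = nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0))
        self.horizontal = nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p))

    def forward(self, x):
        return self.horizontal(self.vertical(self.square(x)))

x = torch.randn(1, 1, 48, 48)  # a 48 x 48 grayscale face, as in FER-2013
print(AsymmetricSequence(1, 32)(x).shape)  # torch.Size([1, 32, 48, 48])
```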

Highlights

  • According to Mehrabian’s survey [1], verbal components only convey one-third of the information that humans want to express; the other two-thirds are conveyed through non-verbal components

  • In Section II, we review several prior studies by other researchers that focused on asymmetric convolutions and multi-scale blocks or networks, but none of them combined these two techniques and inserted them into a single network

  • When we examined the findings of prior studies [44], [48], [49] alongside Eq. 3, we observed that the only difference between the centralized gradient ∇_GC L(W) and the standard gradient ∇L(W) is a mean value, computed over each weight vector or weight matrix, that is deducted from the gradient (see the sketch after this list)
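
The centralization step is simple to express in code. The sketch below follows the usual gradient-centralization formulation: for any weight tensor with more than one dimension, subtract the mean of its gradient computed over all dimensions except the output-channel one. The hand-rolled SGD step and the learning rate are illustrative assumptions, not the paper's exact optimizer.

```python
import torch

def centralize(grad):
    """Return a zero-mean version of the gradient: for multi-dimensional
    weight tensors, subtract the mean over all dims except dim 0 (the
    output channels). 1-D tensors such as biases are left unchanged."""
    if grad.dim() > 1:
        return grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad

def sgd_gc_step(params, lr=0.01):
    """One plain SGD update using centralized gradients (illustrative)."""
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * centralize(p.grad)
```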


Summary

INTRODUCTION

According to Mehrabian’s survey [1], verbal components only convey one-third of the information that humans want to express; the other two-thirds are conveyed through non-verbal components. Using the asymmetric decomposition technique described above, the Inception-v3 model yielded remarkable results in several sub-fields of CV. Another experiment using asymmetric blocks was conducted by Ma et al. [24], who applied a creative kernel shape in CNNs and called the new architecture RotateConv. Xie et al. [16] found that grouped convolution could improve classification accuracy as well as reduce training time. Cognizant of these two advantages, we adopted a similar strategy to capture features at different levels of layers and applied grouped convolution in APNet. For each layer in APNet, we divided a 3 × 3 block sequence (square kernel, vertical kernel, and horizontal kernel) into 1 group, a 5 × 5 block sequence into 4 groups, and a 7 × 7 block sequence into 8 groups, as sketched below.
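
One such pyramid layer, under assumed channel counts, might look like the following PyTorch sketch. The concatenation used to fuse the three branches is our assumption; the exact fusion in APNet is described in the full text.

```python
import torch
import torch.nn as nn

def branch(ch, k, groups):
    """One pyramid branch: a grouped square, vertical, and horizontal
    convolution sequence with kernel size k (spatial size preserved)."""
    p = k // 2
    return nn.Sequential(
        nn.Conv2d(ch, ch, (k, k), padding=(p, p), groups=groups),
        nn.Conv2d(ch, ch, (k, 1), padding=(p, 0), groups=groups),
        nn.Conv2d(ch, ch, (1, k), padding=(0, p), groups=groups),
    )

class PyramidLayer(nn.Module):
    """Three parallel scales with the group counts quoted above:
    3x3 in 1 group, 5x5 in 4 groups, 7x7 in 8 groups. Channel count
    and branch fusion are assumptions of this sketch."""

    def __init__(self, ch=32):
        super().__init__()
        self.b3 = branch(ch, 3, groups=1)
        self.b5 = branch(ch, 5, groups=4)
        self.b7 = branch(ch, 7, groups=8)

    def forward(self, x):
        # Fusing by concatenation is an assumption, not the paper's design.
        return torch.cat([self.b3(x), self.b5(x), self.b7(x)], dim=1)

x = torch.randn(1, 32, 48, 48)
print(PyramidLayer(32)(x).shape)  # torch.Size([1, 96, 48, 48])
```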

EXPERIMENTS
CONCLUSION