Abstract

Lightweight convolutional neural networks are widely used for face detection because their spatial inductive bias and translation invariance allow them to learn local representations efficiently. However, convolutional face detectors struggle to detect faces under challenging conditions such as occlusion, blurring, or changes in facial pose, largely because of their fixed-size receptive fields and lack of global modeling. Transformer-based models, in contrast, excel at learning global representations but are less effective at capturing local patterns. To address these limitations, we propose an efficient face detector that combines convolutional neural network (CNN) and transformer architectures. We introduce a bi-stream structure that integrates CNN and transformer blocks within the backbone network, preserving local pattern features while extracting global context. To further preserve the local details captured by the CNN, we propose a feature enhancement convolution block within a hierarchical backbone structure. In addition, we devise a multiscale feature aggregation module to enhance occluded and blurred facial features. Experimental results demonstrate that our method improves lightweight face detection accuracy, achieving average precision of 95.30%, 94.20%, and 87.56% on the easy, medium, and hard subsets of WIDER FACE, respectively. We therefore believe our method is a useful addition to current artificial intelligence models and will benefit engineering applications of face detection.
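
To make the bi-stream idea concrete, the following is a minimal sketch (not the authors' implementation) of a block that runs a convolutional branch for local patterns in parallel with a self-attention branch for global context and fuses the two. The channel width, number of attention heads, depthwise-separable convolution, and fusion by element-wise addition are illustrative assumptions; the paper's exact block design may differ.

    # Minimal sketch of a bi-stream CNN + transformer block (assumed design).
    import torch
    import torch.nn as nn

    class BiStreamBlock(nn.Module):
        def __init__(self, channels: int = 64, num_heads: int = 4):
            super().__init__()
            # Local branch: depthwise-separable convolution preserves fine detail.
            self.local = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            # Global branch: self-attention over flattened spatial positions.
            self.norm = nn.LayerNorm(channels)
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            local_feat = self.local(x)
            # Flatten the H*W positions into a token sequence for attention.
            tokens = self.norm(x.flatten(2).transpose(1, 2))      # (B, H*W, C)
            global_feat, _ = self.attn(tokens, tokens, tokens)
            global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
            # Fuse the two streams; addition is an assumed fusion rule here.
            return local_feat + global_feat

    if __name__ == "__main__":
        block = BiStreamBlock(channels=64)
        out = block(torch.randn(1, 64, 40, 40))
        print(out.shape)  # torch.Size([1, 64, 40, 40])

Stacking such blocks hierarchically, as the abstract describes, lets each stage keep convolutional detail while attending over the whole feature map.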
