Abstract
Automated face mask classification has gained attention recently, following the COVID-19 mask-wearing regulations. The current state of the art for this problem uses CNN-based methods such as ResNet. However, attention-based models such as Transformers have emerged as an alternative to the status quo. We explored Transformer-based models on the face mask classification task using three architectures: Vision Transformer (ViT), Swin Transformer, and MobileViT, which achieved top-1 accuracy scores of 0.9996, 0.9983, and 0.9969, respectively. We concluded that Transformer-based models have the potential to be explored further, and we recommend that the research community and industry explore their integration with CCTV systems.
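The top-1 accuracy reported above can be sketched as follows; this is a minimal illustration of the metric, not the authors' evaluation code, and the predictions and labels are hypothetical.

```python
# Minimal sketch of the top-1 accuracy metric used to compare the models.
# The example predictions/labels below are hypothetical, for illustration only.

def top1_accuracy(pred_labels, true_labels):
    """Fraction of samples whose highest-scoring class matches the ground truth."""
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    return correct / len(true_labels)

# Hypothetical two-class setup: 0 = "no mask", 1 = "mask"
preds  = [1, 1, 0, 1, 0, 0, 1, 1]
labels = [1, 1, 0, 1, 1, 0, 1, 1]
print(top1_accuracy(preds, labels))  # → 0.875
```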