Abstract

Visual Question Answering (VQA) is a fast-evolving field of research in which questions are answered using an image as context. In this paper, we add support for answering emotion-related questions by combining face detection with basic emotion extraction. From the basic emotions, we further derive compound emotions to widen the range of answerable questions. The system operates in two stages: i) caption generation, and ii) question answering with the generated captions as context. Caption generation is itself divided into i) a general caption generator and ii) an emotional caption generator. The general caption generator produces a single sentence giving basic information about the input image, using an Attention-based model [9]. The emotional caption generator then extracts the emotions of the faces in the image using RetinaFace [20] for face detection and DeepFace [1] for basic emotion recognition. Beyond the basic emotions, it also predicts the compound emotion depicted by each face by learning patterns in the facial landmarks produced by MediaPipe [7] with a CNN-based architecture. This module additionally generates a set of known question-answer pairs for use in the next stage. In the question-answering stage, we first use BERT [4] to find an answer, given the question and the generated captions as context. If the confidence of that answer exceeds a threshold, we report it as the final answer; otherwise, we find the known question most similar to the asked question and return its answer as the final answer.
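
The threshold-with-fallback answer selection described above can be sketched as follows. This is a minimal, self-contained sketch of the control flow only: `answer_with_bert` and `similarity` are hypothetical stand-ins (a keyword matcher and Jaccard word overlap), not the paper's BERT model or its actual similarity measure.

```python
def answer_with_bert(question, context):
    # Stand-in for a BERT extractive-QA model returning (answer, confidence).
    # A real system would run the question and context through BERT [4];
    # here we naively keyword-match so the sketch runs on its own.
    if "emotion" in question.lower() and "happy" in context.lower():
        return ("happy", 0.92)
    return ("", 0.10)

def similarity(q1, q2):
    # Stand-in question similarity: Jaccard overlap of word sets.
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def answer(question, captions, known_qa, threshold=0.5):
    """Stage 2: QA over the generated captions; fall back to the most
    similar pre-generated known question when confidence is too low."""
    context = " ".join(captions)
    ans, conf = answer_with_bert(question, context)
    if conf >= threshold:
        return ans
    # Fallback: answer of the known question most similar to the query.
    best_q, best_a = max(known_qa, key=lambda qa: similarity(question, qa[0]))
    return best_a
```

For example, with captions `["A man standing in a park.", "The man's face looks happy."]` and a known pair `("how many faces are in the image", "one")`, an emotion question is answered directly by the QA stand-in, while a face-counting question falls through to the known-question fallback.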
