Abstract

According to physiological reports, individuals with different levels of depression exhibit different facial dynamic patterns, and researchers have therefore analyzed facial changes to predict depression severity. However, existing methods often rely on facial images or videos of the subjects, which increases the risk of privacy leakage. We instead use the mixer layer to process facial keypoint sequences to predict depression levels. A mixer layer, however, cannot guarantee that its output sequence has the same temporal properties as its input sequence, so adding the two in a residual connection lacks a clear physical meaning. To address this, we construct a PointTransform Network (PTN), for which we propose a Mixer Attention Layer (MAL) and a Token Sequence Aggregation (TSA) module. The MAL embeds the results of channel-communication and token-communication into the sequence as weights through the attention mechanism, which maintains the temporal order among tokens during computation and compensates for the limitations of the mixer layer. The TSA module integrates the depression level predictions of all facial keypoints through the attention mechanism and thereby unifies decision-level fusion and tensor vectorization. Experiments on several benchmark databases demonstrate the effectiveness of our method.
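The two ideas named above, embedding mixing results as weights and fusing per-keypoint predictions through attention, can be illustrated with a minimal sketch. This is not the PTN implementation: the function names `gate_sequence` and `fuse_keypoint_predictions`, the sigmoid gate, and the softmax weighting are all assumptions chosen for illustration, and the abstract does not specify how MAL or TSA compute their weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_sequence(sequence, mixed):
    """Scale each input value by a weight derived from a mixing step.

    sequence: list of T frames, each a list of C features.
    mixed:    same shape; stands in for the output of some channel- or
              token-mixing step (hypothetical, not the paper's layer).

    Frame t of the output is frame t of the input times a weight in (0, 1),
    so the temporal order of the input is kept.
    """
    def sigmoid(value):
        # Equivalent to 1 / (1 + exp(-value)), written with tanh for stability.
        return 0.5 * (1.0 + math.tanh(value / 2.0))

    return [
        [x * sigmoid(m) for x, m in zip(frame, mixed_frame)]
        for frame, mixed_frame in zip(sequence, mixed)
    ]


def fuse_keypoint_predictions(predictions, scores):
    """Attention-weighted average of per-keypoint predictions.

    predictions: one depression-level estimate per facial keypoint.
    scores:      one unnormalized attention score per keypoint (assumed to
                 come from a learned scorer, which is omitted here).
    """
    if len(predictions) != len(scores):
        raise ValueError("predictions and scores must have the same length")
    weights = softmax(scores)
    return sum(w * p for w, p in zip(weights, predictions))
```

With equal scores the fusion reduces to a plain average of the keypoint predictions; a higher score for one keypoint pulls the fused estimate toward that keypoint's prediction.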
