Artificial intelligence techniques offer promising avenues for exploring human body features from videos, yet no freely accessible tool has reliably provided holistic and fine-grained behavioral analyses to date. To address this, we developed a machine learning tool based on a two-level approach: a first lower-level processing using computer vision for extracting fine-grained and comprehensive behavioral features such as skeleton or facial points, gaze, and action units; a second level of machine learning classification coupled with explainability providing modularity, to determine which behavioral features are triggered by specific environments. To validate our tool, we filmed 16 participants across six conditions, varying according to the presence of a person ("Pers"), a sound ("Snd"), or silence ("Rest"), and according to emotional levels using self-referential ("Self") and control ("Ctrl") stimuli. We demonstrated the effectiveness of our approach by extracting and correcting behavior from videos using two computer vision software (OpenPose and OpenFace) and by training two algorithms (XGBoost and long short-term memory [LSTM]) to differentiate between experimental conditions. High classification rates were achieved for "Pers" conditions versus "Snd" or "Rest" (AUC = 0.8-0.9), with explainability revealing actions units and gaze as key features. Additionally, moderate classification rates were attained for "Snd" versus "Rest" (AUC = 0.7), attributed to action units, limbs and head points, as well as for "Self" versus "Ctrl" (AUC = 0.7-0.8), due to facial points. These findings were consistent with a more conventional hypothesis-driven approach. Overall, our study suggests that our tool is well suited for holistic and fine-grained behavioral analysis and offers modularity for extension into more complex naturalistic environments.
Read full abstract