Rapid advances in computer vision technology offer significant potential for the automatic recognition of learner engagement in E-learning. We conducted a two-stage experiment to assess learner engagement from behavioural (externally observable) and physiological (internal) cues. Using computer vision technology and wearable sensors, we extracted three feature sets: action, head posture and heart rate variability (HRV). We then integrated our YOLOv5s–MediaPipe behaviour detection model with an HRV-based physiological detection model to comprehensively evaluate learners' behavioural, affective and cognitive engagement. Additionally, we developed a method and criteria for assessing distraction from behaviour, ultimately producing a comprehensive, efficient, low-cost and easy-to-use system for the automatic recognition of learner engagement. Experimental results showed that our improved YOLOv5s model achieved a mean average precision of 92.2% while halving both the parameter count and the model size. Unlike other deep-learning-based methods, our MediaPipe–OpenCV approach to head-posture analysis offers strong real-time performance and is lightweight and easy to deploy. Our proposed long short-term memory (LSTM) classifier, trained on sensitive HRV metrics after normalisation, demonstrated satisfactory performance on the test set, with an accuracy of 80%, precision of 81%, recall of 80% and an F1 score of 80%.
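
To illustrate how the behaviour-detection stage can be wired up, the sketch below loads a custom-trained YOLOv5s checkpoint through the standard Ultralytics torch.hub interface. The weights file `learner_behaviour.pt` is a hypothetical placeholder, not the authors' released model; only the loading and inference API shown is the standard YOLOv5 one.

```python
# Minimal sketch: running a custom YOLOv5s behaviour detector on a video frame.
# "learner_behaviour.pt" is a hypothetical checkpoint standing in for the
# paper's improved YOLOv5s model.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="learner_behaviour.pt")
results = model("frame.jpg")            # inference on a single captured frame
results.print()                         # summary of detected behaviour classes
detections = results.pandas().xyxy[0]   # bounding boxes, class labels, confidences
```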
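The head-posture component pairs MediaPipe's Face Mesh with OpenCV's solvePnP, the common lightweight pattern that a MediaPipe–OpenCV pipeline suggests. In the sketch below, the generic 3D face-model points, the Face Mesh landmark indices and the approximate camera intrinsics are our assumptions for illustration, not values from the paper.

```python
# Sketch: head-pose estimation from MediaPipe Face Mesh landmarks via solvePnP.
import cv2
import numpy as np
import mediapipe as mp

# Generic 3D reference points of a face model (assumed values, in arbitrary units):
# nose tip, chin, left/right eye outer corners, left/right mouth corners.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),
    (0.0, -330.0, -65.0),
    (-225.0, 170.0, -135.0),
    (225.0, 170.0, -135.0),
    (-150.0, -150.0, -125.0),
    (150.0, -150.0, -125.0),
], dtype=np.float64)
LANDMARK_IDS = [1, 152, 33, 263, 61, 291]  # commonly used Face Mesh indices (assumed)

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def head_pose(frame):
    """Return (rvec, tvec) of the head for one BGR frame, or None if no face."""
    h, w = frame.shape[:2]
    result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    lm = result.multi_face_landmarks[0].landmark
    image_points = np.array(
        [(lm[i].x * w, lm[i].y * h) for i in LANDMARK_IDS], dtype=np.float64)
    # Approximate pinhole intrinsics from the image size (no calibration).
    camera_matrix = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]],
                             dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, camera_matrix,
                                  np.zeros(4))
    # cv2.Rodrigues(rvec) yields the rotation matrix, from which yaw/pitch/roll
    # angles (usable in distraction criteria) can be derived.
    return (rvec, tvec) if ok else None
```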
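Finally, a minimal PyTorch sketch of an LSTM classifier over windows of normalised HRV metrics. The feature count, hidden size, window length and number of engagement classes are assumed values chosen for illustration; the abstract does not specify the authors' exact architecture or normalisation scheme.

```python
# Sketch: LSTM classification of engagement from normalised HRV sequences.
import torch
import torch.nn as nn

def minmax_normalise(x, eps=1e-8):
    """Per-feature min-max normalisation over the time axis of each window."""
    lo = x.min(dim=1, keepdim=True).values
    hi = x.max(dim=1, keepdim=True).values
    return (x - lo) / (hi - lo + eps)

class HRVEngagementLSTM(nn.Module):
    def __init__(self, n_features=6, hidden_size=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x: (batch, time_steps, n_features) of HRV metrics per window
        _, (h_n, _) = self.lstm(x)    # final hidden state summarises the window
        return self.head(h_n[-1])     # engagement-class logits

model = HRVEngagementLSTM()
windows = minmax_normalise(torch.randn(8, 30, 6))  # e.g. 8 windows of 30 steps
logits = model(windows)                            # shape: (8, 3)
```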