Predicting contextualised engagement in videos is a long-standing problem that has been popularly attempted by exploiting the number of views or <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">likes</i> using different computational methods. The recent decade has seen a boom in online learning resources, and during the pandemic, there has been an exponential rise of online teaching videos without much quality control. As a result, we are faced with two key challenges. First, how to decide which lecture videos are engaging to intrigue the listener and increase productivity, and second, how to automatically provide feedback to the content creator using which they could improve the content. The quality of the content could be improved if the creators could automatically get constructive feedback on their content. On the other hand, there has been a steep rise in developing computational methods to predict a user engagement score. In this paper, we have proposed a new unified model, CLEFT, that means “Contextualised unified Learning of user Engagement in video lectures with Feedback” that learns from the features extracted from freely available public online teaching videos and provides feedback on the video along with the user engagement score. Given the complexity of the task, our unified framework employs different pre-trained models working together as an ensemble of classifiers. Our model exploits a range of multi-modal features to model the complexity of language, context agnostic information, textual emotion of the delivered content, animation, speaker’s pitch, and speech emotions. Our results support hypothesis that proposed model can detect engagement reliably and the feedback component gives useful insights to the content creator to further help improve the content.
Read full abstract