Users’ attention spans for online video are short, often as brief as 15 seconds for music or entertainment videos and 6 minutes for lecture videos. This poses a significant challenge for video producers and platform providers seeking to engage users with longer content. One promising solution is to recommend specific fragments within longer videos that match individual user profiles. In this paper, we introduce a novel framework for video fragment recommendation built on three key insights. First, we use a Self-Attention Block to capture the inter-fragment contextual effect, enhancing the relevance of recommendations. Second, we incorporate video-level preferences so that fragment recommendations remain consistent with users’ overall interests. Third, we propose a Self-Attentive Herding Effect (SAHE) module to model the intra-fragment contextual effect, specifically the herding effect of time-sync comments within a fragment. To evaluate the proposed method, we conduct extensive experiments comparing our model against state-of-the-art approaches in terms of NDCG@K and Recall@K. The results show that the model effectively leverages inter-fragment and intra-fragment contextual effects together with video-level preferences, and that it outperforms existing methods. We also carry out further experiments to analyze the key components and parameters of the proposed model, providing additional insight into its performance.
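
For readers unfamiliar with the two ranking metrics named above, the following is a minimal sketch of how Recall@K and NDCG@K are commonly computed for a single user. It assumes binary relevance (a fragment is either relevant or not) and a ranked list without duplicates; the function names and fragment identifiers are illustrative and are not taken from the paper, whose exact evaluation protocol may differ.

```python
import math
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Binary-relevance NDCG@K: DCG of the top-k divided by the ideal DCG."""
    # Each hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k of them) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: four recommended fragments, two of which are relevant.
ranked_fragments = ["f3", "f1", "f7", "f2"]
relevant_fragments = {"f1", "f2"}
recall = recall_at_k(ranked_fragments, relevant_fragments, k=3)
ndcg = ndcg_at_k(ranked_fragments, relevant_fragments, k=3)
```

Both metrics are typically averaged over all test users. Recall@K only counts whether relevant fragments reach the top K, whereas NDCG@K also rewards placing them nearer the top of the list.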