Understanding and predicting viewers’ emotional responses to videos has emerged as a pivotal challenge, with applications in video indexing, summarization, personalized content recommendation, and effective advertisement design. A major roadblock in this domain has been the lack of large-scale datasets of videos paired with viewer-reported emotional annotations. We address this challenge by training a deep learning model on a dataset derived by applying System1’s proprietary methodologies to over 30,000 real video advertisements, each annotated by an average of 75 viewers. This amounts to over 2.3 million emotional annotations across eight distinct categories: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral, together with the temporal onset of each emotion. Our approach operates on 5-second video clips in order to capture pronounced emotional responses. Our convolutional neural network, which integrates both video and audio data, predicts salient 5-second emotional clips with an average balanced accuracy of 43.6%, and performs particularly well at detecting happiness (55.8%) and sadness (60.2%). When applied to full advertisements, the model achieves a strong average AUC of 75% in determining their emotional undertones. To facilitate further research, our trained networks are freely available upon request for research purposes. This work not only overcomes previous data limitations but also provides an accurate deep learning solution for video emotion understanding.
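To make the audiovisual setup concrete, the following is a minimal sketch of a two-stream CNN in the spirit described above: a video branch over a 5-second clip and an audio branch over a spectrogram, fused before an eight-way emotion classifier. The PyTorch framework, layer sizes, input shapes, and class name (`AudioVisualEmotionNet`) are illustrative assumptions of ours, not the paper’s actual architecture.

```python
import torch
import torch.nn as nn

class AudioVisualEmotionNet(nn.Module):
    """Hypothetical two-stream sketch (not the authors' model): a video
    branch and an audio branch whose pooled features are concatenated
    before an 8-way emotion classifier. All layer sizes are assumptions."""

    def __init__(self, num_emotions: int = 8):
        super().__init__()
        # Video branch: 3D convolutions over the frames of a 5-second clip.
        self.video = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        # Audio branch: 2D convolutions over a log-mel spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Late fusion: concatenate the two 16-dim features, then classify.
        self.classifier = nn.Linear(32, num_emotions)

    def forward(self, frames: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, T, H, W); spectrogram: (B, 1, mels, time)
        fused = torch.cat([self.video(frames), self.audio(spectrogram)], dim=1)
        return self.classifier(fused)  # (B, num_emotions) logits

model = AudioVisualEmotionNet()
frames = torch.randn(2, 3, 16, 112, 112)  # 2 clips, 16 RGB frames each
spec = torch.randn(2, 1, 64, 250)         # 2 log-mel spectrograms
logits = model(frames, spec)              # shape (2, 8)
```

Note that balanced accuracy is the mean of per-class recall, so chance level over the eight categories is 1/8 = 12.5%; the reported 43.6% average is therefore well above chance.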