Movie scene segmentation aims to automatically segment a movie into multiple story units, i.e., scenes, each of which is a series of semantically coherent and temporally continuous shots. Previous methods have focused on associating shot semantics, but few distinguish the roles played by foreground characters from those of background scenes in movie shots. In particular, cluttered background scenes can adversely affect scene-boundary classification. Motivated by the fact that it is the characters who drive the plot development of a movie scene, we build a Character Attention Network (CANet) to detect movie scene boundaries in a character-centric fashion. To suppress background clutter, we extract multi-view character semantics for each shot in terms of human bodies and faces. Furthermore, we equip CANet with two stages of character attention. The first is Masked Shot Attention (MSA), which applies selective self-attention over similar temporal contexts of the multi-view character semantics to yield an enhanced omni-view shot representation, enabling CANet to better handle variations in character pose and appearance. The second is Key Character Attention (KCA), which applies temporal-aware attention to character reappearances during Bidirectional Long Short-Term Memory (Bi-LSTM) feature association, so that shot linking focuses on shots with recurring key characters. We further encourage CANet to learn boundary-discriminative shot features. Specifically, we formulate a Boundary-Aware circle Loss (BAL) that pushes apart the features of adjacent scenes; coupled with the cross-entropy loss, it drives CANet features to be sensitive to scene boundaries. Experimental results on the MovieNet-SSeg and OVSD datasets show that our method achieves superior performance in temporal scene segmentation compared with state-of-the-art methods.
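The abstract does not spell out the BAL formulation. As a rough illustration only, below is a minimal PyTorch-style sketch that assumes BAL follows the standard circle loss (Sun et al., 2020), taking within-scene shot pairs as positives and pairs drawn from adjacent scenes as negatives; the function name, the pair-mining scheme, and the hyperparameters `gamma` and `margin` are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def boundary_aware_circle_loss(feats, scene_ids, gamma=32.0, margin=0.25):
    # feats: (T, D) per-shot features along the timeline (assumed shape)
    # scene_ids: (T,) integer scene index of each shot (assumed labels)
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()  # pairwise cosine similarities between shots

    same = scene_ids.unsqueeze(0) == scene_ids.unsqueeze(1)
    adjacent = (scene_ids.unsqueeze(0) - scene_ids.unsqueeze(1)).abs() == 1
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    # positives: pairs within a scene; negatives: pairs across adjacent scenes
    pos, neg = same & ~eye, adjacent
    if not (pos.any() and neg.any()):
        return feats.new_zeros(())

    # standard circle-loss margins and adaptive pair weights (Sun et al., 2020)
    delta_p, delta_n = 1.0 - margin, margin
    alpha_p = (1.0 + margin - sim).clamp(min=0.0)
    alpha_n = (sim + margin).clamp(min=0.0)
    logit_p = -gamma * alpha_p * (sim - delta_p)
    logit_n = gamma * alpha_n * (sim - delta_n)

    # L = log(1 + sum_neg exp(.) * sum_pos exp(.)): pulls within-scene pairs
    # together while pushing adjacent-scene pairs far apart
    return F.softplus(torch.logsumexp(logit_n[neg], dim=0)
                      + torch.logsumexp(logit_p[pos], dim=0))
```

The coupled objective described in the abstract would then take the form `F.cross_entropy(boundary_logits, boundary_labels) + lam * boundary_aware_circle_loss(shot_feats, scene_ids)`, where the weighting `lam` is likewise an assumed hyperparameter.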