Comparative Study of Movie Shot Classification Based on Semantic Segmentation

Hui-Yong Bak, Seung-Bo Park

doi:10.3390/app10103390

Abstract

Determining the shot type is an important preliminary task in movie analysis because the chosen shot type conveys a great deal of information, such as the emotions and psychology of the characters and spatial information. Analyzing a wide variety of movies requires a technique that classifies shot types automatically. Previous studies have classified shot types either by the proportion of the screen occupied by a face or with a convolutional neural network (CNN). Methods based on the face proportion cannot classify a shot when no person is on the screen. A CNN can classify shots even when no person is present, but some shots remain unclassifiable because the network relies only on the characteristics and patterns of the image instead of analyzing it semantically. Additional information is therefore needed to approach the image semantically, and semantic segmentation can provide it. In the present study, the performance of shot type classification was improved by applying semantic segmentation as a preprocessing step to frames extracted from movies. Semantic segmentation approaches images semantically and distinguishes the boundary relationships among objects. Representative semantic segmentation technologies include Mask R-CNN and Yolact. Classification performance was compared and evaluated using each of these as a preprocessing step for shot type classification. The average accuracy of shot type classification using frames preprocessed with semantic segmentation increased by 1.9 percentage points, from 93% to 94.9%, compared with classification using frames without such preprocessing. In particular, with ResNet-50 and Yolact, shot type classification accuracy improved by 3 percentage points, from 93% to 96%.

Highlights

In films, shot types are classified based on the distance between the camera and the subject, and the general types are the close-up shot, the medium shot, and the long shot [1,2]. Among these, close-up shots are used to express the emotions and psychology of the characters, with the subject occupying most of the screen.
Yolact uses the full range of image space without compressing the image, resulting in better semantic segmentation performance than Mask R-CNN [25].

Summary

Introduction

Close-up shots are used to express the emotions and psychology of the characters, with the subject occupying most of the screen. Emotion or psychology is expressed through the character’s eyes, mouth, and facial muscles by making the character’s face occupy most of the screen [3]. A portion of the character’s body below the waist or elbow is located at the bottom of the screen. Medium shots are used to express a character’s gaze direction and movement, since the character’s body above the waist appears on the screen [3]. In long shots, the subject occupies about one-sixth of the screen, giving the audience information about the place (inside or outside, in an apartment, a shop, a forest, etc.) and time (day, night, season) [3].
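
The relationship between shot type and the share of the screen taken up by the subject can be illustrated with a minimal Python sketch. This is not the classifier evaluated in the study, which uses a CNN on segmented frames; it is only a rule of thumb built on the proportions quoted above. The function names, the binary-mask input format, and both thresholds are assumptions made for illustration: "most of the screen" is read as more than one half, and the one-sixth figure is assumed to describe long shots.

```python
from fractions import Fraction

# Illustrative thresholds only (assumptions, not values from the study).
CLOSE_UP_MIN = Fraction(1, 2)   # "most of the screen" read as more than half
LONG_SHOT_MAX = Fraction(1, 6)  # "about one-sixth of the screen"


def subject_ratio(mask):
    """Fraction of pixels marked as subject in a binary mask.

    `mask` is a list of rows, each row a list of 0/1 values, such as a
    segmentation model might produce for the main subject (hypothetical input).
    """
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    covered = sum(1 for row in mask for px in row if px)
    return Fraction(covered, total)


def rough_shot_type(mask):
    """Map the subject's screen share to a coarse shot type label."""
    ratio = subject_ratio(mask)
    if ratio > CLOSE_UP_MIN:
        return "close-up"
    if ratio <= LONG_SHOT_MAX:
        return "long"
    return "medium"
```

A rule like this fails when no subject is detected at all, which is one reason a learned classifier over the whole segmented frame is the more robust choice.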

Methods

Results

Discussion

Conclusion