Abstract

In the current digital era, vast amounts of visual data are generated by a variety of multimedia sources. Techniques are therefore needed to produce clear, accurate video summaries that highlight the most informative segments of the content. However, varying video resolutions, multi-dimensional feature representations, and massive storage requirements make keyframe extraction difficult, so many uninformative features must be discarded to retain only the distinctive qualities of each video. A novel deep-learning-based approach is therefore proposed: frames are first extracted at 25 frames per second; objects are then detected in the extracted frames using YOLOv5, and only frames containing the target object are processed further, reducing computation time and the dependence on high-speed computing hardware. Features are subsequently obtained using VGG-16 and Object of Interest (OoI)-based ResNet-50, and the two are compared to identify the better solution. The extracted features are compressed using unsupervised Principal Component Analysis (PCA), which reduces dimensionality while minimizing information loss. The best value of K is obtained through a comprehensive evaluation using the Silhouette score, and candidate frames with the maximum mean and standard deviation are extracted from each K-means cluster. The Pearson Correlation Coefficient (PCC) is applied as a post-processing step to remove redundant frames from the candidates and extract the final keyframes. Experiments on the benchmark office dataset from industrial surveillance show that the proposed approach outperforms state-of-the-art models in terms of recall.
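To make the later stages of the pipeline concrete, the following is a minimal Python sketch, not the authors' implementation, of the PCA compression, Silhouette-guided K-means clustering, candidate selection, and PCC-based redundancy removal described above. The array name `features`, the candidate K range, the 95% retained-variance target, and the 0.9 correlation threshold are illustrative assumptions; per-frame CNN features from the YOLOv5-filtered frames are presumed to be available already.

    # Minimal sketch (not the paper's implementation) of the clustering and
    # redundancy-removal stages. Assumes per-frame CNN features (e.g. from
    # VGG-16 or OoI-based ResNet-50) are stacked in a NumPy array
    # `features` of shape (num_frames, feature_dim).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def select_keyframes(features, k_range=range(2, 11), corr_threshold=0.9):
        # Compress features with PCA, keeping 95% of the variance
        # (the retained-variance target is an assumed value).
        reduced = PCA(n_components=0.95).fit_transform(features)

        # Choose K by maximizing the Silhouette score over a candidate range.
        best_k, best_score = 2, -1.0
        for k in k_range:
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=0).fit_predict(reduced)
            score = silhouette_score(reduced, labels)
            if score > best_score:
                best_k, best_score = k, score
        labels = KMeans(n_clusters=best_k, n_init=10,
                        random_state=0).fit_predict(reduced)

        # From each cluster, pick the frame whose reduced feature vector has
        # the largest mean plus standard deviation as the candidate keyframe
        # (one plausible reading of the selection rule in the abstract).
        candidates = []
        for c in range(best_k):
            idx = np.flatnonzero(labels == c)
            scores = reduced[idx].mean(axis=1) + reduced[idx].std(axis=1)
            candidates.append(int(idx[np.argmax(scores)]))

        # Post-process with the Pearson Correlation Coefficient: discard any
        # candidate highly correlated with a keyframe already kept.
        keyframes = []
        for i in sorted(candidates):
            if all(abs(np.corrcoef(reduced[i], reduced[j])[0, 1]) <= corr_threshold
                   for j in keyframes):
                keyframes.append(i)
        return keyframes

In a full pipeline, the returned indices would map back to the timestamps of the surviving frames, yielding the final keyframe summary.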
