Video summarization (VS) condenses high-dimensional (HD) video data by extracting only the important information. However, prior research has paid little attention to surveillance VS, which assists video surveillance experts in many applications, including video retrieval and data storage. In addition, mainstream techniques commonly use 2D deep models for VS and thus ignore event occurrences. Accordingly, we present a two-fold 3D deep learning-assisted VS framework. In the first fold, we employ an inflated 3D ConvNet model to extract temporal features, which are then optimized using a proposed encoder mechanism. The input video is temporally segmented using a feature comparison technique so that a single frame can be selected from each video segment. In the second fold, the segmented shots are evaluated using our novel shot segmentation evaluation scheme and are input into a saliency computation mechanism for keyframe selection. Qualitative and quantitative analyses over VS benchmarks and surveillance videos demonstrate the superior performance of our framework, with 0.3- and 4.2-unit increases in the F1 scores for the YouTube and TVSum datasets, respectively. Along with accurate VS, a key contribution of our study is the novel shot segmentation criterion applied prior to VS, which can be used as a benchmark in future research to effectively prioritize HD visual data.
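
To make the feature comparison step concrete, the following is a minimal sketch of one plausible way to segment a video into shots from per-segment feature vectors. It is an illustration under stated assumptions, not the framework's actual implementation: the function names `cosine_similarity`, `segment_shots`, and `pick_representatives`, the cosine similarity measure, the threshold of 0.8, and the middle-frame selection rule are all hypothetical, and the toy two-dimensional vectors stand in for the encoder-optimized inflated 3D ConvNet features.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_shots(
    features: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[Tuple[int, int]]:
    """Split a feature sequence into shots, returned as inclusive (start, end) indices.

    Assumption: a shot boundary is declared wherever the similarity between
    consecutive feature vectors drops below `threshold`.
    """
    if not features:
        return []
    shots = []
    start = 0
    for i in range(1, len(features)):
        if cosine_similarity(features[i - 1], features[i]) < threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(features) - 1))
    return shots


def pick_representatives(shots: Sequence[Tuple[int, int]]) -> List[int]:
    """Pick one index per shot; the middle index is a placeholder rule only."""
    return [(start + end) // 2 for start, end in shots]


# Toy features: the first two vectors are similar, as are the last two.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_shots = segment_shots(toy_features)
toy_keys = pick_representatives(toy_shots)
```

In this sketch the threshold controls segmentation granularity: a higher value yields more, shorter shots. The middle-frame rule only marks where a saliency-based selection, as described for the second fold, would take over.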