Smart video surveillance plays a significant role in public security by storing large amounts of continuous streaming data, evaluating it, and generating alerts when undesirable human activities occur. Human activity recognition in video surveillance faces many challenges, such as the optimal evaluation of human activities under growing volumes of streaming data, which entails complex computation and high processing-time demands. To tackle these challenges, we introduce a lightweight spatial-deep feature integration method using a multilayer GRU (SDIGRU). First, we extract spatial and deep features from frame sequences of realistic human activity videos using a lightweight MobileNetV2 model and then integrate these spatial-deep features. Although deep features can be used for human activity recognition, they capture only high-level appearance, which is insufficient to correctly identify a particular human activity. Thus, we jointly apply deep information with spatial appearance to produce detailed-level information. Furthermore, we select rich informative features from the spatial-deep appearances. We then train a multilayer gated recurrent unit (GRU), feeding the informative features into it to learn the temporal dynamics of human activity frame sequences at each time step of the GRU. We conduct our experiments on the benchmark YouTube11, HMDB51, and UCF101 human activity recognition datasets. The empirical results show that our method achieves significant recognition performance with low computational complexity and quick response. Finally, we compare our results with existing state-of-the-art techniques, demonstrating the effectiveness of our method.