Current action recognition studies enjoy the benefits of two neural network branches, spatial and temporal. This work aims to extend the previous work by introducing a fusion of spatial and temporal branches to provide superior action recognition capability toward multi-label multi-class classification problems. In this paper, we propose three fusion models with different fusion strategies. We first build several efficient temporal Gaussian mixture (TGM) layers to form spatial and temporal branches to learn a set of features. In addition to these branches, we introduce a new deep spatio-temporal branch consisting of a series of TGM layers to learn the features that emerged from the existing branches. Each branch produces a temporal-aware feature that assists the model in understanding the underlying action in a video. To verify the performance of our proposed models, we performed extensive experiments using the well-known MultiTHUMOS benchmarking dataset. The results demonstrate the importance of our proposed deep fusion mechanism, contributing to the overall score while keeping the number of parameters small.