Video understanding is a central goal of several computer vision problems. To achieve this goal, a video is decomposed into a set of key components, and the interactions between these components are modeled. Human action recognition is a challenging instance of video understanding. Modeling a vocabulary of local image features as a bag of visual words (BoW) is a common approach to extracting the components of an action video. Since, in a video recognition task, there is no direct mapping from raw features to class labels, higher-level visual descriptors and, consequently, more accurate dictionaries are required. Therefore, in order to extract intrinsic shape bases and to capture the temporal structure of an action, in this paper we take advantage of group sparse coding methods. In our proposed BoW method, each video is represented as a histogram of the coefficients obtained from group sparse coding. The main contribution of this study is to explore the geometry of action components via structured sparse coefficients of visual words in real time. Compared with conventional BoW models, our approach offers further advantages: much lower quantization error and a higher-level feature representation that reduces model parameters and memory complexity while preserving temporal structure. We evaluate our method on standard human action datasets, including KTH, Weizmann, UCF-Sports, and UCF50. The experimental results show significant improvements over previously reported methods.
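The pipeline described above, encoding each local descriptor against a dictionary with a group-sparsity penalty and pooling the coefficients into a per-video histogram, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary here is random (in practice it would be learned from training descriptors), the group partition and the regularization weight `lam` are arbitrary choices, and the solver is plain ISTA with block soft-thresholding for the ℓ2,1 group-lasso penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_soft_threshold(z, thr, groups):
    """Block soft-thresholding: the proximal operator of the
    l2,1 group-lasso penalty, applied one group of atoms at a time."""
    out = np.zeros_like(z)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > thr:
            out[g] = (1.0 - thr / norm) * z[g]
    return out

def group_sparse_code(x, D, groups, lam=0.1, n_iter=200):
    """ISTA for: min_a 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = group_soft_threshold(a - grad / L, lam / L, groups)
    return a

def video_histogram(descriptors, D, groups, lam=0.1):
    """Sum-pool the group-sparse codes of all local descriptors
    of one video into a single l1-normalised BoW histogram."""
    H = np.zeros(D.shape[1])
    for x in descriptors:
        H += np.abs(group_sparse_code(x, D, groups, lam))
    return H / max(H.sum(), 1e-12)

# Toy demo with synthetic data standing in for local spatio-temporal
# descriptors; dimensions and group count are illustrative only.
d, k, n_groups = 32, 64, 8                 # descriptor dim, dictionary size, groups
D = rng.standard_normal((d, k))
D /= np.linalg.norm(D, axis=0)             # unit-norm dictionary atoms
groups = np.array_split(np.arange(k), n_groups)

descriptors = rng.standard_normal((50, d)) # 50 descriptors from one "video"
hist = video_histogram(descriptors, D, groups)
print(hist.shape)                          # one k-dimensional histogram per video
```

The resulting histograms replace hard vector-quantization assignments: each descriptor contributes fractional weight to a few groups of atoms rather than to a single nearest codeword, which is where the reduced quantization error comes from.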