Because of its wide practical value, behavior recognition has long been a research hotspot in computer vision and pattern recognition. Methods based on local features and the bag-of-words model are now widely used for behavior recognition. However, these methods ignore the spatio-temporal relationships between features, even though local spatio-temporal relationships are very important for behavior representation and recognition. To address this problem, this paper proposes a method for modeling human behavior recognition in surveillance video based on local spatio-temporal relationships. First, each part of the proposed network model is introduced in detail. The proposed model is then compared with recent advanced skeleton-based action recognition methods on several skeleton datasets. Finally, the effectiveness of the proposed method is verified. The experimental results show that, compared with the recognition results reported in the related literature, the features extracted by selecting the starting points of trajectories achieve better recognition performance under the fusion framework.