Abnormal crowd behavior detection is an important research problem in computer vision. However, complex real-life conditions (e.g., severe occlusion and overcrowding) still limit the effectiveness of existing algorithms. Methods based on spatio-temporal cuboids have recently become popular in video analysis. To our knowledge, existing methods always extract spatio-temporal cuboids at random from a video sequence, and determine the size of each cuboid and the total number of cuboids empirically. As a result, the extracted features either contain redundant information or lose much important information, which severely degrades accuracy. In this paper, we propose an improved method in which the spatio-temporal cuboids are no longer chosen arbitrarily but are determined by the information contained in the video sequence. Cuboids are extracted from the video sequence with adaptive size, and both the total number of cuboids and the extraction positions are determined automatically. Moreover, to compute the similarity between two spatio-temporal cuboids of different sizes, we design a novel codebook data structure, constructed as a set of two-level trees. Experimental results show that the false positive and false negative rates are significantly reduced.

Keywords: Codebook, latent Dirichlet allocation (LDA), social force model, spatio-temporal cuboid.