Group activity recognition aims to recognize behaviors that involve multiple individuals within a scene. Existing schemes rely on individual relation inference and usually take the individuals as tokens. In essence, they select the regions of the image most relevant to the group activity while filtering out irrelevant background noise. However, these schemes require individual bounding box labels in both the training and testing stages. Moreover, because individuals are usually represented at a single scale, individuals at multiple scales cannot be combined effectively. In this paper, we present a novel end-to-end hierarchical relation inference framework based on active spatial positions for group activity recognition. The framework is designed to locate active spatial positions and use them as visual tokens when inferring relations for the token embeddings. It requires individual bounding box labels only in the training stage, and it automatically eliminates the background once the active spatial positions have been located in the entire scene. Hierarchical relations can be inferred naturally from the visual tokens at different scales, which further improves performance. Experimental results demonstrate that the proposed framework is competitive with existing schemes that require more labeling effort and computation to generate labels in both the training and testing stages.
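To make the token pipeline described above concrete, the following is a minimal NumPy sketch and not the framework's actual implementation. It assumes a precomputed per-position activity score map, top-k selection of active positions, parameter-free scaled dot-product attention as the relation step, and average pooling to obtain a coarser scale. None of these choices is specified in the text, and all function names (`select_active_tokens`, `relate_tokens`, `downsample`, `hierarchical_group_embedding`) are hypothetical.

```python
import numpy as np

def select_active_tokens(features, activity, k):
    """Return the features at the k positions with the highest activity score.

    features: (H, W, C) feature map; activity: (H, W) score map (assumed given).
    Returns tokens of shape (k, C) and their (row, col) positions of shape (k, 2).
    """
    h, w, _ = features.shape
    top = np.argsort(activity.reshape(-1))[::-1][:k]  # highest scores first
    rows, cols = np.unravel_index(top, (h, w))
    return features[rows, cols], np.stack([rows, cols], axis=1)

def relate_tokens(tokens):
    """Assumed relation step: scaled dot-product self-attention, no learned weights."""
    d = tokens.shape[1]
    logits = tokens @ tokens.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

def downsample(x, factor=2):
    """Average-pool the first two (spatial) axes by `factor`."""
    h, w = x.shape[0] // factor, x.shape[1] // factor
    x = x[:h * factor, :w * factor]
    return x.reshape(h, factor, w, factor, *x.shape[2:]).mean(axis=(1, 3))

def hierarchical_group_embedding(features, activity, k_per_scale=(8, 4)):
    """Select and relate tokens at each scale, then concatenate pooled results."""
    embeddings = []
    for k in k_per_scale:
        tokens, _ = select_active_tokens(features, activity, k)
        embeddings.append(relate_tokens(tokens).mean(axis=0))
        # Move to the next, coarser scale (an assumption of this sketch).
        features, activity = downsample(features), downsample(activity)
    return np.concatenate(embeddings)

# Usage on random data: two scales with 32 channels each.
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 16, 32))
activity = rng.random((16, 16))
embedding = hierarchical_group_embedding(features, activity)
print(embedding.shape)  # (64,)
```

Because only the selected positions become tokens, background positions never enter the relation step, which is the sense in which the background is eliminated without bounding boxes at test time. In a trained model the activity map and the relation weights would be learned rather than fixed as they are here.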