We present a bootstrapping framework that simultaneously improves multi-person tracking and activity recognition at the individual, interaction, and social group levels. Given the observed pedestrian detections, inference consists of identifying the trajectories of all pedestrian actors, their individual activities, pairwise interactions, and collective activities. Our method uses a graphical model to represent and solve the joint tracking and recognition problems in three stages: (i) activity-aware tracking, (ii) joint interaction recognition and occlusion recovery, and (iii) collective activity recognition. This full-stack problem makes learning the representations for the sub-problems at each stage highly complex, and the complexity grows with the number of stages in the system. Our solution is to use symbolic cues for inference at the higher stages, motivated by the observation of cohesive clusters at each stage; this also avoids learning increasingly ambiguous representations at those stages. High-order correlations among visible and occluded individuals, pairwise interactions, groups, and activities are then resolved via cohesive cluster search within a Bayesian framework. Experiments on several benchmarks show the advantages of our approach over existing methods.