Several Particle Filter (PF)-based methods for human tracking in thermal IR image sequences have been proposed in the literature. Unfortunately, most of these methods track only a single human, who is moreover manually pre-selected in the first frame of the image sequence. This is impractical for an intelligent and efficient video surveillance system, which must track more than one human without any external operator intervention. To help address this need, we propose in this paper a novel PF-based method that detects and tracks multiple moving humans using a thermal IR camera, without prior knowledge of their number or initial locations in the monitored scene. The method consists of three main parts. In the first, all moving objects are extracted from the image sequence using a Gaussian Mixture Model (GMM); then, for each extracted object, a similarity function combining shape, appearance, spatial and temporal information is calculated, which allows a human to be detected without any prior training of a mathematical model. The second part tracks each previously detected human using a PF and an adaptive combination of spatial, intensity, texture and motion velocity cues. For each cue, a model of the detected human is created, and when new observations arrive in subsequent frames, the similarity distances between each created model and the observed moving regions are calculated. Human tracking is achieved by combining the individual similarity distances, using adaptive weights, within a PF algorithm. The third part detects and handles occlusions using simple heuristic rules and the grayscale Vertical Projection Histogram (VPH). Each part of the proposed method was tested separately on a set of real-world thermal IR image sequences containing background clutter, the appearance and disappearance of multiple moving objects, occlusions, and illumination and scale changes. 
A comparative study with several state-of-the-art methods shows that the proposed method performs consistently better in terms of Center Location Error (CLE) and Success Rate (SR), and that it runs at about 15 frames per second per human, which is sufficient for real-time applications.
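
To illustrate the cue-fusion step summarized above, the following minimal Python sketch shows one way that per-cue similarity distances could be combined with adaptive weights and converted into particle weights. The exponential likelihood, the inverse-distance adaptation rule, and all names (`fuse_distances`, `adapt_cue_weights`, `update_particle_weights`, `lam`) are assumptions made for illustration only; the abstract does not specify these details, and the sketch should not be read as the proposed method itself.

```python
import math


def fuse_distances(distances, cue_weights):
    """Weighted sum of per-cue distances (weights normalized to sum to 1)."""
    total = sum(cue_weights)
    if total <= 0:
        raise ValueError("cue weights must have a positive sum")
    return sum((w / total) * d for w, d in zip(cue_weights, distances))


def adapt_cue_weights(distances, eps=1e-9):
    """Hypothetical adaptation rule: a cue whose model currently matches
    better (smaller distance) receives a larger weight."""
    inverse = [1.0 / (d + eps) for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


def update_particle_weights(prior_weights, particle_distances, cue_weights, lam=10.0):
    """Multiply each particle's prior weight by an assumed likelihood
    exp(-lam * fused_distance), then normalize.

    particle_distances[i] holds the per-cue distances (e.g. spatial,
    intensity, texture, velocity) measured for particle i.
    """
    unnormalized = [
        w * math.exp(-lam * fuse_distances(d, cue_weights))
        for w, d in zip(prior_weights, particle_distances)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        n = len(unnormalized)
        return [1.0 / n] * n
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Two particles, two cues; the first particle matches the models better.
    cue_w = adapt_cue_weights([0.1, 0.3])
    weights = update_particle_weights(
        [0.5, 0.5], [[0.1, 0.1], [0.5, 0.5]], cue_w
    )
    print(cue_w, weights)
```

In this sketch, a particle whose region is close to the target models under the more reliable cues ends up with a larger normalized weight, which is the general behaviour one would expect from an adaptive multi-cue PF.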