Recognizing individuals of interest from faces captured with video cameras raises several challenges linked to changes in capture conditions (e.g., variation in illumination and pose). Moreover, in person re-identification applications, the facial models needed for matching are typically designed a priori, from a limited number of reference samples captured under constrained temporal and spatial conditions. Tracking can, however, be used to group the system responses linked to a facial trajectory (facial captures from a person) for robust spatio-temporal recognition, and to update facial models over time using operational data. In this paper, an adaptive ensemble-based system is proposed for spatio-temporal face recognition (FR). Given a diverse set of facial captures in a trajectory of a target individual, an ensemble of 2-class classifiers is designed. A pool of ARTMAP classifiers is generated using a dynamic PSO-based learning strategy, and classifiers are selected and combined using Boolean combination. To train classifiers, target samples are combined with a set of reference non-target samples selected from the cohort and universal models using One-Sided Selection. During operations, facial trajectories are captured, and each individual-specific ensemble of the system seeks to detect its target individual and, possibly, to self-update the corresponding facial model. To update an ensemble, a learn-and-combine strategy is employed to avoid knowledge corruption, and a memory management strategy based on Kullback-Leibler divergence ranks and selects stored validation samples over time to bound the system's memory consumption. Spatio-temporal fusion is performed by accumulating classifier predictions over a time window, and a second threshold triggers the self-update of facial models. The proposed system was validated with videos from the Face in Action and COX-S2V datasets, which feature both abrupt and gradual patterns of change.
At the transaction level, results show that the proposed system increases AUC accuracy by about 3% for scenarios with abrupt changes, and by about 5% for scenarios with gradual changes. Subject-based analysis reveals the difficulty of recognizing faces under different poses, which affects lamb- and goat-like individuals more significantly. Compared to reference spatio-temporal fusion approaches, results show that the proposed accumulation scheme produces the highest discrimination.
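The accumulation scheme with two thresholds can be illustrated with a minimal sketch. It assumes that each ensemble emits one binary decision per facial capture, that the accumulated score is the count of positive decisions in a sliding window, and that the update threshold is set above the detection threshold. The function name, window size, and threshold values are hypothetical and chosen only for illustration; they are not the settings used in the experiments.

```python
from collections import deque


def accumulate_trajectory(predictions, window=10, detect_threshold=5,
                          update_threshold=8):
    """Accumulate binary ensemble decisions over a sliding time window.

    predictions: iterable of 0/1 decisions, one per facial capture in a
    trajectory (hypothetical input format).
    Returns (detected, self_update, peak_score).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` decisions
    detected = False
    self_update = False
    peak_score = 0

    for decision in predictions:
        recent.append(int(decision))
        score = sum(recent)  # positive decisions inside the current window
        peak_score = max(peak_score, score)
        if score >= detect_threshold:
            detected = True      # first threshold: target detected
        if score >= update_threshold:
            self_update = True   # second, stricter threshold: allow self-update

    return detected, self_update, peak_score


if __name__ == "__main__":
    trajectory = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
    print(accumulate_trajectory(trajectory))
```

Under these assumptions, a trajectory with a few isolated positive decisions never reaches either threshold, while a trajectory with sustained positive decisions first triggers detection and, only if the evidence keeps accumulating, self-update.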