Research on automatic emotion recognition from speech has recently focused on predicting time-continuous emotion dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. The automatic prediction of such emotions, however, poses several challenges, such as the subjectivity involved in defining a gold standard from a pool of raters and the scarcity of data for training models. In this work, we introduce a novel emotion recognition system based on ensembles of single-speaker regression models. The emotion estimate is obtained by combining a subset of the initial pool of single-speaker regression models, selecting those that are most concordant with one another. The proposed approach allows speakers to be added to or removed from the ensemble without rebuilding the entire recognition system. The simplicity of this aggregation strategy, the flexibility afforded by the modular architecture, and the promising results observed on the RECOLA database highlight the potential of the proposed method in real-life scenarios, and in particular in web-based applications.
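The selection-by-concordance idea described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes the concordance correlation coefficient (CCC, a standard agreement measure for dimensional emotion prediction) as the concordance criterion, an exhaustive search over subsets of a fixed size `k`, and simple averaging as the aggregation; the function names `select_concordant` and `ensemble_estimate` are hypothetical.

```python
import numpy as np
from itertools import combinations

def ccc(x, y):
    """Concordance correlation coefficient between two prediction series."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def select_concordant(predictions, k):
    """Pick the k single-speaker models whose predictions agree most.

    predictions: list of 1-D arrays, one time-continuous prediction per model.
    Returns the index tuple of the subset with the highest mean pairwise CCC.
    """
    best_subset, best_score = None, -np.inf
    for subset in combinations(range(len(predictions)), k):
        score = np.mean([ccc(predictions[i], predictions[j])
                         for i, j in combinations(subset, 2)])
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset

def ensemble_estimate(predictions, k):
    """Average the predictions of the most concordant subset of models."""
    subset = select_concordant(predictions, k)
    return np.mean([predictions[i] for i in subset], axis=0)
```

Because each single-speaker model is scored and combined independently, adding or removing a speaker only changes the pool passed to `select_concordant`; no retraining of the remaining models is needed, which is the modularity the abstract emphasizes.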