Introduction: Advances in wearable sensor technology have enabled the collection of biomarkers that may correlate with elevated levels of stress. While significant research has been done in this domain, specifically in using machine learning to detect elevated levels of stress, the challenge of producing a machine learning model capable of generalizing well to new, unseen data remains. The acute stress response has both subjective (psychological) and objectively measurable (biological) components that can be expressed differently from person to person, further complicating the development of a generic stress measurement model. Another challenge is the lack of large, publicly available datasets labeled for stress response that can be used to develop robust machine learning models. In this paper, we first investigate the generalization ability of models built on datasets containing a small number of subjects, recorded in single study protocols. Next, we propose and evaluate methods for combining these datasets into a single, large dataset to study the generalization capability of machine learning models built on larger datasets. Finally, we propose and evaluate the use of ensemble techniques, combining gradient boosting with an artificial neural network, to measure predictive power on new, unseen data. In favor of reproducible research and to assist the community in advancing the field, we make all our experimental data and code publicly available through GitHub at https://github.com/xalentis/Stress. This paper's in-depth study of machine learning model generalization for stress detection provides an important foundation for the further study of stress response measurement using sensor biomarkers recorded with wearable technologies.

Methods: Sensor biomarker data from six public datasets were utilized in this study. Exploratory data analysis was performed to understand the physiological variance between study subjects and the complexity it introduces in building machine learning models capable of detecting elevated levels of stress on new, unseen data. To test model generalization, we developed a gradient boosting model trained on one dataset (SWELL) and tested its predictive power on two datasets previously used in other studies (WESAD, NEURO). Next, we merged four small datasets (SWELL, NEURO, WESAD, UBFC-Phys) to provide a combined total of 99 subjects, and applied feature engineering to generate additional features using statistical summaries computed over 25 s sliding windows. We name this large dataset StressData. In addition, we utilized random sampling on StressData combined with another dataset (EXAM) to build a larger training dataset consisting of 200 synthesized subjects, which we name SynthesizedStressData. Finally, we developed an ensemble model that combines our gradient boosting model with an artificial neural network, and tested it using Leave-One-Subject-Out (LOSO) validation and on two additional, unseen publicly available stress biomarker datasets (WESAD and Toadstool).
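As an illustration of the feature engineering step described above, the following minimal sketch computes statistical summaries over 25 s sliding windows; the function name, biomarker column names, sampling rate, and use of non-overlapping windows are assumptions made for illustration and are not taken from the released code.

```python
# Illustrative sketch only: statistical-summary features over 25 s windows.
# Column names, sampling rate, and the non-overlapping window step are
# assumptions, not the configuration used in the released code.
import pandas as pd

def extract_window_features(signals: pd.DataFrame, fs: int = 4, window_s: int = 25) -> pd.DataFrame:
    """Compute mean, std, min, and max for each biomarker column over
    consecutive 25 s windows of a single subject's recording."""
    window = window_s * fs  # samples per window
    rows = []
    for start in range(0, len(signals) - window + 1, window):
        segment = signals.iloc[start:start + window]
        row = {}
        for col in signals.columns:
            row[f"{col}_mean"] = segment[col].mean()
            row[f"{col}_std"] = segment[col].std()
            row[f"{col}_min"] = segment[col].min()
            row[f"{col}_max"] = segment[col].max()
        rows.append(row)
    return pd.DataFrame(rows)

# Example with hypothetical columns: heart rate and electrodermal activity
# sampled at 4 Hz.
# features = extract_window_features(subject_df[["HR", "EDA"]], fs=4, window_s=25)
```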
Results: Our results show that previous models built on datasets containing a small number (<50) of subjects, recorded in single study protocols, cannot generalize well to new, unseen datasets. Our presented methodology for generating a large, synthesized training dataset, using random sampling to construct scenarios closely aligned with experimental conditions, demonstrates significant benefits. When combined with feature engineering and ensemble learning, our method delivers a robust stress measurement system capable of achieving 85% predictive accuracy on new, unseen validation data, a 25% performance improvement over single models trained on small datasets. The resulting model can be used as either a classification or a regression predictor for estimating the level of perceived stress when applied to specific sensor biomarkers recorded using a wearable device, while the presented methodology further allows researchers to construct large, varied datasets for training machine learning models that closely emulate their exact experimental conditions.

Conclusion: Models trained on small, single study protocol datasets do not generalize well to new, unseen data and lack statistical power. Machine learning models trained on a dataset containing a larger number of varied study subjects capture physiological variance better, resulting in more robust stress detection. Feature engineering assists in capturing this physiological variance, and detection is further improved by ensemble techniques that combine the predictive power of different machine learning models, each capable of learning unique signals contained within the data. While there is a general lack of large, labeled public datasets that can be utilized for training machine learning models capable of accurately measuring levels of acute stress, random sampling techniques can successfully be applied to construct larger, varied datasets from these smaller sample datasets for building robust machine learning models.
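To make the ensemble step concrete, the sketch below blends the class probabilities of a gradient boosting model and a small neural network; the scikit-learn stand-ins, hyperparameters, and equal blending weights are assumptions rather than the configuration used in the study. The blended probability can serve as a continuous (regression-style) stress estimate, while thresholding it yields a binary stress classification.

```python
# Minimal ensemble sketch: average the predictions of a gradient boosting
# model and a neural network. The scikit-learn estimators, hyperparameters,
# and 50/50 blend below are illustrative assumptions, not the study's setup.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_ensemble(X_train, y_train):
    gbm = GradientBoostingClassifier(random_state=42)
    ann = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42),
    )
    gbm.fit(X_train, y_train)
    ann.fit(X_train, y_train)
    return gbm, ann

def predict_ensemble(gbm, ann, X, weight_gbm=0.5):
    # Blend the probability of the "stressed" class from both models.
    proba = (weight_gbm * gbm.predict_proba(X)[:, 1]
             + (1 - weight_gbm) * ann.predict_proba(X)[:, 1])
    labels = (proba >= 0.5).astype(int)  # binary stress label
    return labels, proba                 # proba doubles as a continuous estimate
```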