Assessment of tunnelling-induced building damage is a complex Soil-Structure Interaction (SSI) problem, influenced by numerous geometric and material parameters of both the soil and the structures, and characterised by strongly non-linear behaviour. There is currently a trend towards developing data-driven models using Machine Learning (ML) to capture this complex behaviour. Given the scarcity of real data, which typically comes from specific case studies, many researchers have turned to creating extensive synthetic datasets via sophisticated, validated numerical models such as the Finite Element Method (FEM). However, developing these datasets and training advanced ML algorithms pose significant challenges. Relying solely on parameter domains and ranges derived from case studies can lead to imbalanced data distributions and, consequently, poor model performance in sparsely populated regions. In this paper, we introduce a strategy for designing optimal high-confidence datasets through an iterative procedure. The process begins with a systematic literature review to determine the importance of the parameters, their ranges, and their dependencies as they pertain to building damage induced by SSI. Starting with several hundred FEM simulations, we generate an initial dataset and assess its quality and impact through Sensitivity Analysis (SA) studies, statistical modelling, and re-sampling in statistically significant regions. This evaluation allows us to refine the model's input space, seeking scenarios that mitigate output distribution imbalances. The procedure is repeated until the datasets achieve a satisfactory balance for training metamodels, effectively minimising bias. Our findings highlight the success of this approach in identifying an optimal and feasible input space that significantly reduces imbalanced distributions of output features.
This approach not only proves effective in our study but also offers a versatile methodology that could be adapted to other disciplines aiming to generate high-quality synthetic datasets.