Guided wave ultrasound field measurements capture reflections from aluminothermic welds in the rail track at unknown distances from a transducer. The measured guided wave signals are challenging to interpret because multiple dispersive modes propagate in the rail, environmental conditions change, and noise is present. Data-driven machine learning techniques have shown significant potential on complex signal-processing problems of this kind. This study investigates how the choice of simulated training dataset, used within an appropriate learning framework, affects the ability of these techniques to capture the underlying features of the field measurements and maximise their performance. Simulated spectrograms, which include variations of the signal attributes generally observed in experimental measurements (attenuation, positions of welds, noise, and mode reflection coefficients), allow the reflections of individual modes to be isolated or combined for training. The training data can therefore be presented in three distinct forms. The first dataset has the highest mode reflection information density per sample and consists of simulated spectrograms with reflections of multiple modes from multiple welds, as in experimentally obtained spectrograms. The second dataset consists of spectrograms with reflections of multiple modes, but from only a single weld per spectrogram. The third dataset contains the reflection of a single mode from a single weld in each spectrogram. The data-driven models applied are principal component autoregression and variational auto-encoders. Reconstruction error and latent space interpretability, evaluated on test sets, i.e., unseen data, were used as metrics of the algorithms’ ability to learn.
The results show that datasets with sufficient feature variation and higher mode reflection information density better reconstruct the test sets of simulated and experimental spectrograms. However, training on the third dataset yields more interpretable latent variables for an artificial growing defect attached to the rail. Furthermore, data-driven machine learning methods trained on simulated spectrogram data are useful for reconstructing and learning features from experimental measurements, provided that the training data have representative mode feature variation and noise.
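
To make the reconstruction-error metric concrete, the following is a minimal sketch only, not the models or data used in the study. It assumes plain principal component analysis computed with a singular value decomposition as a stand-in for the principal component model, and it replaces simulated spectrograms with hypothetical toy images made of Gaussian blobs plus noise. The names `toy_spectrogram`, `fit_pca`, and `reconstruction_error`, and all sizes and parameter ranges, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_spectrogram(n_reflections, shape=(16, 32)):
    """Hypothetical stand-in for a simulated spectrogram:
    a few Gaussian blobs ("reflections") plus weak additive noise."""
    f, t = np.mgrid[0:shape[0], 0:shape[1]]
    img = np.zeros(shape)
    for _ in range(n_reflections):
        f0 = rng.uniform(2, shape[0] - 2)   # blob centre, frequency axis
        t0 = rng.uniform(2, shape[1] - 2)   # blob centre, time axis
        amp = rng.uniform(0.5, 1.0)         # assumed reflection amplitude
        img += amp * np.exp(-((f - f0) ** 2 / 4 + (t - t0) ** 2 / 8))
    img += 0.01 * rng.standard_normal(shape)
    return img.ravel()                      # flatten to a feature vector

def fit_pca(X, k):
    """Return the training mean and the first k principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(X, mean, components):
    """Mean squared error after projecting onto the components and back."""
    latent = (X - mean) @ components.T      # latent variables
    X_hat = latent @ components + mean      # reconstruction
    return float(np.mean((X - X_hat) ** 2))

# Train on multi-reflection samples, evaluate on unseen samples.
train = np.stack([toy_spectrogram(3) for _ in range(200)])
test = np.stack([toy_spectrogram(3) for _ in range(50)])

mean, comps = fit_pca(train, k=20)
err_train = reconstruction_error(train, mean, comps)
err_test = reconstruction_error(test, mean, comps)
print(f"train MSE: {err_train:.5f}, test MSE: {err_test:.5f}")
```

Comparing training forms in this toy setting would amount to changing the `n_reflections` argument used to build `train` while keeping `test` fixed. Any such comparison on toy data illustrates the metric only and says nothing about the results reported here.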