Abstract

Data quality strongly affects the performance of machine learning models for predicting gravelly soil liquefaction. This study therefore quantifies the impact of three data quality dimensions (uncertainty, uniqueness, and outliers) on the performance of six gravelly soil liquefaction models and proposes strategies for data collection and selection before modeling. The results show that outliers in the liquefaction database reduce the generalization performance of the prediction model. Duplicate samples in the training database, however, do not harm generalization performance when the repetition rate is below 10%. The uncertainty of the liquefaction samples was assessed using the coefficient of variation, which was used to sort the samples into four classes (A, B, C, and D) representing different degrees of uncertainty. When data diversity and uncertainty are balanced, good generalization performance can be achieved when class A comprises 10% to 20%, class B 70% to 80%, and class C 5% to 10% of the total samples. The best generalization performance, however, is achieved when only classes A and B are used, at a sample size ratio of approximately 1:1. These findings provide sample selection strategies to apply before constructing a liquefaction model and can improve the model's generalization performance.
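
As a rough illustration of the uncertainty grading described above, the following Python sketch computes a coefficient of variation (sample standard deviation divided by the mean) and maps it to one of four classes. The abstract does not state the class boundaries or say which class carries the least uncertainty, so the cut-off values in `ASSUMED_THRESHOLDS`, the assumption that class A is the least uncertain, and all function names are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

# Hypothetical upper bounds on the coefficient of variation for classes A to C.
# The abstract does not give the actual boundaries; anything above the last
# bound is assigned to class D. Class A is ASSUMED to be the least uncertain.
ASSUMED_THRESHOLDS = [(0.1, "A"), (0.2, "B"), (0.3, "C")]


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / abs(m)


def uncertainty_class(cov, thresholds=ASSUMED_THRESHOLDS):
    """Map a coefficient of variation to a class label using assumed cut-offs."""
    for upper, label in thresholds:
        if cov < upper:
            return label
    return "D"


def class_shares(labels):
    """Return the fraction of samples in each class, for comparing a candidate
    training set against a target class mix."""
    total = len(labels)
    return {c: labels.count(c) / total for c in "ABCD"}
```

A candidate training set could then be checked with `class_shares` against the proportions reported above, for example roughly equal shares of classes A and B when classes C and D are excluded.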
