Abstract

A representative dataset is crucial for building a robust and generalizable machine learning model, especially when the database is small. Distance‐based set partition methods usually do not consider correlation; therefore, distant yet correlated samples might be incorrectly assigned. An improved sample subset partition method based on joint hybrid correlation and diversity x‐y distances (HSPXY) is proposed within the framework of sample set partitioning based on joint x‐y distances (SPXY). In HSPXY, a hybrid distance consisting of both the cosine angle distance and the Euclidean distance in the variable spaces incorporates sample correlation into the distance‐based set partition method. To compare HSPXY with existing partition methods, partial least squares (PLS) regression models are built on four set partition methods: random sampling (RS), Kennard‐Stone (KS), SPXY, and HSPXY. When applied to small chemical databases, models based on the proposed HSPXY algorithm achieved smaller root mean square errors and higher coefficients of determination than those based on the other tested set partition methods, which indicates that the training set is representative. This suggests that the proposed algorithm provides a new option for obtaining a representative calibration set.

Sample subset partition is widely used in machine learning modeling. An improved sample subset partition method based on a hybrid correlation and diversity x‐y distance (HSPXY) is proposed within the framework of SPXY. The cosine angle distance and the Euclidean distance in the variable spaces are used to represent the correlation and the diversity of samples, respectively. To explore the effectiveness of HSPXY, PLS models are built on four set partition methods: RS, KS, SPXY, and HSPXY. The models based on the proposed HSPXY algorithm gave the best overall results among all regression models, which suggests that the proposed algorithm may be taken as an alternative to existing data partition methods.
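The idea of a hybrid distance can be sketched in a few lines of Python. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names, the `weight` parameter, the max-scaling of each distance matrix, the equal-weight sum of the x and y terms, and the use of a purely Euclidean distance for the response are not taken from the paper. The selection step is a standard Kennard‐Stone style max-min procedure applied to the combined matrix, as is done in SPXY.

```python
import numpy as np


def _pairwise_euclidean(a):
    # Euclidean distance between every pair of rows (the "diversity" term).
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def _pairwise_cosine_distance(a):
    # 1 - cos(angle) between every pair of rows (the "correlation" term).
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    unit = a / np.where(norms == 0, 1.0, norms)
    dist = 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
    np.fill_diagonal(dist, 0.0)
    return dist


def _scale(d):
    # Divide by the maximum so that different distances are comparable.
    m = d.max()
    return d / m if m > 0 else d


def hybrid_xy_distance(X, y, weight=0.5):
    # ASSUMPTION: `weight` blends Euclidean and cosine distance in x-space;
    # the paper may combine them differently.
    X = np.asarray(X, dtype=float)
    Y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx = (weight * _scale(_pairwise_euclidean(X))
          + (1.0 - weight) * _scale(_pairwise_cosine_distance(X)))
    # ASSUMPTION: the response contributes a scaled Euclidean distance only.
    dy = _scale(_pairwise_euclidean(Y))
    return dx + dy


def kennard_stone_select(D, n_select):
    # Max-min selection: start from the two most distant samples, then
    # repeatedly add the sample farthest from its nearest selected sample.
    if not 2 <= n_select <= D.shape[0]:
        raise ValueError("n_select must be between 2 and the sample count")
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(D.shape[0]) if k not in selected]
    while len(selected) < n_select:
        min_dist = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

Under these assumptions, `kennard_stone_select(hybrid_xy_distance(X, y), n_train)` would return the indices of a calibration set, and the remaining samples would form the test set. Setting `weight=1.0` drops the cosine term, which leaves an SPXY-like joint Euclidean x‐y distance for comparison.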

