Generalizing to real-world data remains one of the most difficult challenges in applying machine learning (ML) in practice. Most ML work has focused on improving algorithms and feature representations. Data quality, the foundation of ML, has been largely overlooked, which has also left the field without methods for evaluating and processing data. Motivated by this challenge and need, we selected the prediction of reorganization energy (RE), an important parameter for the charge mobility of organic semiconductors (OSCs) and a difficult prediction task, as a test platform for proposing a data-quality-navigated strategy guided by chemical intuition. We developed a data diversity evaluation based on the structural characteristics of OSC molecules, a reliability evaluation method based on prediction accuracy, a data filtering method based on the uncertainty of K-fold division, and a data split technique that uses clustering and stratified sampling based on four molecular descriptors associated with REs. Consequently, a representative RE data set (15,989 molecules) with high reliability and diversity is obtained. For the feature representation, we propose a complementary strategy that considers the chemical nature of REs, the structural characteristics of OSC molecules, and the model algorithm. In addition, an ensemble framework consisting of two deep learning models is constructed to avoid the risk of a single model becoming trapped in a local optimum. The robustness and generalization of our model are validated against different OSC-like molecules with diverse structures and a wide range of REs, as well as against real OSC molecules, greatly outperforming eight adversarial controls. Collectively, our work not only provides a quick and reliable tool for screening efficient OSCs but also offers methodological guidelines for improving the generalization of ML.