Organometallic compounds (OMCs) have attracted considerable attention in fields such as photovoltaic cells and high-k dielectric applications owing to their beneficial properties. Despite this potential, the adoption of OMCs in industrial applications is hindered by the limited availability of property databases and the absence of efficient surrogate models. To address this, we construct surrogate models based on optimally selected features for predicting the electronic properties of OMCs, using multiscale features and an extensive database. High-throughput calculations were performed to obtain the electronic properties of more than 18k materials generally known as organometallics, augmenting around 12k organic materials obtained from the public open data set OMDB-GAP1. To generate features closely related to OMCs, descriptors encapsulating information ranging from local to global scales were employed together with other widely used composition- and structure-based features (more than 3.5k in total). Among these descriptors, we identified 48 critical features that elucidate the physicochemical underpinnings of OMCs’ properties. The light gradient boosting machine model achieved high-accuracy predictions across the entire database with just 1 % of the total descriptors, performing comparably to the model trained on the entire set (degrading by only around 0.01 in R² score and 0.01 eV in MAE). Furthermore, the efficacy of an active learning process for rapidly finding OMCs with optimal properties was demonstrated: expected improvement outperformed the other methods, identifying 69 % of the target materials while searching only 46 % of the total search space. The constructed platform, together with the high-throughput calculated database, can pave the way for rapid screening of OMCs for targeted industrial applications and offers a comprehensive grasp of the intrinsic properties of OMCs and related compounds.
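For readers unfamiliar with the expected improvement criterion mentioned in the abstract, the following is a minimal sketch of its standard closed form under a Gaussian predictive distribution. It is not the authors' implementation: the function names `expected_improvement` and `pick_next`, the exploration offset `xi`, and the choice of maximization are assumptions made for illustration, and the predictive mean and standard deviation are assumed to come from whatever uncertainty estimate the surrogate model provides.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement for maximization (illustrative assumption).

    mu, sigma: predictive mean and standard deviation for one candidate.
    best: best property value observed so far.
    xi: hypothetical exploration offset, not taken from the paper.
    """
    gain = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return gain * cdf + sigma * pdf


def pick_next(candidates, best):
    """Return the index of the candidate (mu, sigma) with the largest EI."""
    scores = [expected_improvement(m, s, best) for m, s in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

In an active learning loop of this kind, the candidate returned by `pick_next` would be evaluated (here, by calculation), added to the training set, and the surrogate refitted before the next selection. Between two candidates with the same predicted mean, the criterion favors the one with larger uncertainty.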