Directly applying large language models to material and molecular design is not straightforward, particularly in real-world scenarios where experimentally validated accuracy is required. In this study, we propose a multimode descriptor design method for materials prediction and analysis that combines the advantages of a literature-based natural language processing model and density functional theory (DFT) calculations, assisted by a genetic algorithm (GA). A case study on predicting the aqueous photocurrents of multisolvent-engineered halide perovskite CH3NH3PbI3 is performed, and follow-up validation experiments demonstrate the improved accuracy of the multimode descriptors: the GA achieves an unprecedented experimental validation accuracy of 87.5% for predicting aqueous photocurrents of perovskite materials (cf. only 50% experimental accuracy for other common machine learning models). The improved experimental accuracy of the descriptors is attributed to the successful deployment of a language model that incorporates concise scientific information from >1 million articles into molecular descriptors, in combination with DFT calculations. Subsequent machine learning analysis highlights the importance of cation···π interactions and crystallization in molecule-modified halide perovskite materials, providing ontological and conceptual understanding. Importantly, the genetic process affords an accurate "white-box" model of perovskite stability (accuracy = 90.2% for the test data set and 92.3% for the training data set), expressed as a mathematical equation in F1–F5, where F1–F5 represent atomic-level structural and chemical details such as cation···π interactions and highest occupied molecular orbital levels. This study offers a feasible descriptor design route for accurately predicting complex material properties, leveraging both language models and density functional theory.
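As a rough illustration of how a genetic algorithm can select a small set of descriptors for a transparent ("white-box") rule, the following minimal sketch may be helpful. It is not the method used in this study: the toy data, the sum-threshold rule, the function names (`make_toy_data`, `accuracy`, `evolve`), and all GA settings (population size, elitism, single-point crossover, mutation rate) are assumptions chosen only for illustration.

```python
import random

def make_toy_data(rng, n_samples=200, n_features=8):
    """Synthetic stand-in data (assumption): the label depends only on features 0 and 3."""
    X = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [1 if row[0] + row[3] > 0 else 0 for row in X]
    return X, y

def accuracy(mask, X, y):
    """Accuracy of a transparent rule: predict 1 if the sum of the selected descriptors is > 0."""
    if not any(mask):
        return 0.0  # an empty descriptor set is treated as useless
    correct = 0
    for row, label in zip(X, y):
        score = sum(v for v, keep in zip(row, mask) if keep)
        correct += int((score > 0) == bool(label))
    return correct / len(y)

def evolve(X, y, rng, pop_size=30, generations=40, mutation_rate=0.1):
    """Evolve 0/1 masks over the descriptors; returns the best mask and its accuracy."""
    n = len(X[0])
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: accuracy(m, X, y), reverse=True)
        elite = ranked[: pop_size // 2]  # keep the better half unchanged (elitism)
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randint(1, n - 1)  # single-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit with probability mutation_rate
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = elite + children
    best = max(population, key=lambda m: accuracy(m, X, y))
    return best, accuracy(best, X, y)

rng = random.Random(0)  # fixed seed so the run is repeatable
X, y = make_toy_data(rng)
best_mask, best_acc = evolve(X, y, rng)
print(best_mask, best_acc)
```

In this sketch each candidate is a 0/1 mask over the descriptors, and the fitness is the accuracy of the rule built from the selected ones, so the final model stays readable as an explicit expression in a few features. In a real setting the fitness would more reasonably be evaluated on held-out data rather than on the data used for selection.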