Machine learning (ML) models have received increasing attention as a new approach to the virtual screening of organic materials. Although some ML models trained on large databases have achieved high prediction accuracy, the application of ML to certain classes of organic materials is limited by the small amount of available data. Metalloporphyrins and porphyrins (MpPs) are one such class: they have attracted growing interest as potential photocatalysts, and recent studies have found that both the HOMO/LUMO energy levels and the energy gap are important factors controlling the performance of MpP photocatalysts. Because the training data for MpPs are insufficient and limited to porphyrin-based dyes, in this study we proposed a deep transfer learning approach to rapidly predict the HOMO/LUMO energy levels and energy gaps of MpPs. To complement the open-source Porphyrin-based Dyes Database (PBDD), we curated a new database, the Metalloporphyrins and Porphyrins Database (MpPD), in which the MpPs were specifically designed as potential photocatalysts and the HOMO/LUMO energies were calculated with advanced DFT functionals. We proposed PorphyBERT, a BERT-based regression model that was pre-trained on PBDD and fine-tuned on MpPD. The model predicted the HOMO energy, LUMO energy, and energy gap with RMSEs of 0.0955, 0.0988, and 0.0787 eV and MAEs of 0.0774, 0.0824, and 0.0549 eV, respectively. Furthermore, owing to its unsupervised pre-training phase, the model is not affected by the difference in the functionals used to compute the pre-training and fine-tuning databases. Finally, we recommended 12 MpPs as potential photocatalysts for CO2 reduction, for which the out-of-sample model predictions of the energy gap were close to the values calculated by DFT.