Applying machine-learning techniques to imbalanced data sets presents a significant challenge in materials science because the underrepresented characteristics of minority classes are often buried by the abundance of unrelated characteristics in the majority classes. Existing approaches address this by balancing the number of samples in each class through oversampling or synthetic data generation. However, these methods can lead to loss of valuable information or to overfitting. Here, we introduce a deep learning framework for predicting minority-class materials, specifically metal-insulator transition (MIT) materials. The proposed approach, termed boosting-CGCNN, combines the crystal graph convolutional neural network (CGCNN) model with a gradient-boosting algorithm. The model effectively handled the extreme class imbalance in MIT material data by sequentially building a deeper neural network. Comparative evaluations demonstrated that the proposed model outperformed other approaches. Our approach is a promising solution for handling imbalanced data sets in materials science.