Abstract

Imbalanced data sets are pervasive in materials informatics and pose a challenge to the development of classification models. This work investigates crystal point group prediction as an example of an imbalanced classification problem in materials informatics. Multiple resampling and classification techniques were considered. The findings suggest that, as expected, the most influential parameter of the resampling algorithms is the one controlling the number of samples to omit (undersampling) or synthetically generate (oversampling). Balancing enhances the classification performance for the minority class at the cost of fewer correct predictions for the majority class. Moreover, ideal balancing, where the classes are precisely balanced, is not optimal; partial balancing should be performed instead. In this study, the ideal ratio of the minority to the majority class was found to be around two-thirds. The largest improvement in classification was obtained with the random undersampling technique combined with k-nearest neighbors and random forest classifiers.
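
To make the idea of partial balancing concrete, the following is a minimal sketch of random undersampling toward a target minority-to-majority ratio of two-thirds. It is not the authors' implementation: the function name `random_undersample`, its `ratio` and `seed` parameters, the restriction to two classes, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(X, y, ratio=2 / 3, seed=0):
    """Randomly drop majority-class samples until
    n_minority / n_majority is approximately `ratio`.

    Hypothetical helper for illustration; handles binary labels only.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("this sketch handles exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)

    # Number of majority samples to keep so that n_minor / n_keep ~= ratio.
    n_keep = min(n_major, round(n_minor / ratio))

    rng = random.Random(seed)
    major_idx = [i for i, label in enumerate(y) if label == majority]
    keep = set(rng.sample(major_idx, n_keep))
    # Every minority sample is retained.
    keep.update(i for i, label in enumerate(y) if label == minority)

    idx = sorted(keep)
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data (assumed): 90 majority samples (label 0), 10 minority samples (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10

X_res, y_res = random_undersample(X, y, ratio=2 / 3)
resampled_counts = Counter(y_res)
```

With `ratio=1` the same function would perform ideal (exact) balancing, so the ratio is the single parameter that moves between no balancing, partial balancing, and full balancing. In practice, resampling of this kind would be applied to the training split only, so that the test set keeps the original class distribution.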
