Abstract

There is growing interest in applying machine learning techniques to materials science research. However, although materials datasets are recognized to be typically smaller and sometimes more diverse than those in other fields, the influence of data availability on the training of machine learning models has not yet been studied, which hinders the establishment of accurate predictive rules from small materials datasets. Here we analyze the fundamental interplay between the availability of materials data and the predictive capability of machine learning models. Rather than affecting model precision directly, the data size acts through the degree of freedom (DoF) of the model, giving rise to an association between precision and DoF. The appearance of this precision–DoF association signals underfitting and is characterized by a large prediction bias, which in turn limits accurate prediction in unknown domains. We propose incorporating a crude estimation of the property into the feature space when building ML models from small materials datasets, which increases prediction accuracy without the cost of a higher DoF. In three case studies (predicting the band gap of binary semiconductors, lattice thermal conductivity, and the elastic properties of zeolites), integrating the crude estimation boosted the predictive capability of the machine learning models to state-of-the-art levels, demonstrating the generality of the proposed strategy for constructing accurate machine learning models from small materials datasets.

Highlights

  • In the past few decades, the substantial advancement of machine learning (ML) has extended the application of this data-driven approach throughout science, commerce, and industry.[1]

  • We propose a solution that improves accuracy without raising the degree of freedom (DoF): incorporating a crude estimation of property (CEP) in the feature space.

  • All these studies used available datasets of around 100 examples, which in our opinion represents a lower limit for applying ML in materials research. Although the studies varied in data source, method of obtaining the CEP, feature-selection algorithm, and regression method, including the CEP as a descriptor effectively boosted the predictive capability in each, with scaled errors well below the trend observed in the aforementioned survey. This demonstrates the capability of the proposed strategy to construct accurate ML models from small available materials datasets.
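
The idea of adding a CEP to the feature space can be illustrated with a minimal sketch. It is not the authors' workflow: the target function, the bias and noise of the crude estimate, the descriptor names `x1` and `x2`, and the helper `compare_with_and_without_cep` are all assumptions made for illustration. Both linear models below have the same number of fitted coefficients, so any gain comes from the CEP descriptor rather than from a higher DoF.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply a coefficient vector produced by fit_linear to new samples."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def compare_with_and_without_cep(n_samples=100, seed=0):
    """Return (rmse_baseline, rmse_with_cep) on a held-out split of synthetic data."""
    rng = np.random.default_rng(seed)

    # Two generic descriptors and a nonlinear synthetic "property" (assumed).
    x1 = rng.uniform(0.0, 1.0, n_samples)
    x2 = rng.uniform(0.0, 1.0, n_samples)
    y = np.sin(3.0 * x1) + x2 ** 2

    # Hypothetical crude estimate: systematically biased and noisy, but
    # correlated with the true property (e.g. a cheap approximate calculation).
    cep = 0.8 * y + 0.3 + rng.normal(0.0, 0.05, n_samples)

    # Small-data regime: about 100 samples, 70/30 train/test split.
    idx = rng.permutation(n_samples)
    n_train = int(0.7 * n_samples)
    train, test = idx[:n_train], idx[n_train:]

    # Both models fit three coefficients (intercept + two features).
    X_base = np.column_stack([x1, x2])
    X_cep = np.column_stack([x1, cep])

    coef_base = fit_linear(X_base[train], y[train])
    coef_cep = fit_linear(X_cep[train], y[train])

    rmse_base = rmse(y[test], predict_linear(coef_base, X_base[test]))
    rmse_cep = rmse(y[test], predict_linear(coef_cep, X_cep[test]))
    return rmse_base, rmse_cep


if __name__ == "__main__":
    base, with_cep = compare_with_and_without_cep()
    print(f"baseline RMSE: {base:.3f}  with CEP: {with_cep:.3f}")
```

In this toy setting the linear baseline underfits the nonlinear target, while the model that sees the crude estimate only has to learn a correction to it. How large the gain is on real materials data depends on how well the CEP tracks the true property.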


Summary

Introduction

In the past few decades, the substantial advancement of machine learning (ML) has extended the application of this data-driven approach throughout science, commerce, and industry.[1] Lee et al. examined ML models for the band gaps of inorganic compounds and found that the prediction accuracy converged for the ordinary least-squares regression and LASSO models at certain training-set sizes, while for the support vector machine model the error was still slowly decreasing at the largest dataset in their study. While these studies unambiguously demonstrated that limited availability of training data makes patterns more difficult to detect and degrades the capability to make predictions in unexplored domains, the role of the materials dataset in constructing ML models has not, to the best of our knowledge, been systematically investigated. Whether accurate predictive rules can be established from small available materials datasets remains unclear.

