Abstract

Scientific literature is an important vehicle for presenting research results and contains highly valuable information, so there is an urgent need for mining methods that can obtain key information from unstructured literature. However, existing literature mining efforts often ignore non-textual components, such as tables, which contain more detailed key information. In this study, we propose an information processing method to extract and analyze textual and tabular information from the large-scale literature of materials science. First, we propose a SciBERT-Fasttext-BiLSTM-CRF (SFBC) model for Named Entity Recognition (NER) in materials science literature, which combines generic dynamic word vectors (GDWVs) with domain-specific static word vectors (DSWVs). Second, we present a method to extract material names, units, and compositions from the tables in the literature. Compared to other table recognition methods, this method excels at extracting structured material composition information. Furthermore, a Gradient Boosting Decision Tree algorithm is used to predict trends in four properties (corrosion resistance, ductility, strength, and hardness) on the basis of the material compositions, methods, properties, and property changes extracted from texts and tables. The proposed method can thus be applied to predict material properties. Finally, we use stainless steel as a case study to validate our method. From 11,058 scientific papers on stainless steel, 2.36 million material entities and 7,970 material compositions are extracted. The extraction results are filtered and applied to predict the four property trends of stainless steel. The method proposed in this paper can improve the accuracy of large-scale data extraction from materials science literature, and the results can be used to guide the optimization of material properties and accelerate the pace of data-driven material design.
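
To make the table extraction step concrete, the following is a minimal sketch of how material names, units, and compositions might be read from a composition table that has already been parsed into a header and rows. It is not the method used in the paper: the function name `extract_compositions`, the element whitelist, the unit pattern, and the sample table values are all assumptions made for illustration.

```python
import re

# Hypothetical element whitelist for illustration; a real system would cover the periodic table.
ELEMENTS = {"C", "Cr", "Ni", "Mo", "Mn", "Si", "N", "P", "S", "Fe", "Cu", "Ti", "Nb"}

# Assumed unit notation: weight or atomic percent, e.g. "wt.%" or "at%".
UNIT_PATTERN = re.compile(r"(wt|at)\.?\s*%", re.IGNORECASE)


def extract_compositions(caption, header, rows):
    """Return one record per table row: material name, unit, and element values."""
    unit_match = UNIT_PATTERN.search(caption)
    unit = unit_match.group(0) if unit_match else None

    # Columns whose header is an element symbol are treated as composition columns.
    element_columns = [
        (index, name.strip())
        for index, name in enumerate(header)
        if name.strip() in ELEMENTS
    ]

    records = []
    for row in rows:
        composition = {}
        for index, symbol in element_columns:
            cell = row[index].strip()
            try:
                composition[symbol] = float(cell)
            except ValueError:
                # Keep non-numeric cells such as "Bal." as text rather than guessing a value.
                composition[symbol] = cell
        # Assumption: the first column holds the material name.
        records.append({"material": row[0].strip(), "unit": unit, "composition": composition})
    return records


# Illustrative input only; these values are made up and not taken from the paper.
records = extract_compositions(
    "Table 1. Chemical composition of the steels (wt.%)",
    ["Steel", "C", "Cr", "Ni", "Fe"],
    [
        ["304", "0.08", "18.0", "8.0", "Bal."],
        ["316L", "0.03", "17.0", "12.0", "Bal."],
    ],
)
for record in records:
    print(record)
```

Records in this shape (one dictionary per material, with a shared unit) are the kind of structured input that a downstream property-trend model could consume after filtering.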
