Abstract

Materials informatics requires large-scale collection and analysis of material synthesis procedures described in the literature in order to design materials computationally. However, existing studies have not analyzed these procedures at the paragraph level. Moreover, because most synthesis procedures are described in natural language in articles and technical documents, they must first be structured, through information extraction, into a format that computers can handle. In this study, we therefore construct a pipeline system that extracts synthesis procedures from text in the form of flow graphs and analyzes each procedure as a flow graph rather than as a set of processes. The system extracts entities with a deep learning model and relations between entities with a rule-based extractor from all paragraphs in the literature, and then selects procedures that contain valid structures of entities and relations. On a benchmark dataset, the entity extractor, relation extractor, and pipeline extractor achieved micro-averaged F-scores of 0.807, 0.830, and 0.609, respectively. We applied the system to a large body of literature and extracted approximately 90,000 flow graphs (procedures) containing approximately 4 million entities and 3 million relations. We then analyzed the extracted graphs, including computing statistics and identifying frequent subgraphs. The analyses confirmed methods commonly used in materials science; for example, ethanol is often dried by heating at 60 °C, and less-reactive noble gases are rarely included in the products. These results confirm experimentally that the extracted procedures are reasonable.
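For readers unfamiliar with the metric reported above, a micro-averaged F-score pools true positives, false positives, and false negatives across all labels before computing precision and recall. The following is a minimal sketch of that calculation only, not the evaluation code used in the study; the function name `micro_f1`, the label names, and the counts are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Tuple


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-label (tp, fp, fn) counts.

    Counts are summed over all labels first, so frequent labels
    contribute more to the score than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical label names and counts, not taken from the paper.
example_counts = {
    "MATERIAL": (80, 10, 20),
    "OPERATION": (40, 20, 10),
}
score = micro_f1(example_counts)
print(f"micro-averaged F1 = {score:.3f}")
```

Because the counts are pooled, a label with many instances dominates the result, which is one reason a pipeline score can fall below the scores of its individual stages when errors compound.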
