Text-mined dataset of inorganic materials synthesis recipes

Olga Kononova, Tiago Botari, Wenhao Sun, Ziqin Rong, Vahe Tshitoyan, Gerbrand Ceder, Haoyan Huo, Tanjin He

doi:10.1038/s41597-019-0224-1

Abstract

Materials discovery has been significantly facilitated and accelerated by high-throughput ab-initio computations. This ability to rapidly design interesting novel compounds has shifted the materials innovation bottleneck to the development of synthesis routes for the desired material. As there is no fundamental theory for materials synthesis, one might attempt a data-driven approach for predicting inorganic materials synthesis, but this is impeded by the lack of a comprehensive database containing synthesis processes. To overcome this limitation, we have generated a dataset of “codified recipes” for solid-state synthesis automatically extracted from scientific publications. The dataset consists of 19,488 synthesis entries retrieved from 53,538 solid-state synthesis paragraphs by using text mining and natural language processing approaches. Every entry contains information about the target material, starting compounds, operations used and their conditions, as well as the balanced chemical equation of the synthesis reaction. The dataset is publicly available and can be used for data mining of various aspects of inorganic materials synthesis.
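To make the shape of one "codified recipe" concrete, the following is a minimal Python sketch of a record holding the four kinds of information listed above (target material, starting compounds, operations with conditions, and the balanced reaction). The class names, field names, and helper function are hypothetical and are not the published schema of the dataset. The example recipe uses the textbook reaction BaCO3 + TiO2 -> BaTiO3 + CO2, and its temperature and time values are invented placeholders, not values taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operation:
    """One synthesis step and its conditions (hypothetical structure)."""
    kind: str                                   # e.g. "mixing", "heating"
    conditions: Dict[str, str] = field(default_factory=dict)


@dataclass
class SynthesisEntry:
    """One codified recipe (field names are assumptions, not the real schema)."""
    target: str                                 # target material
    precursors: List[str]                       # starting compounds
    operations: List[Operation]                 # ordered synthesis steps
    reaction: str                               # balanced equation as text


def operations_of_kind(entry: SynthesisEntry, kind: str) -> List[Operation]:
    """Return the operations of a given kind, preserving recipe order."""
    return [op for op in entry.operations if op.kind == kind]


# Illustrative entry; the condition values are placeholders, not real data.
entry = SynthesisEntry(
    target="BaTiO3",
    precursors=["BaCO3", "TiO2"],
    operations=[
        Operation("mixing"),
        Operation("heating", {"temperature": "1100 C", "time": "12 h"}),
    ],
    reaction="BaCO3 + TiO2 -> BaTiO3 + CO2",
)

heating_steps = operations_of_kind(entry, "heating")
```

Keeping operations as an ordered list with free-form condition fields is one possible design choice; it allows recipes with different numbers of steps to share a single record type.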

Highlights

The number of big-data-driven projects for materials discovery has been boosted significantly in the last decades due to Materials Genome Initiative efforts[1] and growth of computational tools[2,3,4,5,6].
Scientific publications have accumulated an enormous amount of information about materials, but the data is presented in unstructured and arbitrary form, which significantly obstructs its use in data-driven research[17].
We provide a fully auto-generated, open-source dataset of 19,488 chemical reactions retrieved from 53,538 solid-state synthesis paragraphs.

Summary

Background & Summary

The number of big-data-driven projects for materials discovery has been boosted significantly in the last decades due to Materials Genome Initiative efforts[1] and growth of computational tools[2,3,4,5,6]. The development of text mining and natural language processing (NLP) approaches has made it possible to implement various automated methodologies for converting scientific text into structured data collections[20,21]. Kim et al. created a publicly available dataset of inorganic synthesis parameters for 30 different oxide systems extracted from the literature[20]. They used their data to provide guidelines for titania nanotube synthesis[30]. Digitizing and systematizing the large corpus of existing solid-state chemistry literature enables us to make a first step toward the development of data-driven approaches for understanding inorganic materials synthesis and synthesizability.

Methods

Findings

Code Availability