Extracting and analyzing inorganic material synthesis procedures in the literature
Abstract Designing materials computationally requires structured information on material names and synthesis procedures, which must be collected by analyzing the synthesis procedures reported in the literature. Since synthesis procedures are mostly written in natural language in papers or technical documents, they need to be extracted through information extraction and structured into a format that a computer can handle. Moreover, representing a synthesis procedure requires expressing information such as the conditions and the order of operations in the procedure, but existing databases that compile structured information on material names and synthesis procedures do not provide such procedural information. It is therefore necessary to create a framework that extracts and organizes the synthesis procedure information in text so that it is sufficient for material development, including the order of operations and the links among materials, operations, and conditions. In this study, we construct a pipeline system that extracts synthesis procedures from text in the form of flow graphs. The extraction system consists of preprocessing, deep learning-based entity extraction, rule-based relation extraction, and selection of paragraphs that contain procedures. We applied the system to a large body of literature and extracted flow graphs (procedures) that include about 4 million entities and 3 million relations. We computed statistics on the extracted graphs and performed several analyses of them. We experimentally confirmed that some extracted operations were specific to the target material and that the frequently extracted sub-graphs include reasonable operations.
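To make the notion of a flow graph concrete, the following is a minimal Python sketch of one way such a graph could be represented. It is an illustration under stated assumptions, not the schema used in this study: the entity labels (`MATERIAL`, `OPERATION`, `CONDITION`), the relation kinds (`next`, `input`, `condition`), the class and function names, and the toy example are all hypothetical, and `operation_order` assumes a simple linear chain of operations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    id: int
    label: str  # hypothetical labels: "MATERIAL", "OPERATION", "CONDITION"
    text: str


@dataclass(frozen=True)
class Relation:
    source: int  # id of the source entity
    target: int  # id of the target entity
    kind: str    # hypothetical kinds: "next", "input", "condition"


@dataclass
class FlowGraph:
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_entity(self, id, label, text):
        self.entities[id] = Entity(id, label, text)

    def add_relation(self, source, target, kind):
        self.relations.append(Relation(source, target, kind))

    def operation_order(self):
        """Follow 'next' edges from the first operation (linear chain assumed)."""
        nxt = {r.source: r.target for r in self.relations if r.kind == "next"}
        ops = [e.id for e in self.entities.values() if e.label == "OPERATION"]
        targets = set(nxt.values())
        starts = [i for i in ops if i not in targets]
        order, seen = [], set()
        current = starts[0] if starts else None
        while current is not None and current not in seen:
            seen.add(current)
            order.append(self.entities[current].text)
            current = nxt.get(current)
        return order

    def attached(self, op_id, kind):
        """Texts of entities linked to an operation by the given relation kind."""
        return [self.entities[r.source].text
                for r in self.relations
                if r.target == op_id and r.kind == kind]


def build_example():
    # Invented toy procedure: "TiO2 was mixed and then calcined at 900 C."
    g = FlowGraph()
    g.add_entity(0, "MATERIAL", "TiO2")
    g.add_entity(1, "OPERATION", "mixed")
    g.add_entity(2, "OPERATION", "calcined")
    g.add_entity(3, "CONDITION", "900 C")
    g.add_relation(0, 1, "input")
    g.add_relation(1, 2, "next")
    g.add_relation(3, 2, "condition")
    return g
```

In this sketch, calling `build_example().operation_order()` recovers the order of operations, and `attached(2, "condition")` returns the conditions linked to the second operation, which mirrors the links among materials, operations, and conditions described above.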