Extracting and analyzing inorganic material synthesis procedures in the literature

Abstract: Analyzing synthesis procedures from a considerable amount of literature is required to collect structural information of material names and synthesis procedures for designing materials computationally. There are databases comprising structural data of material names and material synthesis procedures. However, the types of material in these databases and the material property values included are limited and insufficient. Moreover, synthesis procedures are primarily described in the literature in the natural language of the researcher who proposed the procedure, and thus they cannot be understood universally. It is, therefore, necessary to create a framework that represents textual synthesis procedures in a flow graph that contains crucial information for material development such as the order of operations and the linkage between operations and conditions. This will facilitate obtaining material insights from the literature. However, there are no large-scale studies on the extraction of synthesis procedures in the form of a graph and analysis thereof. In this study, we propose a pipeline system that extracts synthesis procedures from text in the literature in the form of a graph with a clear order of operations and clear links between conditions and the operations they apply to. The system consists of preprocessing, entity extraction, which is based on Mat-ELMo and Bi-LSTM-CRF models, rule-based relation extraction, and selection of paragraphs containing procedures. We applied the system to a large body of literature and extracted various synthesis procedures. We performed basic analyses of the extracted procedures to examine their usability. We experimentally confirmed that some extracted procedures were specific to the target material, and some of the obvious procedures were correctly extracted.

Extracting and analyzing inorganic material synthesis procedures in the literature

10.21203/rs.3.rs-636735/v2 ◽

2021 ◽

Author(s):

Kohei Makino ◽

Fusataka Kuniyoshi ◽

Jun Ozawa ◽

Makoto Miwa

Keyword(s):

Structural Information ◽

Large Body ◽

Target Material ◽

Relation Extraction ◽

Pipeline System ◽

Entity Extraction ◽

Material Synthesis ◽

Material Development ◽

Express Information ◽

Flow Graphs

Abstract: Analyzing material synthesis procedures in the literature is required to collect structural information of material names and synthesis procedures for designing materials computationally. Since synthesis procedures are mostly written in natural language in papers or technical documents, they need to be extracted and structured into a format that can be handled by a computer through information extraction. Moreover, to represent a synthesis procedure, it is necessary to express information such as conditions and the order of operations in the procedure, but existing databases that compile structural information of material names and synthesis procedures of materials do not provide such information about procedures. It is, therefore, necessary to create a framework that extracts and organizes the information of synthesis procedures in text so that it captures what material development needs, such as the order of operations and the links among materials, operations, and conditions. In this study, we construct a pipeline system that extracts synthesis procedures from a text in the form of a flow graph. The extraction system consists of preprocessing, deep learning-based entity extraction, rule-based relation extraction, and selection of paragraphs containing procedures. We applied the system to a large body of literature and extracted flow graphs (procedures) that include about 4 million entities and 3 million relations. We computed several statistics on the extracted graphs and performed several analyses of them. We experimentally confirmed that some extracted operations were specific to the target material and that the frequently extracted sub-graphs include reasonable operations.
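As an illustration of what a flow graph of a synthesis procedure might look like in code, the following is a minimal sketch, not the authors' implementation. The entity types (`material`, `operation`, `condition`) and relation labels (`input`, `next`, `condition`) are assumptions chosen for the example; the abstract does not specify the actual schema. The sketch recovers the order of operations with a topological sort and looks up the conditions linked to an operation.

```python
from collections import defaultdict, deque


def operation_order(entities, relations):
    """Order operations by following hypothetical "next" relations (Kahn's algorithm)."""
    ops = [name for name, kind in entities.items() if kind == "operation"]
    successors = defaultdict(list)
    indegree = {op: 0 for op in ops}
    for src, label, dst in relations:
        # Only operation-to-operation "next" edges define the ordering.
        if label == "next" and src in indegree and dst in indegree:
            successors[src].append(dst)
            indegree[dst] += 1
    queue = deque(op for op in ops if indegree[op] == 0)
    order = []
    while queue:
        op = queue.popleft()
        order.append(op)
        for nxt in successors[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order


def conditions_of(operation, relations):
    """Return the condition entities linked to one operation."""
    return [src for src, label, dst in relations
            if label == "condition" and dst == operation]


# Hypothetical procedure: mix TiO2, heat at 900 C, then grind.
entities = {
    "TiO2": "material",
    "mix": "operation",
    "heat": "operation",
    "grind": "operation",
    "900 C": "condition",
}
relations = [
    ("TiO2", "input", "mix"),
    ("mix", "next", "heat"),
    ("900 C", "condition", "heat"),
    ("heat", "next", "grind"),
]

print(operation_order(entities, relations))
print(conditions_of("heat", relations))
```

Storing relations as plain `(source, label, target)` triples keeps the sketch independent of any graph library; a real pipeline would likely attach text spans and confidence scores to each entity.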

Heterologous Expression and Purification Systems for Structural Proteomics of Mammalian Membrane Proteins

Comparative and Functional Genomics ◽

10.1002/cfg.218 ◽

2002 ◽

Vol 3 (6) ◽

pp. 511-517 ◽

Cited By ~ 11

Author(s):

Isabelle Mus-Veteau

Keyword(s):

Membrane Proteins ◽

Heterologous Expression ◽

Large Scale ◽

Structural Information ◽

Structural Data ◽

Structural Proteomics ◽

Scale Production ◽

Expression And Purification ◽

Large Scale Production ◽

Heterologous Expression Systems

Membrane proteins (MPs) are responsible for the interface between the exterior and the interior of the cell. These proteins are implicated in numerous diseases, such as cancer, cystic fibrosis, epilepsy, hyperinsulinism, heart failure, hypertension and Alzheimer's disease. However, studies on these disorders are hampered by a lack of structural information about the proteins involved. Structural analysis requires large quantities of pure and active proteins. The majority of medically and pharmaceutically relevant MPs are present in tissues at very low concentration, which makes heterologous expression in large-scale production-adapted cells a prerequisite for structural studies. Obtaining mammalian MP structural data depends on the development of methods that allow the production of large quantities of MPs. This review focuses on the different heterologous expression systems, and the purification strategies, used to produce large amounts of pure mammalian MPs for structural proteomics.

Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites

Nucleic Acids Research ◽

10.1093/nar/gkaa1134 ◽

2020 ◽

Vol 48 (22) ◽

pp. 12604-12617

Author(s):

Pengpeng Long ◽

Lu Zhang ◽

Bin Huang ◽

Quan Chen ◽

Haiyan Liu

Keyword(s):

Genome Sequence ◽

Energy Function ◽

Structural Information ◽

Structural Data ◽

P Values ◽

A Genome ◽

Z Scores ◽

Transcription Regulators ◽

Dna Specificity ◽

Tetracycline Repressor

Abstract: We report an approach to predict DNA specificity of the tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined with quantitative P-values defined to filter out reliable predictions. Then, a framework was introduced to incorporate structural data and to train a statistical energy function to score the pairing between TFR and TFR binding site (TFBS) based on sequences. When the predictions were benchmarked against experiments, TFBSs for 29 out of 30 TFRs were correctly predicted by either the genome sequence-based or the statistical energy-based method. Using P-values or Z-scores as indicators, we estimate that 59.6% of TFRs are covered with relatively reliable predictions by at least one of the two methods, while only 28.7% are covered by the genome sequence-based method alone. Our approach predicts a large number of new TFBSs which cannot be correctly retrieved from public databases such as FootprintDB. High-throughput experimental assays suggest that the statistical energy can model the TFBSs of a significant number of TFRs reliably. Thus the energy function may be applied to explore for new TFBSs in respective genomes. It is possible to extend our approach to other transcription factor families with sufficient structural information.

A Novel Method to Predict Drug-Target Interactions Based on Large-Scale Graph Representation Learning

Cancers ◽

10.3390/cancers13092111 ◽

2021 ◽

Vol 13 (9) ◽

pp. 2111

Author(s):

Bo-Wei Zhao ◽

Zhu-Hong You ◽

Lun Hu ◽

Zhen-Hao Guo ◽

Lei Wang ◽

...

Keyword(s):

Drug Target ◽

Large Scale ◽

Computational Models ◽

Structural Information ◽

Characteristic Curve ◽

Representation Learning ◽

Graph Representation ◽

Convolutional Network ◽

Novel Method

Identification of drug-target interactions (DTIs) is a significant step in the drug discovery or repositioning process. Compared with time-consuming and labor-intensive in vivo experimental methods, computational models can provide high-quality DTI candidates almost instantly. In this study, we propose a novel method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI can capture the local and global structural information of the graph. Specifically, the first-order neighbor information of nodes can be aggregated by the graph convolutional network (GCN); on the other hand, the high-order neighbor information of nodes can be learned by the graph embedding method called DeepWalk. Finally, the two kinds of features are fed into the random forest classifier to train and predict potential DTIs. The results show that our method obtained area under the receiver operating characteristic curve (AUROC) of 0.9455 and area under the precision-recall curve (AUPR) of 0.9491 under 5-fold cross-validation. Moreover, we compare the presented method with some existing state-of-the-art methods. These results imply that LGDTI can efficiently and robustly capture undiscovered DTIs. Moreover, the proposed model is expected to bring new inspiration and provide novel perspectives to relevant researchers.
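To make the DeepWalk step more concrete, here is a minimal sketch of its first stage only: generating truncated random walks over a graph. It is not LGDTI's code. The toy drug-target graph, the function name `random_walks`, and all parameter values are assumptions for illustration; the later stages (skip-gram training on the walks, the GCN features, and the random forest classifier) are omitted.

```python
import random


def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Generate truncated uniform random walks, as in the first stage of DeepWalk."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    nodes = sorted(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical bipartite toy graph: drugs d1, d2 both interact with target t1.
adj = {"d1": ["t1"], "t1": ["d1", "d2"], "d2": ["t1"]}
for walk in random_walks(adj, walk_length=4, walks_per_node=2):
    print(walk)
```

Each walk acts like a sentence of node identifiers. Nodes that co-occur in many walks end up with similar embeddings once a skip-gram model is trained on them, which is how higher-order neighborhood information enters the features.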

An Attention-Based Model Using Character Composition of Entities in Chinese Relation Extraction

Information ◽

10.3390/info11020079 ◽

2020 ◽

Vol 11 (2) ◽

pp. 79 ◽

Cited By ~ 2

Author(s):

Xiaoyu Han ◽

Yue Zhang ◽

Wenkai Zhang ◽

Tinglei Huang

Keyword(s):

Language Processing ◽

Large Scale ◽

Named Entity Recognition ◽

Relation Extraction ◽

Entity Recognition ◽

Additional Information ◽

Named Entity ◽

Proposed Model ◽

The Relationship ◽

Crucial Part

Relation extraction is a vital task in natural language processing. It aims to identify the relationship between two specified entities in a sentence. Besides the information contained in the sentence, additional information about the entities has been shown to be helpful in relation extraction. Additional information such as the entity type obtained by NER (Named Entity Recognition) and the description provided by a knowledge base both have their limitations. Nevertheless, there exists another way to provide additional information which can overcome these limitations in Chinese relation extraction. Because Chinese characters usually have explicit meanings and can carry more information than English letters, we suggest that the characters that constitute the entities can provide additional information which is helpful for the relation extraction task, especially in large-scale datasets. This assumption has never been verified before. The main obstacle is the lack of large-scale Chinese relation datasets. In this paper, first, we generate a large-scale Chinese relation extraction dataset based on a Chinese encyclopedia. Second, we propose an attention-based model using the characters that compose the entities. The result on the generated dataset shows that these characters can provide useful information for the Chinese relation extraction task. By using this information, the attention mechanism we used can recognize the crucial part of the sentence that expresses the relation. The proposed model outperforms other baseline models on our Chinese relation extraction dataset.

Mechanisms Applied by Protein Inhibitors to Inhibit Cysteine Proteases

International Journal of Molecular Sciences ◽

10.3390/ijms22030997 ◽

2021 ◽

Vol 22 (3) ◽

pp. 997

Author(s):

Livija Tušar ◽

Aleksandra Usenik ◽

Boris Turk ◽

Dušan Turk

Keyword(s):

Binding Site ◽

Structural Information ◽

General Rule ◽

Structural Data ◽

Cysteine Proteases ◽

Protein Inhibitors ◽

Active Site Cleft ◽

Living Organisms ◽

And Control ◽

Binding Mechanisms

Protein inhibitors of proteases are an important tool of nature to regulate and control proteolysis in living organisms under physiological and pathological conditions. In this review, we analyzed the mechanisms of inhibition of cysteine proteases on the basis of structural information and compiled kinetic data. The gathered structural data indicate that the protein fold is not a major obstacle for the evolution of a protease inhibitor. It appears that nature can convert almost any starting fold into an inhibitor of a protease. In addition, there appears to be no general rule governing the inhibitory mechanism. The structural data make it clear that the “lock and key” mechanism is a historical concept with limited validity. However, the analysis suggests that the shape of the active site cleft of proteases imposes some restraints. When the S1 binding site is shaped as a pocket buried in the structure of protease, inhibitors can apply substrate-like binding mechanisms. In contrast, when the S1 binding site is in part exposed to solvent, the substrate-like inhibition cannot be employed. It appears that all proteases, with the exception of papain-like proteases, belong to the first group of proteases. Finally, we show a number of examples and provide hints on how to engineer protein inhibitors.

Efficient and High-Quality Seeded Graph Matching: Employing Higher-order Structural Information

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3442340 ◽

2021 ◽

Vol 15 (3) ◽

pp. 1-31

Author(s):

Haida Zhang ◽

Zengfeng Huang ◽

Xuemin Lin ◽

Zhe Lin ◽

Wenjie Zhang ◽

...

Keyword(s):

Large Scale ◽

Graph Matching ◽

Structural Information ◽

Experimental Studies ◽

Higher Order ◽

Personalized Pagerank ◽

Matching Accuracy ◽

Approximation Techniques ◽

Order Of Magnitude ◽

Matching Score

Driven by many real applications, we study the problem of seeded graph matching. Given two graphs and a small set of pre-matched node pairs, each pairing a node of the first graph with a node of the second, the problem is to identify a matching between the nodes of the two graphs growing from the pre-matched set, such that each pair in the matching corresponds to the same underlying entity. Recent studies on efficient and effective seeded graph matching have drawn a great deal of attention and many popular methods are largely based on exploring the similarity between local structures to identify matching pairs. While these recent techniques work provably well on random graphs, their accuracy is low over many real networks. In this work, we propose to utilize higher-order neighboring information to improve the matching accuracy and efficiency. As a result, a new framework of seeded graph matching is proposed, which employs Personalized PageRank (PPR) to quantify the matching score of each node pair. To further boost the matching accuracy, we propose a novel postponing strategy, which postpones the selection of pairs that have competitors with similar matching scores. We show that the postponing strategy indeed significantly improves the matching accuracy. To improve the scalability of matching large graphs, we also propose efficient approximation techniques based on algorithms for computing PPR heavy hitters. Our comprehensive experimental studies on large-scale real datasets demonstrate that, compared with state-of-the-art approaches, our framework not only increases both precision and recall by a significant margin but also achieves a speed-up of more than one order of magnitude.
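For readers unfamiliar with Personalized PageRank, the following is a minimal sketch that computes PPR from a single source node by plain power iteration. It is not the paper's method: how the authors combine PPR values into pair matching scores, and their heavy-hitter approximation techniques, are not described in the abstract and are not reproduced here. The restart probability `alpha`, the iteration count, and the toy graph are assumptions for illustration.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Approximate PPR from `source` by power iteration.

    At each step a walker restarts at `source` with probability `alpha`,
    otherwise it moves to a uniformly chosen neighbor.
    """
    p = {node: 0.0 for node in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        for u, mass in p.items():
            neighbors = adj[u]
            if neighbors:
                share = (1.0 - alpha) * mass / len(neighbors)
                for v in neighbors:
                    nxt[v] += share
            else:
                # Dangling node: return its walking mass to the source.
                nxt[source] += (1.0 - alpha) * mass
        # Total mass is 1, so the restart contributes exactly alpha.
        nxt[source] += alpha
        p = nxt
    return p


# Toy undirected path graph a - b - c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = personalized_pagerank(adj, "a")
print(scores)
```

Because the walker keeps restarting at the source, nodes close to the source receive more mass than distant ones, which is what lets PPR summarize neighborhood structure beyond direct neighbors.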

High-throughput super-resolution analysis of influenza virus pleomorphism reveals insights into viral spatial organization

10.1101/2021.09.23.461536 ◽

2021 ◽

Author(s):

Andrew McMahon ◽

Rebecca Andrews ◽

Sohail V Ghani ◽

Thorben Cordes ◽

Achillefs N Kapanidis ◽

...

Keyword(s):

Large Scale ◽

Spatial Organization ◽

Structural Information ◽

Virus Assembly ◽

Super Resolution ◽

Automated Analysis ◽

Size Analysis ◽

Analysis Pipeline ◽

Single Experiment ◽

Viral Immunology

Many viruses form highly pleomorphic particles; in influenza, these particles range from spheres of ~ 100 nm in diameter to filaments of several microns in length. Virion structure is of interest, not only in the context of virus assembly, but also because pleomorphic variations may correlate with infectivity and pathogenicity. Detailed images of virus morphology often rely on electron microscopy, which is generally low throughput and limited in molecular identification. We have used fluorescence super-resolution microscopy combined with a rapid automated analysis pipeline to image many thousands of individual influenza virions, gaining information on their size, morphology and the distribution of membrane-embedded and internal proteins. This large-scale analysis revealed that influenza particles can be reliably characterised by length, that no spatial frequency patterning of the surface glycoproteins occurs, and that RNPs are preferentially located towards filament ends within Archetti bodies. Our analysis pipeline is versatile and can be adapted for use on multiple other pathogens, as demonstrated by its application for the size analysis of SARS-CoV-2. The ability to gain nanoscale structural information from many thousands of viruses in just a single experiment is valuable for the study of virus assembly mechanisms, host cell interactions and viral immunology, and should be able to contribute to the development of viral vaccines, anti-viral strategies and diagnostics.

A billion synthetic 3D-antibody-antigen complexes enable unconstrained machine-learning formalized investigation of antibody specificity prediction

10.1101/2021.07.06.451258 ◽

2021 ◽

Author(s):

Philippe Auguste Robert ◽

Rahmad Akbar ◽

Robert Frank ◽

Milena Pavlović ◽

Michael Widrich ◽

...

Keyword(s):

Machine Learning ◽

In Silico ◽

Prediction Accuracy ◽

Large Scale ◽

Structural Information ◽

Antigen Binding ◽

Antibody Specificity ◽

Binding Prediction ◽

Information Encoding ◽

Prediction Problems

Machine learning (ML) is a key technology to enable accurate prediction of antibody-antigen binding, a prerequisite for in silico vaccine and antibody design. Two orthogonal problems hinder the current application of ML to antibody-specificity prediction and the benchmarking thereof: (i) the lack of a unified formalized mapping of immunological antibody specificity prediction problems into ML notation and (ii) the unavailability of large-scale training datasets. Here, we developed the Absolut! software suite that allows the parameter-based unconstrained generation of synthetic lattice-based 3D-antibody-antigen binding structures with ground-truth access to conformational paratope, epitope, and affinity. We show that Absolut!-generated datasets recapitulate critical biological sequence and structural features that render antibody-antigen binding prediction challenging. To demonstrate the immediate, high-throughput, and large-scale applicability of Absolut!, we have created an online database of 1 billion antibody-antigen structures, the extension of which is only constrained by moderate computational resources. We translated immunological antibody specificity prediction problems into ML tasks and used our database to investigate paratope-epitope binding prediction accuracy as a function of structural information encoding, dataset size, and ML method, which is infeasible with existing experimental data. Furthermore, we found that the conditions investigated in silico that were predicted to increase antibody specificity prediction accuracy align with and extend conclusions drawn from experimental antibody-antigen structural data. In summary, the Absolut! framework enables the development and benchmarking of ML strategies for biotherapeutics discovery and design.

Recent progress in laser texturing of battery materials: a review of tuning electrochemical performances, related material development, and prospects for large-scale manufacturing

International Journal of Extreme Manufacturing ◽

10.1088/2631-7990/abca84 ◽

2020 ◽

Vol 3 (1) ◽

pp. 012002

Author(s):

Wilhelm Pfleging

Keyword(s):

Large Scale ◽

Recent Progress ◽

Electrochemical Performances ◽

Laser Texturing ◽

Battery Materials ◽

Related Material ◽

Material Development
