Biomedical Literature Mining for Biomedical Relation Extraction

Jahiruddin

doi:10.26438/ijcse/v6i8.8493

PALMER: improving pathway annotation based on the biomedical literature mining with a constrained latent block model

BMC Bioinformatics ◽

10.1186/s12859-020-03756-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Jin Hyun Nam ◽

Daniel Couch ◽

Willian A. da Silveira ◽

Zhenning Yu ◽

Dongjun Chung

Keyword(s):

Clustering Algorithms ◽

Biomedical Literature ◽

Literature Mining ◽

Biological Knowledge ◽

Block Model ◽

Prior Biological Knowledge ◽

Gene Associations ◽

Biological Interpretation ◽

Biomedical Literature Mining ◽

Block Models

Abstract

Background: In systems biology, it is of great interest to identify previously unreported associations between genes. Recently, the biomedical literature has come to be regarded as a valuable resource for this purpose. While classical clustering algorithms have been widely used to investigate associations among genes, they are not tuned for literature mining data and rely on strong assumptions, such as homogeneity and independence among observations. These assumptions are often violated in this type of data because of redundancies in functional descriptions and biological functions shared among genes. Latent block models can be alternatives in this case, but they also often perform suboptimally, especially when signals are weak. In addition, they do not allow the use of valuable prior biological knowledge, such as that available in existing databases.

Results: To address these limitations, we propose PALMER, a constrained latent block model that identifies indirect relationships among genes based on biomedical literature mining data. By automatically associating relevant Gene Ontology terms, PALMER facilitates biological interpretation of novel findings without laborious downstream analyses. PALMER also allows researchers to use prior biological knowledge about known gene-pathway relationships to guide the identification of gene–gene associations. We evaluated PALMER with simulation studies and with applications to studies of pathway-modulating genes relevant to cancer signaling pathways, using biological pathway annotations available in the KEGG database as prior knowledge.

Conclusions: We showed that PALMER outperforms traditional latent block models and provides reliable identification of novel gene–gene associations by using prior biological knowledge, especially when signals are weak in the biomedical literature mining dataset. We believe that PALMER and its accompanying user-friendly software will be powerful tools for improving existing pathway annotations and identifying novel pathway-modulating genes.

Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-020-01227-6 ◽

2020 ◽

Vol 20 (1) ◽

Author(s):

Charles C. N. Wang ◽

Jennifer Jin ◽

Jan-Gowth Chang ◽

Masahiro Hayakawa ◽

Atsushi Kitazawa ◽

...

Keyword(s):

Gastrointestinal Cancer ◽

Influence Maximization ◽

Biomedical Literature ◽

Literature Mining ◽

Biomedical Literature Mining

Big Data Framework for Scalable and Efficient Biomedical Literature Mining in the Cloud

Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval - NLPIR 2019 ◽

10.1145/3342827.3342843 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zhengru Shen ◽

Xi Wang ◽

Marco Spruit

Keyword(s):

Big Data ◽

Biomedical Literature ◽

Literature Mining ◽

Data Framework ◽

Biomedical Literature Mining

Exploring concept graphs for biomedical literature mining

2015 International Conference on Big Data and Smart Computing (BIGCOMP) ◽

10.1109/35021bigcomp.2015.7072818 ◽

2015 ◽

Cited By ~ 2

Author(s):

Min Song

Keyword(s):

Biomedical Literature ◽

Literature Mining ◽

Biomedical Literature Mining

Time-based discovery in biomedical literature: mining temporal links

International Journal of Data Analysis Techniques and Strategies ◽

10.1504/ijdats.2013.053679 ◽

2013 ◽

Vol 5 (2) ◽

pp. 148 ◽

Cited By ~ 4

Author(s):

Corrado Loglisci

Keyword(s):

Biomedical Literature ◽

Literature Mining ◽

Biomedical Literature Mining

A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2010.99 ◽

2011 ◽

Vol 8 (2) ◽

pp. 294-307 ◽

Cited By ~ 10

Author(s):

Yanpeng Li ◽

Xiaohua Hu ◽

Hongfei Lin ◽

Zhihao Yang

Keyword(s):

Biomedical Literature ◽

Literature Mining ◽

Feature Generation ◽

Biomedical Literature Mining

PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature

10.1101/654475 ◽

2019 ◽

Author(s):

Morteza Pourreza Shahri ◽

Mandi M. Roe ◽

Gillian Reynolds ◽

Indika Kahanda

Keyword(s):

Relation Extraction ◽

Biomedical Literature ◽

Supervised Machine Learning ◽

Human Phenotype ◽

Unstructured Text ◽

Gold Standard Dataset ◽

Sentence Level ◽

Machine Learning Approach ◽

Human Proteins ◽

Biomedical Relation Extraction

Abstract: The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge contained in these articles are the relations between human proteins and their phenotypes, which can stay hidden due to the exponential growth of publications. This has presented a range of opportunities for the development of computational methods to extract these biomedical relations from the articles. However, no such method currently exists for the automated extraction of relations involving human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database composed of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using an in-house developed gold standard dataset, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction and classification constitutes a complete biomedical relation extraction pipeline for extracting protein-phenotype relations.

CCS Concepts: • Computing methodologies → Information extraction; Supervised learning by classification; • Applied computing → Bioinformatics
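The first step of the two-step pipeline described in this abstract, sentence-level co-mention extraction, can be illustrated with a minimal sketch. It is not the PPPred implementation: the function names, the toy lexicons, and the example text are all hypothetical, and simple dictionary matching stands in for whatever entity recognition the authors actually used.

```python
import re
from itertools import product


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence, lexicon):
    """Return lexicon terms that occur as whole words in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return [
        term
        for term in lexicon
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    ]


def extract_comentions(text, proteins, phenotypes):
    """Pair every protein with every phenotype mentioned in the same sentence."""
    comentions = []
    for sentence in split_sentences(text):
        found_proteins = find_terms(sentence, proteins)
        found_phenotypes = find_terms(sentence, phenotypes)
        for protein, phenotype in product(found_proteins, found_phenotypes):
            comentions.append((protein, phenotype, sentence))
    return comentions


# Toy inputs, invented for illustration only.
PROTEINS = ["BRCA1", "TP53"]
PHENOTYPES = ["breast carcinoma", "seizures"]
TEXT = (
    "Mutations in BRCA1 are associated with breast carcinoma. "
    "BRCA1 was sequenced in the cohort. "
    "No association between TP53 and seizures was found."
)

comentions = extract_comentions(TEXT, PROTEINS, PHENOTYPES)
for protein, phenotype, sentence in comentions:
    print(f"{protein} / {phenotype}: {sentence}")
```

The third toy sentence yields a co-mention even though it denies any association, which is the kind of case the second step, a supervised classifier over co-mentions, is meant to filter out. That classifier is not sketched here.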

Using biomedical literature mining to consolidate the set of known human protein-protein interactions

10.3115/1641484.1641491 ◽

2005 ◽

Cited By ~ 3

Author(s):

Arun Ramani ◽

Edward Marcotte ◽

Razvan Bunescu ◽

Raymond Mooney

Keyword(s):

Protein Interactions ◽

Biomedical Literature ◽

Human Protein ◽

Literature Mining ◽

Protein Protein Interactions ◽

Biomedical Literature Mining

Extraction of similar biomedical terms in biomedical literature mining: Examining the effect of the ratio of biomedical domain to general domain data (Preprint)

10.2196/preprints.30300 ◽

2021 ◽

Author(s):

Ziheng Zhang ◽

Feng Han ◽

Hongjian Zhang ◽

Tomohiro Aoki ◽

Katsuhiko Ogasawara

Keyword(s):

Language Processing ◽

Relation Extraction ◽

Medical Data ◽

Biomedical Literature ◽

Literature Mining ◽

Biomedical Domain ◽

Pubmed Central ◽

General Domain ◽

Biomedical Information Retrieval ◽

Science Community

Background: Biomedical terms extracted using Word2vec, the most popular word embedding model in recent years, serve as the foundation for various natural language processing (NLP) applications, such as biomedical information retrieval, relation extraction, and recommendation systems.

Objective: The objective of this study is to examine how changes in the ratio of biomedical domain to general domain data in the corpus affect the extraction of similar biomedical terms using Word2vec.

Methods: We downloaded abstracts of 214,892 articles from PubMed Central (PMC) and the 3.9 GB Billion Word (BW) benchmark corpus from the computer science community. The datasets were preprocessed and grouped into 11 corpora based on the ratio of BW to PMC, ranging from 0:10 to 10:0, and Word2vec models were then trained on these corpora. The cosine similarities between the biomedical terms obtained from the Word2vec models were then compared in each model.

Results: The results indicated that the models trained with both BW and PMC data outperformed the model trained only with medical data. The similarity between the biomedical terms extracted by the Word2vec model increased when the ratio of biomedical domain to general domain data was between 3:7 and 5:5.

Conclusions: This study allows NLP researchers to apply Word2vec based on more information and to increase the similarity of extracted biomedical terms, improving their effectiveness in NLP applications such as biomedical information extraction.
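The corpus construction and comparison described in the methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `mix_corpora` and `cosine_similarity` are hypothetical helper names, the placeholder sentences stand in for the BW and PMC data, and the Word2vec training step itself is omitted.

```python
import math
import random


def mix_corpora(general, biomedical, general_parts, total_parts=10, size=None, seed=0):
    """Sample a corpus whose general:biomedical ratio is
    general_parts:(total_parts - general_parts)."""
    if not 0 <= general_parts <= total_parts:
        raise ValueError("general_parts must lie between 0 and total_parts")
    rng = random.Random(seed)
    if size is None:
        # Assumption: cap at the smaller source so either can fill the corpus alone.
        size = min(len(general), len(biomedical))
    n_general = round(size * general_parts / total_parts)
    n_biomedical = size - n_general
    mixed = rng.sample(general, n_general) + rng.sample(biomedical, n_biomedical)
    rng.shuffle(mixed)
    return mixed


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder sentences standing in for the two source corpora.
general = [f"general sentence {i}" for i in range(100)]
biomedical = [f"biomedical sentence {i}" for i in range(100)]

# Eleven corpora, keyed by the general:biomedical ratio from 0:10 to 10:0.
corpora = {f"{g}:{10 - g}": mix_corpora(general, biomedical, g) for g in range(11)}

for ratio, corpus in corpora.items():
    n_general = sum(s.startswith("general") for s in corpus)
    print(ratio, n_general, len(corpus) - n_general)
```

In a full experiment, each of the eleven corpora would be tokenized and passed to a Word2vec trainer, and `cosine_similarity` would be applied to the resulting vectors for pairs of biomedical terms.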
