Examining the Effect of the Ratio of Biomedical Domain to General Domain Data in Corpus in Biomedical Literature Mining

2021 ◽  
Vol 12 (1) ◽  
pp. 154
Author(s):  
Ziheng Zhang ◽  
Feng Han ◽  
Hongjian Zhang ◽  
Tomohiro Aoki ◽  
Katsuhiko Ogasawara

Biomedical terms extracted using Word2vec, the most popular word embedding model of recent years, serve as the foundation for various natural language processing (NLP) applications, such as biomedical information retrieval, relation extraction, and recommendation systems. The objective of this study is to examine how changes in the ratio of biomedical domain to general domain data in the corpus affect the extraction of similar biomedical terms using Word2vec. We downloaded abstracts of 214,892 articles from PubMed Central (PMC) and the 3.9 GB Billion Word (BW) benchmark corpus from the computer science community. The datasets were preprocessed and grouped into 11 corpora based on the ratio of BW to PMC data, ranging from 0:10 to 10:0, and a Word2vec model was trained on each corpus. The cosine similarities between biomedical terms obtained from the Word2vec models were then compared across models. The results indicated that the models trained on both BW and PMC data outperformed the model trained on medical data alone: the similarity between extracted biomedical terms increased when the ratio of biomedical domain to general domain data was between 3:7 and 5:5. These findings allow NLP researchers to apply Word2vec with better-informed corpus choices and to increase the similarity of extracted biomedical terms, improving their effectiveness in NLP applications such as biomedical information extraction.
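
As a rough illustration of the experimental setup described above, the sketch below trains one gensim Word2vec model per BW:PMC mixing ratio and probes the cosine similarity of a biomedical term pair. The file names, hyperparameters, and the probe pair are illustrative placeholders, not the authors' exact configuration (gensim 4.x is assumed).

```python
# A minimal sketch, assuming two pre-tokenized corpora with one sentence
# per line. Paths, hyperparameters and the probe terms are placeholders.
import random
from gensim.models import Word2Vec

def load_sentences(path):
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

pmc = load_sentences("pmc_abstracts.txt")   # biomedical domain (PMC)
bw  = load_sentences("billion_word.txt")    # general domain (BW)

def mixed_corpus(bw_parts, pmc_parts, total=100_000):
    """Sample sentences at a given BW:PMC ratio, e.g. 3:7."""
    n_bw = total * bw_parts // (bw_parts + pmc_parts)
    corpus = random.sample(bw, min(n_bw, len(bw))) + \
             random.sample(pmc, min(total - n_bw, len(pmc)))
    random.shuffle(corpus)
    return corpus

# One model per ratio; the paper sweeps BW:PMC from 0:10 to 10:0.
for bw_parts in range(11):
    model = Word2Vec(mixed_corpus(bw_parts, 10 - bw_parts),
                     vector_size=200, window=5, min_count=5, sg=1)
    try:
        print(bw_parts, model.wv.similarity("aspirin", "ibuprofen"))
    except KeyError:
        print(bw_parts, "probe pair not in vocabulary")
```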


Author(s):  
Neha Warikoo ◽  
Yung-Chun Chang ◽  
Wen-Lian Hsu

Abstract Motivation: Natural language processing (NLP) techniques are constantly being advanced to accommodate the influx of data and to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, detecting relations between bio-entities, known as the Bio-Entity Relation Extraction (BRE) task, has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embeddings have improved BRE predictive analytics, these works are often task-selective or rely on external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, LBERT, a lexically aware Transformer-based bidirectional encoder representation model that exploits both local and global context representations for sentence-level classification tasks. Results: This article presents one of the most exhaustive BRE studies conducted to date, covering five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models on protein–protein interaction (PPI), drug–drug interaction, and protein–bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations on large corpora such as PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context alongside the bidirectionally learned global context. Availability and implementation: https://github.com/warikoone/LBERT. Supplementary information: Supplementary data are available at Bioinformatics online.
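
The abstract credits part of LBERT's gain to "distance-adjusted attention". The sketch below shows one plausible reading of that idea, penalizing attention logits by inter-token distance so nearby context is weighted more heavily; it is an assumption for illustration, not the published formulation (see the repository above for the actual implementation).

```python
# Hypothetical distance-adjusted attention for a single head: logits are
# reduced by alpha * |i - j|, biasing each token toward local context.
import torch

def distance_adjusted_attention(q, k, v, alpha=0.1):
    """q, k, v: (seq_len, d) tensors for one attention head."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                 # scaled dot-product logits
    pos = torch.arange(seq_len, dtype=torch.float)
    dist = (pos[:, None] - pos[None, :]).abs()  # token distance |i - j|
    return torch.softmax(scores - alpha * dist, dim=-1) @ v

q = k = v = torch.randn(8, 64)
out = distance_adjusted_attention(q, k, v)      # shape (8, 64)
```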


2021 ◽  
Author(s):  
Yan Hu ◽  
Shujian Sun ◽  
Thomas Rowlands ◽  
Tim Beck ◽  
Joram Matthias Posma

Motivation: The availability of improved natural language processing (NLP) algorithms and models enables researchers to analyze larger corpora using open-source tools. Text mining of biomedical literature is one area in which NLP has been applied in recent years, with large untapped potential. However, to generate corpora that can be analyzed with machine learning NLP algorithms, the source documents first need to be standardized. Summarizing data from the literature for storage in databases typically requires manual curation, especially when extracting data from result tables. Results: We present an automated pipeline that cleans HTML files from the biomedical literature. The output is a single JSON file that contains the text of each section, table data in machine-readable format, and lists of phenotypes and abbreviations found in the article. We analyzed a total of 2,441 Open Access articles from PubMed Central, from both Genome-Wide and Metabolome-Wide Association Studies, and developed a model to standardize the section headers based on the Information Artifact Ontology. Extraction of table data was developed on PubMed articles and fine-tuned using the equivalent publisher versions. Availability: The Auto-CORPus package is freely available with detailed instructions from GitHub at https://github.com/jmp111/AutoCORPus/.
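
To make the pipeline's output concrete, here is a minimal sketch of the kind of HTML-to-JSON conversion described, using BeautifulSoup; the JSON schema shown is illustrative, and Auto-CORPus's actual output format is documented in its repository.

```python
# Collect section text under each heading and flatten tables to rows of
# cell strings; the schema here is an assumption, not Auto-CORPus's own.
import json
from bs4 import BeautifulSoup

def html_to_json(html):
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for heading in soup.find_all(["h2", "h3"]):
        text = []
        for sib in heading.find_next_siblings():
            if sib.name in ("h2", "h3"):        # stop at the next section
                break
            if sib.name == "p":
                text.append(sib.get_text(" ", strip=True))
        sections.append({"header": heading.get_text(strip=True),
                         "text": " ".join(text)})
    tables = [[[c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
               for tr in tbl.find_all("tr")]
              for tbl in soup.find_all("table")]
    return json.dumps({"sections": sections, "tables": tables}, indent=2)
```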


2019 ◽  
Author(s):  
Peng Su ◽  
Gang Li ◽  
Cathy Wu ◽  
K. Vijay-Shanker

Abstract Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require large amounts of annotated training data, while often only small labeled datasets are available for many NLP tasks in the biomedical literature. Building large datasets for deep learning is expensive, since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained by distant supervision. Because data obtained by distant supervision are often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
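
The core idea, distant supervision plus heuristic denoising, can be sketched in a few lines: sentences mentioning an entity pair found in a knowledge base are labeled positive, and simple filters prune likely mislabels. The knowledge-base pairs and the specific heuristics below are illustrative assumptions, not the paper's exact rules.

```python
# Hypothetical PPI pairs standing in for a real knowledge base.
known_pairs = {("BRCA1", "TP53"), ("EGFR", "GRB2")}

def distant_label(e1, e2):
    """Label by knowledge-base lookup instead of manual annotation."""
    return (e1, e2) in known_pairs or (e2, e1) in known_pairs

def keep(sentence, e1, e2, max_tokens=60, max_gap=20):
    """Denoising heuristics: very long sentences and widely separated
    entity mentions are more often false positives."""
    tokens = sentence.split()
    if len(tokens) > max_tokens or e1 not in tokens or e2 not in tokens:
        return False
    return abs(tokens.index(e1) - tokens.index(e2)) <= max_gap

sent = "BRCA1 directly binds TP53 and modulates its activity ."
if distant_label("BRCA1", "TP53") and keep(sent, "BRCA1", "TP53"):
    print("kept as a positive training example")
```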


2015 ◽  
Vol 8 (2) ◽  
pp. 1-15 ◽  
Author(s):  
Aicha Ghoulam ◽  
Fatiha Barigou ◽  
Ghalem Belalem

Information extraction (IE) is a natural language processing (NLP) task whose aim is to analyze texts written in natural language to extract structured and useful information, such as named entities and the semantic relations between them. Information extraction is an important task in a diverse set of applications, such as biomedical literature mining, customer care, community websites, and personal information management. In this paper, the authors focus only on information extraction from clinical reports. The two most fundamental tasks in information extraction are discussed, namely named entity recognition and relation extraction. The authors give details about the most widely used rule/pattern-based and machine learning techniques for each task. They also compare these techniques and summarize the advantages and disadvantages of each.
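
As a toy example of the rule/pattern-based techniques the survey covers, the snippet below uses regular expressions to pull drug-dose mentions and dates from a clinical note; real systems layer many such patterns with dictionaries and machine-learned classifiers.

```python
# Illustrative patterns only; clinical NER systems use far richer rules.
import re

note = "Patient started on metformin 500 mg twice daily on 2015-03-02."

DOSE = re.compile(r"\b(\w+)\s+(\d+(?:\.\d+)?)\s*(mg|mcg|g)\b")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

for m in DOSE.finditer(note):
    print("DRUG_DOSE:", m.group(0))   # -> metformin 500 mg
for m in DATE.finditer(note):
    print("DATE:", m.group(0))        # -> 2015-03-02
```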


2021 ◽  
Author(s):  
Peng Su ◽  
K. Vijay-Shanker

Abstract Background: Automatically extracting biomedical relations has recently become a significant subject in biomedical research due to the rapid growth of the biomedical literature. Since their adaptation to the biomedical domain, transformer-based BERT models have produced leading results on many biomedical natural language processing tasks. In this work, we explore approaches to improving the BERT model for relation extraction tasks in both the pre-training and fine-tuning stages of its application. In the pre-training stage, we add another level of BERT adaptation on sub-domain data to bridge the gap between domain knowledge and task-specific knowledge. We also propose methods to incorporate knowledge that is otherwise ignored in the last layer of BERT to improve fine-tuning. Results: The experimental results demonstrate that our approaches to pre-training and fine-tuning can improve BERT model performance. After combining the two proposed techniques, our approach outperforms the original BERT models with an average F1 score improvement of 2.1% on relation extraction tasks. Moreover, our approach achieves state-of-the-art performance on three relation extraction benchmark datasets. Conclusions: The extra pre-training step on sub-domain data can help the BERT model generalize on specific tasks, and our proposed fine-tuning mechanism can utilize the knowledge in the last layer of BERT to boost model performance. Furthermore, the combination of these two approaches further improves BERT's performance on relation extraction tasks.
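
One plausible reading of "incorporating the knowledge in the last layer" is to classify from a pooling of all last-layer token states rather than from [CLS] alone; the sketch below shows that variant with Hugging Face Transformers and BioBERT. It is an illustrative assumption, not necessarily the authors' exact mechanism.

```python
# Concatenate the [CLS] vector with mean-pooled last-layer token states
# before the relation classifier head (binary label assumed).
import torch
from transformers import AutoModel, AutoTokenizer

name = "dmis-lab/biobert-base-cased-v1.1"
tok, bert = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
clf = torch.nn.Linear(2 * bert.config.hidden_size, 2)

enc = tok("MTOR phosphorylates AKT1 at Ser473 .", return_tensors="pt")
hidden = bert(**enc).last_hidden_state        # (1, seq_len, hidden)
cls, mean = hidden[:, 0], hidden.mean(dim=1)  # [CLS] + pooled token states
logits = clf(torch.cat([cls, mean], dim=-1))  # relation prediction logits
```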


2020 ◽  
Author(s):  
Bhrugesh Joshi ◽  
Vishvajit Bakarola ◽  
Parth Shah ◽  
Ramar Krishnamurthy

Abstract The recent pandemic caused by the novel coronavirus (nCoV-2019) originating in Wuhan, China, created a large-scale public health emergency. It demands novel research on vaccines to fight the pandemic, re-purposing of existing drugs, phylogenetic analysis to identify the virus's origin and determine its similarity to other known viruses, and more. The first task facing the research community is to analyze the wide variety of existing related research articles, which is highly time-consuming in situations where each minute counts for saving hundreds of human lives. Entirely manual processing further lowers the efficiency of mining the information. We have developed a complete automatic literature mining system that delivers efficient and fast mining from existing biomedical literature databases. With the help of modern deep learning algorithms, our system also delivers summaries of important research articles, allowing easy and fast comprehension of critical research articles. The system currently scans nearly 146,115,136 English words from 29,315 research articles in no more than 1.5 seconds with multiple search keywords. Our article presents the criticality of literature mining, especially in pandemic situations, together with the implementation and online deployment of the system.
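
A multi-keyword scan of that kind is typically served from an inverted index built once over the corpus; the sketch below shows the idea on two toy abstracts. The in-memory layout is an illustrative assumption, not the authors' implementation.

```python
# Build an inverted index (word -> article IDs), then intersect posting
# sets to answer multi-keyword queries quickly.
from collections import defaultdict

abstracts = {
    "PMC001": "spike protein binding to the ACE2 receptor ...",
    "PMC002": "phylogenetic analysis of coronavirus genomes ...",
}

index = defaultdict(set)
for pmcid, text in abstracts.items():
    for word in set(text.lower().split()):
        index[word].add(pmcid)

def search(*keywords):
    """Return IDs of articles containing every keyword."""
    hits = [index[k.lower()] for k in keywords]
    return set.intersection(*hits) if hits else set()

print(search("ACE2", "spike"))   # -> {'PMC001'}
```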


Author(s):  
Sendong Zhao ◽  
Chang Su ◽  
Zhiyong Lu ◽  
Fei Wang

Abstract Recent years have witnessed a rapid increase in the number of scientific articles in the biomedical domain. This literature is mostly available and readily accessible in electronic format. The domain knowledge hidden in it is critical for biomedical research and applications, which makes biomedical literature mining (BLM) techniques highly in demand. Numerous efforts have been made on this topic from both the biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on concrete application problems and thus prefers more interpretable and descriptive methods, while the CS community pursues superior performance and generalization ability, and thus develops more sophisticated and universal models. The goal of this paper is to review the recent advances in BLM from both communities and inspire new research directions.


2014 ◽  
Vol 687-691 ◽  
pp. 1149-1152
Author(s):  
Jing Peng ◽  
Hong Min Sun

The number of biomedical publications is growing rapidly, and biomedical literature mining is becoming essential. An approach to article processing during text preprocessing is proposed in order to improve the performance of biomedical literature mining. The approach combines Web counts with corpus counts in order to overcome the noise inherent in Web data. We show experimentally that the combined models perform best compared with the pure Web and pure corpus models, achieving a best precision of 89.1% over all article forms and 88.7% on the article-loss class.
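
The count-combination step can be sketched as a linear interpolation of relative frequencies from the two sources, so that corpus evidence damps noisy Web counts; the candidate strings, counts, and weight below are illustrative assumptions rather than the paper's task definition.

```python
# Score competing surface forms by mixing Web and corpus relative
# frequencies; lam balances the two sources (0.5 = equal weight).
def combined_score(web_count, corpus_count, web_total, corpus_total, lam=0.5):
    return (lam * web_count / web_total
            + (1 - lam) * corpus_count / corpus_total)

candidates = {
    "tumor necrosis factor":  (5_400_000, 890),
    "tumour necrosis factor": (2_100_000, 310),
    "tumour necrosis facter": (1_200, 0),   # Web noise: typos still get hits
}
best = max(candidates,
           key=lambda c: combined_score(*candidates[c],
                                        web_total=1e10, corpus_total=1e6))
print(best)   # -> 'tumor necrosis factor'
```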

