SicknessMiner: a deep-learning-driven text-mining tool to abridge disease-disease associations

Abstract Background Blood cancers (BCs) are responsible for over 720 K yearly deaths worldwide. Their prevalence and mortality rate underscore the relevance of research related to BCs. Despite the availability of different resources establishing Disease-Disease Associations (DDAs), the knowledge is scattered and not accessible in a straightforward way to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the DDAs retrieved were compared qualitatively and quantitatively against the DisGeNET resource. Results We obtained the DDAs via co-mention using our SicknessMiner or gene- or variant-disease similarity on DisGeNET. SicknessMiner was able to retrieve around 92% of the DisGeNET results and nearly 15% of the SicknessMiner results were specific to our pipeline. Conclusions SicknessMiner is a valuable tool to extract disease-disease relationships from a raw input corpus.
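
The co-mention idea described in this abstract can be illustrated with a minimal Python sketch. It assumes that NER and NEN have already produced a list of normalized disease identifiers per document. The function names, the placeholder identifiers (`D1`, `D2`, `D3`) and the overlap statistics are illustrative assumptions and are not taken from the SicknessMiner implementation.

```python
from collections import Counter
from itertools import combinations


def co_mention_pairs(doc_entities):
    """Count the documents in which each unordered pair of disease IDs co-occurs."""
    counts = Counter()
    for ids in doc_entities:
        # Deduplicate and sort so that (a, b) and (b, a) share one key.
        for pair in combinations(sorted(set(ids)), 2):
            counts[pair] += 1
    return counts


def overlap_stats(mined, reference):
    """Return (share of reference pairs recovered, share of mined pairs absent from reference)."""
    mined, reference = set(mined), set(reference)
    recovered = len(mined & reference) / len(reference) if reference else 0.0
    specific = len(mined - reference) / len(mined) if mined else 0.0
    return recovered, specific


# Hypothetical normalized disease IDs per document (placeholders, not real data).
docs = [["D1", "D2", "D2"], ["D2", "D3"], ["D1", "D2"]]
pairs = co_mention_pairs(docs)
```

Comparing the keys of `pairs` with a reference set of associations through `overlap_stats` gives the two kinds of figures the abstract reports: the share of reference associations recovered and the share of mined associations that are specific to the pipeline.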

A Neural Named Entity Recognition and Multi-Type Normalization Tool for Biomedical Text Mining

IEEE Access ◽

10.1109/access.2019.2920708 ◽

2019 ◽

Vol 7 ◽

pp. 73729-73740 ◽

Cited By ~ 13

Author(s):

Donghyeon Kim ◽

Jinhyuk Lee ◽

Chan Ho So ◽

Hwisang Jeon ◽

Minbyul Jeong ◽

...

Keyword(s):

Text Mining ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biomedical Text ◽

Biomedical Text Mining ◽

Named Entity

Tagger: BeCalm API for rapid named entity recognition

10.1101/115022 ◽

2017 ◽

Cited By ~ 2

Author(s):

Lars Juhl Jensen

Keyword(s):

Open Access ◽

Text Mining ◽

Real Time ◽

Named Entity Recognition ◽

Entity Recognition ◽

The Real ◽

Practical Applications ◽

Named Entity ◽

Highly Efficient

Abstract Most BioCreative tasks to date have focused on assessing the quality of text-mining annotations in terms of precision and recall. Interoperability, speed, and stability are, however, other important factors to consider for practical applications of text mining. The new BioCreative/BeCalm TIPS task focuses purely on these. To participate in this task, I implemented a BeCalm API within the real-time tagging server also used by the Reflect and EXTRACT tools. In addition to retrieval of patent abstracts, PubMed abstracts, and PubMed Central open-access articles as required in the TIPS task, the BeCalm API implementation facilitates retrieval of documents from other sources specified as custom request parameters. As in earlier tests, the tagger proved to be both highly efficient and stable, being able to consistently process requests of 5000 abstracts in less than half a minute including retrieval of the document text.

OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition

Genomics & Informatics ◽

10.5808/gi.21015 ◽

2021 ◽

Vol 19 (3) ◽

pp. e27

Author(s):

Pierre Larmande ◽

Yusha Liu ◽

Xinzhi Yao ◽

Jingbo Xia

Keyword(s):

Natural Language ◽

Language Processing ◽

Rapid Evolution ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biological Databases ◽

Named Entity ◽

Biological Domain ◽

Or Gene ◽

Protein Dataset

Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pre-trained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
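
Dictionary-based annotation of the kind mentioned in this abstract can be sketched in Python as follows. This is a minimal illustration, assuming a dictionary that maps names to database identifiers. The entries (`GENE A`, `GENE A1`) and the identifiers are hypothetical placeholders, and the longest-match-first regular expression is one common choice rather than the method used for OryzaGP.

```python
import re


def annotate(text, dictionary):
    """Return (start, end, surface form, identifier) for each dictionary name found in text."""
    if not dictionary:
        return []
    # Case-insensitive lookup from name to identifier.
    lookup = {name.lower(): ident for name, ident in dictionary.items()}
    # Put longer names first so that "GENE A1" is preferred over "GENE A".
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical dictionary entries and identifiers, for illustration only.
gene_dictionary = {"GENE A": "ID:0001", "GENE A1": "ID:0002"}
annotations = annotate("Expression of gene a1 and GENE A was measured.", gene_dictionary)
```

Each returned tuple carries character offsets, so the annotations can be written out in a stand-off format next to the abstract text.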

Istex: A Database of Twenty Million Scientific Papers with a Mining Tool Which Uses Named Entities

Information ◽

10.3390/info10050178 ◽

2019 ◽

Vol 10 (5) ◽

pp. 178 ◽

Cited By ~ 1

Author(s):

Denis Maurel ◽

Enza Morale ◽

Nicolas Thouvenin ◽

Patrice Ringot ◽

Angel Turri

Keyword(s):

Full Text ◽

Named Entity Recognition ◽

Entity Recognition ◽

Good Precision ◽

Named Entities ◽

French Government ◽

Named Entity ◽

Scientific Papers ◽

Mining Tool ◽

Short Time

Istex is a database of twenty million full-text scientific papers bought by the French Government for the use of academic libraries. Papers are usually searched for by title, authors, keywords or possibly the abstract. To enable new types of queries on Istex, we implemented a system of named entity recognition on all papers and we offer users the possibility to run searches on these entities. After presenting the French Istex project, we detail in this paper the named entity recognition with CasEN, a cascade of graphs, implemented on the Unitex Software. CasEN exists in French, but not in English. The first challenge was to build a new cascade in a short time. Its evaluation showed a good Precision measure, even if the Recall was not very good. Precision was very important for this project, to ensure that a query did not return unwanted papers. The second challenge was the implementation of Unitex to parse around twenty million documents. We used a dockerized application. Finally, we also explain how to query the resulting named entities on the Istex website.

Text mining of 15 million full-text scientific articles

10.1101/162099 ◽

2017 ◽

Cited By ~ 5

Author(s):

David Westergaard ◽

Hans-Henrik Stærfeldt ◽

Christian Tønsberg ◽

Lars Juhl Jensen ◽

Søren Brunak

Keyword(s):

Text Mining ◽

Full Text ◽

Disease Gene ◽

Scientific Literature ◽

Named Entity Recognition ◽

Recognition System ◽

Entity Recognition ◽

Data Sets ◽

Named Entity ◽

Benchmark Data

Abstract Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

A BERT-Based Hybrid System for Chemical Identification and Indexing in Full-Text Articles

10.1101/2021.10.27.466183 ◽

2021 ◽

Author(s):

Arslan Erdengasileng ◽

Keqiao Li ◽

Qing Han ◽

Shubo Tian ◽

Jian Wang ◽

...

Keyword(s):

Text Mining ◽

Information Extraction ◽

Full Text ◽

Data Augmentation ◽

Named Entity Recognition ◽

Entity Recognition ◽

Chemical Identification ◽

Dictionary Matching ◽

Named Entity ◽

Chemical Named Entity Recognition

Identification and indexing of chemical compounds in full-text articles are essential steps in biomedical article categorization, information extraction, and biological text mining. The BioCreative Challenge was established to evaluate methods for biological text mining and information extraction. Track 2 of BioCreative VII (summer 2021) consists of two subtasks: chemical identification and chemical indexing in full-text PubMed articles. The chemical identification subtask also includes two parts: chemical named entity recognition (NER) and chemical normalization. In this paper, we present our work on developing a hybrid pipeline for chemical named entity recognition, chemical normalization, and chemical indexing in full-text PubMed articles. Specifically, we applied BERT-based methods for chemical NER and chemical indexing, and a sieve-based dictionary matching method for chemical normalization. For subtask 1, we used PubMedBERT with data augmentation on the chemical NER task. Several chemical-MeSH dictionaries including MeSH.XML, SUPP.XML, MRCONSO.RRF, and PubTator chemical annotations are used in a specific order to get the best performance on chemical normalization. We achieved F1 scores of 0.86 and 0.7668 on chemical NER and chemical normalization, respectively. For subtask 2, we formulated it as a binary prediction problem for each individual chemical compound name. We then used a BERT-based model with engineered features and achieved a strict F1 score of 0.4825 on the test set, which is substantially higher than the median F1 score (0.3971) of all the submissions.
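
A sieve-based dictionary matcher tries a sequence of progressively looser matching rules and stops at the first hit. The Python sketch below is only an illustration of that general pattern. The three sieves (exact, case-insensitive, alphanumeric-only), the lookup order and the placeholder identifiers (`CHEM:1`, `CHEM:2`) are assumptions, not the rules or dictionaries used by the authors.

```python
import re


def alnum_key(s):
    """Lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]+", "", s.lower())


# Sieves ordered from strictest to loosest (an assumed ordering).
SIEVES = [
    lambda s: s,          # exact string match
    lambda s: s.lower(),  # case-insensitive match
    alnum_key,            # ignore case, spacing and punctuation
]


def sieve_normalize(mention, dictionaries):
    """Return the identifier from the first sieve and dictionary that match, else None."""
    for transform in SIEVES:
        key = transform(mention)
        # Dictionaries are consulted in the order given by the caller.
        for dictionary in dictionaries:
            for name, ident in dictionary.items():
                if transform(name) == key:
                    return ident
    return None


# Hypothetical dictionaries and identifiers, for illustration only.
dictionaries = [
    {"Aspirin": "CHEM:1"},
    {"acetyl-salicylic acid": "CHEM:2"},
]
```

With these placeholder dictionaries, `sieve_normalize("aspirin", dictionaries)` resolves at the case-insensitive sieve, and `sieve_normalize("Acetylsalicylic Acid", dictionaries)` resolves only at the alphanumeric sieve.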

Text Mining of Disease-lifestyle Associations to Explain Comorbidities in Electronic Health Registries

10.1101/168211 ◽

2017 ◽

Author(s):

Lars Juhl Jensen

Keyword(s):

Text Mining ◽

Named Entity Recognition ◽

Lifestyle Factors ◽

Entity Recognition ◽

Named Entity ◽

Danish Health ◽

Underlying Causes ◽

Electronic Health ◽

Substance Consumption ◽

Health Registry

Mining of electronic health registries can reveal vast numbers of disease correlations (hereafter referred to as comorbidities for simplicity). However, the underlying causes can be hard to identify, in part because health registries usually do not record important lifestyle factors such as diet, substance consumption, and physical activity. To address this challenge, I developed a text-mining approach that uses dictionaries of diseases and lifestyle factors for named entity recognition and subsequently for co-occurrence extraction of disease–lifestyle associations from Medline. I show that this approach is able to extract many correct associations and provide proof-of-concept that these can provide plausible explanations for comorbidities observed in Swedish and Danish health registry data.

Context-aware multi-token concept recognition of biological entities

BMC Bioinformatics ◽

10.1186/s12859-021-04248-8 ◽

2021 ◽

Vol 22 (S11) ◽

Author(s):

Kwangmin Kim ◽

Doheon Lee

Keyword(s):

Language Processing ◽

Contextual Information ◽

Named Entity Recognition ◽

Knowledge Bases ◽

Entity Recognition ◽

Biological Knowledge ◽

Concept Recognition ◽

Named Entity ◽

Named Entity Normalization ◽

Biological Entities

Abstract Background Concept recognition is a term that corresponds to the two sequential steps of named entity recognition and named entity normalization, and plays an essential role in the field of bioinformatics. However, conventional dictionary-based methods have not sufficiently addressed the variation of concepts in actual use in the literature, resulting in particularly degraded performance in the recognition of multi-token concepts. Results In this paper, we propose a concept recognition method for multi-token biological entities using neural models combined with literature contexts. The key aspect of our method is utilizing the contextual information from the biological knowledge-bases for concept normalization, which is followed by a named entity recognition procedure. The model showed improved performance over conventional methods, particularly for multi-token concepts with higher variations. Conclusions We expect that our model can be utilized for effective concept recognition and a variety of natural language processing tasks in bioinformatics.

Text Mining and Hub Gene Network Analysis of Endometriosis

BioMed Research International ◽

10.1155/2021/5517145 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Yinuo Wang ◽

Songbiao Zhu ◽

Chengcheng Liu ◽

Haiteng Deng ◽

Zhenyu Zhang

Keyword(s):

Text Mining ◽

Interaction Analysis ◽

Named Entity Recognition ◽

Entity Recognition ◽

Hub Genes ◽

Targeted Interventions ◽

Protein Protein Interaction ◽

Named Entity ◽

Pubmed Database ◽

Gene Network Analysis

This study aims to systematically characterize endometriosis-associated genes based on text mining and to annotate the functions, pathways, and networks of endometriosis-associated hub genes. We extracted endometriosis-associated abstracts published between 1970 and 2020 from the PubMed database. A neural named entity recognition and multi-type normalization tool for biomedical text mining was used to recognize and normalize the genes and proteins embedded in the abstracts. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses were conducted to annotate the functions and pathways of recognized genes. Protein-protein interaction analysis was conducted on the genes significantly co-occurring with endometriosis to identify the endometriosis-associated hub genes. A total of 433 genes were recognized as endometriosis-associated genes (P < 0.05), and 154 pathways were significantly enriched (P < 0.05). A network of endometriosis-associated genes with 278 gene nodes and 987 interaction links was established. The 15 proteins that interacted with 20 or more other proteins were identified as the hub proteins of the endometriosis-associated protein network. This study provides novel insights into the hub genes that play key roles in the development of endometriosis and has implications for developing targeted interventions for endometriosis.
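
The hub criterion stated in this abstract (proteins that interact with 20 or more other proteins) amounts to a degree threshold on the interaction network. The Python sketch below shows that computation on an edge list. The function name, the node labels (`P0` to `P20`) and the toy network are hypothetical and do not reproduce the study's data or tooling.

```python
from collections import defaultdict


def hub_nodes(edges, min_degree=20):
    """Return the nodes that have at least min_degree distinct interaction partners."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # a self-interaction does not add a distinct partner
        neighbours[a].add(b)
        neighbours[b].add(a)
    return sorted(n for n, partners in neighbours.items() if len(partners) >= min_degree)


# Toy star-shaped network: P0 interacts with P1..P20 (placeholder labels).
toy_edges = [("P0", f"P{i}") for i in range(1, 21)]
hubs = hub_nodes(toy_edges)
```

Storing partners in sets means duplicate edges are counted once, so the threshold applies to the number of distinct interacting proteins rather than the number of edge records.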
