Recognizing software names in biomedical literature using machine learning

Software tools now are essential to research and applications in the biomedical domain. However, existing software repositories are mainly built using manual curation, which is time-consuming and unscalable. This study took the initiative to manually annotate software names in 1,120 MEDLINE abstracts and titles and used this corpus to develop and evaluate machine learning–based named entity recognition systems for biomedical software. Specifically, two strategies were proposed for feature engineering: (1) domain knowledge features and (2) unsupervised word representation features of clustered and binarized word embeddings. Our best system achieved an F-measure of 91.79% for recognizing software from titles and an F-measure of 86.35% for recognizing software from both titles and abstracts using inexact matching criteria. We then created a biomedical software catalog with 19,557 entries using the developed system. This study demonstrates the feasibility of using natural language processing methods to automatically build a high-quality software index from biomedical literature.

Download Full-text

Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks

BioMed Research International ◽

10.1155/2014/240403 ◽

2014 ◽

Vol 2014 ◽

pp. 1-6 ◽

Cited By ~ 49

Author(s):

Buzhou Tang ◽

Hongxin Cao ◽

Xiaolong Wang ◽

Qingcai Chen ◽

Hua Xu

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Biomedical Domain ◽

Crucial Step ◽

Named Entity ◽

Different Types ◽

Word Representation ◽

Biomedical Named Entity Recognition

Biomedical Named Entity Recognition (BNER), which extracts important entities such as genes and proteins, is a crucial step of natural language processing in the biomedical domain. Various machine learning-based approaches have been applied to BNER tasks and showed good performance. In this paper, we systematically investigated three different types of word representation (WR) features for BNER, including clustering-based representation, distributional representation, and word embeddings. We selected one algorithm from each of the three types of WR features and applied them to the JNLPBA and BioCreAtIvE II BNER tasks. Our results showed that all the three WR algorithms were beneficial to machine learning-based BNER systems. Moreover, combining these different types of WR features further improved BNER performance, indicating that they are complementary to each other. By combining all the three types of WR features, the improvements inF-measure on the BioCreAtIvE II GM and JNLPBA corpora were 3.75% and 1.39%, respectively, when compared with the systems using baseline features. To the best of our knowledge, this is the first study to systematically evaluate the effect of three different types of WR features for BNER tasks.

Download Full-text

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Data ◽

10.3390/data6070071 ◽

2021 ◽

Vol 6 (7) ◽

pp. 71

Author(s):

Gonçalo Carnaz ◽

Mário Antunes ◽

Vitor Beires Nogueira

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Automatic Identification ◽

Named Entities ◽

Related Data ◽

Named Entity ◽

Chain Of Custody ◽

Evidence Collection

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, being some of which difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named-entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.

Download Full-text

Obtaining Knowledge in Pathology Reports Through a Natural Language Processing Approach With Classification, Named-Entity Recognition, and Relation-Extraction Heuristics

JCO Clinical Cancer Informatics ◽

10.1200/cci.19.00008 ◽

2019 ◽

pp. 1-8 ◽

Cited By ~ 2

Author(s):

Tomasz Oliwa ◽

Steven B. Maron ◽

Leah M. Chase ◽

Samantha Lomnicki ◽

Daniel V.T. Catenacci ◽

...

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Classification Model ◽

Supervised Machine Learning ◽

Named Entity ◽

Pathology Reports

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.

Download Full-text

SCIENTIFIC NAMED ENTITY RECOGNITION WITH THE HELP OF MODERN METHODS

Bulletin Series of Physics & Mathematical Sciences ◽

10.51889/2021-3.1728-7901.11 ◽

2021 ◽

Vol 75 (3) ◽

pp. 94-99

Author(s):

A.M. Yelenov ◽

◽

A.B. Jaxylykova ◽

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Recognition ◽

Support Vector ◽

Scientific Article ◽

Natural Languages ◽

Named Entity ◽

Learning Area

This research focuses on a comparative study of the Named Entity Recognition task for scientific article texts. Natural language processing could be considered as one of the cornerstones in the machine learning area which devotes its attention to the problems connected with the understanding of different natural languages and linguistic analysis. It was already shown that current deep learning techniques have a good performance and accuracy in such areas as image recognition, pattern recognition, computer vision, that could mean that such technology probably would be successful in the neuro-linguistic programming area too and lead to a dramatic increase on the research interest on this topic. For a very long time, quite trivial algorithms have been used in this area, such as support vector machines or various types of regression, basic encoding on text data was also used, which did not provide high results. The following dataset was used to process the experiment models: Dataset Scientific Entity Relation Core. The algorithms used were Long short-term memory, Random Forest Classifier with Conditional Random Fields, and Named-entity recognition with Bidirectional Encoder Representations from Transformers. In the findings, the metrics scores of all models were compared to each other to make a comparison. This research is devoted to the processing of scientific articles, concerning the machine learning area, because the subject is not investigated on enough properly level.The consideration of this task can help machines to understand natural languages better, so that they can solve other neuro-linguistic programming tasks better, enhancing scores in common sense.

Download Full-text

Comparing general and specialized word embeddings for biomedical named entity recognition

PeerJ Computer Science ◽

10.7717/peerj-cs.384 ◽

2021 ◽

Vol 7 ◽

pp. e384

Author(s):

Rigo E. Ramos-Vargas ◽

Israel Román-Godínez ◽

Sulema Torres-Ramos

Keyword(s):

Named Entity Recognition ◽

Biomedical Literature ◽

Word Embedding ◽

The Other ◽

Entity Recognition ◽

Word Embeddings ◽

Named Entity ◽

Word Representation ◽

The One ◽

Biomedical Named Entity Recognition

Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as unique features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards; while the fourth evaluates the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better in the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC; while in the contextualized word embeddings the best results are presented in the specific ones. We conclude, therefore, when using classic word embeddings as features on the BioNER task, the general ones could be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.

Download Full-text

Morphological Skip-Gram: Replacing FastText characters n-gram with morphological knowledge

INTELIGENCIA ARTIFICIAL ◽

10.4114/intartif.vol24iss67pp1-17 ◽

2021 ◽

Vol 24 (67) ◽

pp. 1-17

Author(s):

Flávio Arthur O. Santos ◽

Thiago Dias Bispo ◽

Hendrik Teixeira Macedo ◽

Cleber Zanchettin

Keyword(s):

Language Processing ◽

Hate Speech ◽

Named Entity Recognition ◽

Entity Recognition ◽

Training Phase ◽

Word Embeddings ◽

Speech Detection ◽

Word Representation ◽

N Gram ◽

Good Word

Natural language processing systems have attracted much interest of the industry. This branch of study is composed of some applications such as machine translation, sentiment analysis, named entity recognition, question and answer, and others. Word embeddings (i.e., continuous word representations) are an essential module for those applications generally used as word representation to machine learning models. Some popular methods to train word embeddings are GloVe and Word2Vec. They achieve good word representations, despite limitations: both ignore morphological information of the words and consider only one representation vector for each word. This approach implies the word embeddings does not consider different word contexts properly and are unaware of its inner structure. To mitigate this problem, the other word embeddings method FastText represents each word as a bag of characters n-grams. Hence, a continuous vector describes each n-gram, and the final word representation is the sum of its characters n-grams vectors. Nevertheless, the use of all n-grams character of a word is a poor approach since some n-grams have no semantic relation with their words and increase the amount of potentially useless information. This approach also increase the training phase time. In this work, we propose a new method for training word embeddings, and its goal is to replace the FastText bag of character n-grams for a bag of word morphemes through the morphological analysis of the word. Thus, words with similar context and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations considering 15 different tasks, and the results show a competitive performance compared to FastText. Moreover, the proposed model is $40\%$ faster than FastText in the training phase. We also outperform the baseline approaches in extrinsic evaluations through Hate speech detection and NER tasks using different scenarios.

Download Full-text

Natural Language Processing Model for Automatic Analysis of Cybersecurity-Related Documents

Symmetry ◽

10.3390/sym12030354 ◽

2020 ◽

Vol 12 (3) ◽

pp. 354

Author(s):

Tiberiu-Marian Georgescu

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Web Application ◽

Named Entity Recognition ◽

Relation Extraction ◽

Entity Recognition ◽

Model Based ◽

The Way

This paper describes the development and implementation of a natural language processing model based on machine learning which performs cognitive analysis for cybersecurity-related documents. A domain ontology was developed using a two-step approach: (1) the symmetry stage and (2) the machine adjustment. The first stage is based on the symmetry between the way humans represent a domain and the way machine learning solutions do. Therefore, the cybersecurity field was initially modeled based on the expertise of cybersecurity professionals. A dictionary of relevant entities was created; the entities were classified into 29 categories and later implemented as classes in a natural language processing model based on machine learning. After running successive performance tests, the ontology was remodeled from 29 to 18 classes. Using the ontology, a natural language processing model based on a supervised learning model was defined. We trained the model using sets of approximately 300,000 words. Remarkably, our model obtained an F1 score of 0.81 for named entity recognition and 0.58 for relation extraction, showing superior results compared to other similar models identified in the literature. Furthermore, in order to be easily used and tested, a web application that integrates our model as the core component was developed.

Download Full-text

Predictive article recommendation using natural language processing and machine learning to support evidence updates in domain-specific knowledge graphs

JAMIA Open ◽

10.1093/jamiaopen/ooaa028 ◽

2020 ◽

Vol 3 (3) ◽

pp. 332-337

Author(s):

Bhuvan Sharma ◽

Van C Willis ◽

Claudia S Huettner ◽

Kirk Beaty ◽

Jane L Snowdon ◽

...

Keyword(s):

Machine Learning ◽

Language Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Human Cognition ◽

Current Evidence ◽

Knowledge Graph ◽

Space Modeling ◽

Domain Specific Knowledge ◽

Knowledge Graphs

Abstract Objectives Describe an augmented intelligence approach to facilitate the update of evidence for associations in knowledge graphs. Methods New publications are filtered through multiple machine learning study classifiers, and filtered publications are combined with articles already included as evidence in the knowledge graph. The corpus is then subjected to named entity recognition, semantic dictionary mapping, term vector space modeling, pairwise similarity, and focal entity match to identify highly related publications. Subject matter experts review recommended articles to assess inclusion in the knowledge graph; discrepancies are resolved by consensus. Results Study classifiers achieved F-scores from 0.88 to 0.94, and similarity thresholds for each study type were determined by experimentation. Our approach reduces human literature review load by 99%, and over the past 12 months, 41% of recommendations were accepted to update the knowledge graph. Conclusion Integrated search and recommendation exploiting current evidence in a knowledge graph is useful for reducing human cognition load.

Download Full-text

Applying Citizen Science to Gene, Drug, and Disease Relationship Extraction from Biomedical Abstracts

Bioinformatics ◽

10.1093/bioinformatics/btz678 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ginger Tsueng ◽

Max Nanis ◽

Jennifer T Fouquier ◽

Michael Mayers ◽

Benjamin M Good ◽

...

Keyword(s):

Information Extraction ◽

Citizen Science ◽

Computational Methods ◽

Language Processing ◽

Named Entity Recognition ◽

Biomedical Literature ◽

Entity Recognition ◽

Supplementary Information ◽

Named Entity ◽

Relationship Extraction

Abstract Motivation Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge contained therein. To mine valuable inferences from the large volume of literature, many researchers use information extraction algorithms to harvest information in biomedical texts. Information extraction is usually accomplished via a combination of manual expert curation and computational methods. Advances in computational methods usually depends on the time-consuming generation of gold standards by a limited number of expert curators. Citizen science is public participation in scientific research. We previously found that citizen scientists are willing and capable of performing named entity recognition of disease mentions in biomedical abstracts, but did not know if this was true with relationship extraction. Results In this paper, we introduce the Relationship Extraction Module of the web-based application Mark2Cure and demonstrate that citizen scientists can perform relationship extraction. We confirm the importance of accurate named entity recognition on user performance of relationship extraction and identify design issues that impacted data quality. We find that the data generated by citizen scientists can be used to identify relationship types not currently available in the Mark2Cure Relationship Extraction Module. We compare the citizen science-generated data with algorithm-mined data and identify ways in which the two approaches may complement one another. We also discuss opportunities for future improvement of this system, as well as the potential synergies between citizen science, manual biocuration, and natural language processing. Availability Mark2Cure platform: https://mark2cure.org. Mark2Cure source code: https://github.com/sulab/mark2cure Data and analysis code for this paper: https://github.com/gtsueng/M2C_rel_nb Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Chemical Entity Recognition and Resolution to ChEBI

ISRN Bioinformatics ◽

10.5402/2012/619427 ◽

2012 ◽

Vol 2012 ◽

pp. 1-9 ◽

Cited By ~ 10

Author(s):

Tiago Grego ◽

Catia Pesquita ◽

Hugo P. Bastos ◽

Francisco M. Couto

Keyword(s):

Machine Learning ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Resolution ◽

Chemical Entity ◽

Biomedical Literature ◽

Entity Recognition ◽

Named Entity ◽

Lexical Similarity ◽

Recognition Systems

Chemical entities are ubiquitous through the biomedical literature and the development of text-mining systems that can efficiently identify those entities are required. Due to the lack of available corpora and data resources, the community has focused its efforts in the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular-dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity-resolution task, and 15% for combined entity recognition and resolution tasks.

Download Full-text