Evaluating Named-Entity Recognition approaches in plant molecular biology

Abstract: Text mining is becoming an important research topic in biology; it aims to extract biological entities from scientific papers in order to extend biological knowledge. However, few thorough text-mining studies and applications have been developed for plant molecular biology data, especially for rice, which results in a lack of datasets for training models that detect entities such as genes, proteins and phenotypic traits. Because benchmarks for rice are rare, exploiting advanced machine learning methods for accurate analysis of the rice literature is difficult. In this article, we developed a new training dataset (Oryzabase) as a benchmark. We then evaluated the performance of several current approaches to find the methodology with the best results, and adopted it as the state-of-the-art baseline for our own future technique. We applied a Named Entity Recognition (NER) tagger, built from a Long Short-Term Memory (LSTM) model combined with Conditional Random Fields (CRFs), to extract information on rice genes and proteins. We analyzed the performance of LSTM-CRF when applied to the Oryzabase dataset and improved the results up to 86% in F1. We found that, on average, the results from LSTM-CRF are more exploitable with the new benchmark.
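The abstract reports F1 but does not say how it was computed. As a minimal sketch, assuming the common convention of exact-match, entity-level scoring over BIO tags, the metric could be computed as follows. The label names `GENE` and `PROT` and both function names are illustrative assumptions, not taken from the paper.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the tagger finds the gene but misses the protein.
gold = ["B-GENE", "I-GENE", "O", "B-PROT", "O"]
pred = ["B-GENE", "I-GENE", "O", "O", "O"]
precision, recall, f1 = entity_f1(gold, pred)
```

Under this convention a partially matched entity counts as both a false positive and a false negative, which is why span-level F1 is usually lower than token-level accuracy.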


Finding Out Biological Terms from Texts with CRFs for Reinforcement Learning

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.198-199.1345 ◽

2012 ◽

Vol 198-199 ◽

pp. 1345-1350

Author(s):

Zhao Hui Wang ◽

Wei Huang

Keyword(s):

Reinforcement Learning ◽

Text Mining ◽

Graphical Models ◽

Conditional Random Fields ◽

Learning Algorithm ◽

Named Entity Recognition ◽

Recall Rate ◽

Entity Recognition ◽

Biological Knowledge ◽

Markov Decision

The rapid growth of biological texts promotes the study of text mining, which focuses on mining biological knowledge from various unstructured documents. Most biological text mining efforts are based on identifying biological terms such as gene and protein names. Therefore, how to identify biological terms effectively from text has become one of the important problems in bioinformatics. Conditional random fields (CRFs), an important machine learning algorithm, are graphical models for modeling the probability of labels given the observations. They have traditionally been trained using a set of observation and label pairs. Here we use CRFs in a class of temporal learning algorithms, reinforcement learning. Consequently, the labels are actions that update the environment and affect the next observation. As a result, from the view of reinforcement learning, CRFs provide a way to model joint actions in a decentralized Markov decision process, which defines how agents can communicate with each other to choose the optimal joint action. We use the GENIA corpus to train and test the proposed approach. The results show that the system can identify biological terms in texts effectively. We obtain an average precision of 90.8%, an average recall of 90.6%, and an average F1 of 90.6% on six classes of biological terms. These results are better than those of many other biological named entity recognition systems.
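The reported averages can be sanity-checked with a short calculation. The sketch below assumes the averages are macro-averages over the six classes, which the abstract does not state. F1 is the harmonic mean of precision and recall, so the F1 of the averaged precision and recall is an upper bound on the average of per-class F1 scores. The per-class numbers in the second part are hypothetical and only illustrate the gap.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# F1 computed from the reported average precision and recall (in percent).
f1_of_averages = f1_score(90.8, 90.6)

# Hypothetical two-class case showing that macro-averaged F1 can be lower
# than the F1 of the averaged precision and recall.
per_class = [(1.0, 0.5), (0.5, 1.0)]  # (precision, recall) per class
macro_f1 = sum(f1_score(p, r) for p, r in per_class) / len(per_class)
avg_p = sum(p for p, _ in per_class) / len(per_class)
avg_r = sum(r for _, r in per_class) / len(per_class)
f1_of_avg = f1_score(avg_p, avg_r)
```

Here `f1_of_averages` rounds to 90.7, so the reported average F1 of 90.6% is consistent with per-class averaging.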


Bidirectional Long Short-Term Memory (BILSTM) with Conditional Random Fields (CRF) for Knowledge Named Entity Recognition in Online Judges (OJS)

International Journal on Natural Language Computing ◽

10.5121/ijnlc.2018.7401 ◽

2018 ◽

Vol 7 (4) ◽

pp. 01-08

Author(s):

Muhammad Asif Khan ◽

Tayyab Naveed ◽

Elmaam Yagoub ◽

Guojin Zhu

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Short Term Memory ◽

Named Entity Recognition ◽

Entity Recognition ◽

Short Term ◽

Term Memory ◽

Named Entity ◽

Long Short Term Memory


Cybersecurity named entity recognition using bidirectional long short-term memory with conditional random fields

Tsinghua Science & Technology ◽

10.26599/tst.2019.9010033 ◽

2021 ◽

Vol 26 (3) ◽

pp. 259-265

Author(s):

Pingchuan Ma ◽

Bo Jiang ◽

Zhigang Lu ◽

Ning Li ◽

Zhengwei Jiang

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Short Term Memory ◽

Named Entity Recognition ◽

Entity Recognition ◽

Short Term ◽

Term Memory ◽

Named Entity ◽

Long Short Term Memory


Evaluating Named-Entity Recognition Approaches in Plant Molecular Biology

Lecture Notes in Computer Science - Multi-disciplinary Trends in Artificial Intelligence ◽

10.1007/978-3-030-03014-8_19 ◽

2018 ◽

pp. 219-225 ◽

Cited By ~ 2

Author(s):

Huy Do ◽

Khoat Than ◽

Pierre Larmande

Keyword(s):

Molecular Biology ◽

Named Entity Recognition ◽

Plant Molecular Biology ◽

Entity Recognition ◽

Named Entity


Probabilistic vs deep learning based approaches for narrow domain NER in Spanish

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-179868 ◽

2020 ◽

Vol 39 (2) ◽

pp. 2015-2025

Author(s):

Orlando Ramos-Flores ◽

David Pinto ◽

Manuel Montes-y-Gómez ◽

Andrés Vázquez

Keyword(s):

Deep Learning ◽

Conditional Random Fields ◽

Short Term Memory ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Mexican Spanish ◽

Named Entity ◽

Long Short Term Memory ◽

Deep Learning Model

This work presents an experimental study on the task of Named Entity Recognition (NER) for a narrow domain in the Spanish language. The study considers two approaches commonly used for this kind of problem, namely a Conditional Random Fields (CRF) model and a Recurrent Neural Network (RNN). For the latter, we employed a bidirectional Long Short-Term Memory (BiLSTM) with ELMo pre-trained word embeddings for Spanish. The comparison between the probabilistic model and the deep learning model was carried out on two collections: the Spanish dataset from CoNLL-2002, with four classes under the IOB tagging schema, and a Mexican Spanish news dataset with seventeen classes under the IOBES schema. The paper presents an analysis of the scalability, robustness, and common errors of both models. This analysis indicates that, in general, the BiLSTM-ELMo model is more suitable than the CRF model when there is “enough” training data, and that it is more scalable, as its performance was not significantly affected in the incremental experiments (adding one class at a time). On the other hand, the results indicate that the CRF model is better suited to scenarios with small training datasets and many classes.
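The two tagging schemas mentioned above differ only in how entity boundaries are marked: IOBES adds `E-` for the last token of a multi-token entity and `S-` for a single-token entity. A minimal conversion sketch follows, assuming the input uses the IOB2 variant, in which every entity starts with `B-`. The function name and the `PER`/`LOC` labels are illustrative.

```python
def iob2_to_iobes(tags):
    """Convert an IOB2 tag sequence to IOBES (assumes well-formed IOB2 input)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- of the same type.
        continues = next_tag == "I-" + etype
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + etype)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + etype)
    return converted


# Hypothetical example: a three-token person name and a one-token location.
iob_tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
iobes_tags = iob2_to_iobes(iob_tags)
```

The conversion is lossless, so the choice of schema changes the number of labels the model must predict, not the annotated entities.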


Semi-Supervised Bidirectional Long Short-Term Memory and Conditional Random Fields Model for Named-Entity Recognition Using Embeddings from Language Models Representations

Entropy ◽

10.3390/e22020252 ◽

2020 ◽

Vol 22 (2) ◽

pp. 252 ◽

Cited By ~ 9

Author(s):

Min Zhang ◽

Guohua Geng ◽

Jing Chen

Keyword(s):

Deep Learning ◽

Conditional Random Fields ◽

Short Term Memory ◽

Named Entity Recognition ◽

Entity Recognition ◽

Short Term ◽

Named Entity ◽

Effective Performance ◽

Long Short Term Memory ◽

Cultural Relics

Increasingly popular online museums have significantly changed the way people acquire cultural knowledge. These online museums have been generating abundant cultural relics data. In recent years, researchers have used deep learning models, which can automatically extract complex features and have rich representation capabilities, to implement named-entity recognition (NER). However, the lack of labeled data in the field of cultural relics makes it difficult for deep learning models that rely on labeled data to achieve excellent performance. To address this problem, this paper proposes a semi-supervised deep learning model named SCRNER (Semi-supervised model for Cultural Relics’ Named Entity Recognition) that uses a bidirectional long short-term memory (BiLSTM) and conditional random fields (CRF) model trained on scarce labeled data and abundant unlabeled data to attain effective performance. For the semi-supervised sample selection, we propose a repeat-labeled (relabeled) strategy that selects high-confidence samples to enlarge the training set iteratively. In addition, we use embeddings from language models (ELMo) representations to dynamically acquire word representations as the input of the model, addressing the blurred boundaries of cultural objects and the Chinese characteristics of texts in the field of cultural relics. Experimental results demonstrate that our proposed model, trained on limited labeled data, achieves effective performance in the task of named entity recognition of cultural relics.
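The iterative enlargement of the training set can be illustrated with a generic self-training loop. This is a sketch of plain confidence-thresholded self-training, not the paper's relabeled strategy, whose details the abstract does not give. The names `train_fn` and `predict_fn`, the threshold, and the toy lexicon model are all hypothetical stand-ins for a real BiLSTM-CRF tagger.

```python
def self_train(train_fn, predict_fn, labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Grow the labeled set with high-confidence predictions, retraining each round."""
    labeled, pool = list(labeled), list(unlabeled)
    model = train_fn(labeled)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for tokens in pool:
            tags, confidence = predict_fn(model, tokens)
            target = confident if confidence >= threshold else remaining
            target.append((tokens, tags))
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = [tokens for tokens, _ in remaining]
        model = train_fn(labeled)
    return model, labeled


# Toy stand-in model (hypothetical): a word -> tag lexicon.
def toy_train(examples):
    lexicon = {}
    for tokens, tags in examples:
        lexicon.update(zip(tokens, tags))
    return lexicon


def toy_predict(lexicon, tokens):
    tags = [lexicon.get(token, "O") for token in tokens]
    confidence = sum(token in lexicon for token in tokens) / len(tokens)
    return tags, confidence


seed = [(["jade", "vase"], ["B-REL", "I-REL"])]
raw = [["jade", "vase"], ["bronze", "mirror"]]
model, grown = self_train(toy_train, toy_predict, seed, raw)
```

In the toy run only the sentence whose words are all known is added; the other stays in the unlabeled pool, and the loop stops once no sample clears the threshold.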


MSCAT: A Machine Learning Assisted Catalog of Metabolomics Software Tools

Metabolites ◽

10.3390/metabo11100678 ◽

2021 ◽

Vol 11 (10) ◽

pp. 678

Author(s):

Jonathan Dekermanjian ◽

Wladimir Labeikovsky ◽

Debashis Ghosh ◽

Katerina Kechris

Keyword(s):

Neural Network ◽

Machine Learning ◽

Conditional Random Fields ◽

Short Term Memory ◽

Named Entity Recognition ◽

Software Tool ◽

Software Tools ◽

Entity Recognition ◽

Metabolomics Data ◽

Multiple Characteristics

The bottleneck for taking full advantage of metabolomics data is often the availability, awareness, and usability of analysis tools. Software tools specifically designed for metabolomics data are being developed at an increasing rate, with hundreds of available tools already in the literature. Many of these tools are open-source and freely available but are very diverse with respect to language, data formats, and stages in the metabolomics pipeline. To help meet the increasing demand for guidance in choosing analytical tools and to coordinate the adoption of best practices for reproducibility, we have designed and built MSCAT (Metabolomics Software CATalog), a database of metabolomics software tools that can be sustainably and continuously updated. This database provides a survey of the landscape of available tools and can assist researchers in selecting data analysis workflows for metabolomics studies according to their specific needs. We used machine learning (ML) methodology to semi-automate the identification of metabolomics software tool names within abstracts. MSCAT searches the literature to find new software tools using a sentence-level Named Entity Recognition (NER) model: a neural network composed of a character-level convolutional neural network (CNN) combined with a bidirectional long short-term memory (LSTM) layer and a conditional random fields (CRF) layer. The list of potential new tools (and their associated publications) is then forwarded to the database maintainer, who curates the database entry corresponding to each tool. The end-user interface allows tools to be filtered by multiple characteristics and the aggregate tool data to be plotted to monitor the metabolomics software landscape.


Named entity recognition based on conditional random fields

Cluster Computing ◽

10.1007/s10586-017-1146-3 ◽

2017 ◽

Vol 22 (S3) ◽

pp. 5195-5206 ◽

Cited By ~ 4

Author(s):

Shengli Song ◽

Nan Zhang ◽

Haitao Huang

Keyword(s):

Random Fields ◽

Conditional Random Fields ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity


SicknessMiner: a deep-learning-driven text-mining tool to abridge disease-disease associations

BMC Bioinformatics ◽

10.1186/s12859-021-04397-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Nícia Rosário-Ferreira ◽

Victor Guimarães ◽

Vítor S. Costa ◽

Irina S. Moreira

Keyword(s):

Text Mining ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Disease Similarity ◽

Disease Associations ◽

Named Entity Normalization ◽

Mining Tool ◽

Or Gene ◽

Text Mining Tool

Background: Blood cancers (BCs) are responsible for over 720 K yearly deaths worldwide. Their prevalence and mortality rate uphold the relevance of research related to BCs. Despite the availability of different resources establishing Disease-Disease Associations (DDAs), the knowledge is scattered and not accessible in a straightforward way to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the DDAs retrieved were compared to the DisGeNET resource for qualitative and quantitative comparison.

Results: We obtained the DDAs via co-mention using our SicknessMiner or gene- or variant-disease similarity on DisGeNET. SicknessMiner was able to retrieve around 92% of the DisGeNET results and nearly 15% of the SicknessMiner results were specific to our pipeline.

Conclusions: SicknessMiner is a valuable tool to extract disease-disease relationships from a raw input corpus.
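Co-mention, as used above, can be illustrated by counting how often two normalized disease names appear in the same text unit. The sketch below assumes the NER and NEN steps have already produced one list of normalized disease names per unit. The unit of co-mention (sentence or abstract), the function name, and the example diseases are assumptions, not details from the paper.

```python
from collections import Counter
from itertools import combinations


def co_mentions(units):
    """Count unordered disease pairs that occur together in the same text unit."""
    counts = Counter()
    for diseases in units:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(diseases)), 2):
            counts[pair] += 1
    return counts


# Hypothetical normalized mentions, one list per text unit.
units = [
    ["leukemia", "anemia"],
    ["anemia", "leukemia", "lymphoma"],
    ["lymphoma"],
]
pair_counts = co_mentions(units)
```

Pairs with higher counts are stronger candidates for a disease-disease association; a real pipeline would likely also filter by a minimum count or a statistical score.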


ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

BioMed Research International ◽

10.1155/2016/4248026 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9 ◽

Cited By ~ 5

Author(s):

Abbas Akkasi ◽

Ekrem Varoğlu ◽

Nazife Dimililer

Keyword(s):

Conditional Random Fields ◽

Named Entity Recognition ◽

Classification Performance ◽

Entity Recognition ◽

Support Vector ◽

Learning Approaches ◽

Data Set ◽

Rule Based ◽

Named Entity ◽

Vector Machines

Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization, where raw text is segmented into tokens. This study proposes an enhanced rule-based tokenizer, ChemTok, which utilizes rules extracted mainly from the training data set. The main novelty of ChemTok is the use of the extracted rules to merge the tokens split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperform all classifiers trained on the output of the other two tokenizers in terms of classification performance and the number of incorrectly segmented entities.
