scholarly journals Evaluating Named-Entity Recognition approaches in plant molecular biology

2018 ◽  
Author(s):  
Huy Do ◽  
Khoat Than ◽  
Pierre Larmande

AbstractText mining research is becoming an important topic in biology with the aim to extract biological entities from scientific papers in order to extend the biological knowledge. However, few thorough studies on text mining and applications are developed for plant molecular biology data, especially rice, thus resulting a lack of datasets available to train models able to detect entities such as genes, proteins and phenotypic traits. Since there is rare benchmarks for rice, we have to face various difficulties in exploiting advanced machine learning methods for accurate analysis of rice bibliography. In this article, we developed a new training datasets (Oryzabase) as the benchmark. Then, we evaluated the performance of several current approaches to find a methodology with the best results and assigned it as the state of the art method for our own technique in the future. We applied Name Entities Recognition (NER) tagger, which is built from a Long Short Term Memory (LSTM) model, and combined with Conditional Random Fields (CRFs) to extract information of rice genes and proteins. We analyzed the performance of LSTM-CRF when applying to the Oryzabase dataset and improved the results up to 86% in F1. We found that on average, the result from LSTM-CRF is more exploitable with the new benchmark.

2012 ◽  
Vol 198-199 ◽  
pp. 1345-1350
Author(s):  
Zhao Hui Wang ◽  
Wei Huang

The rapid growth of biological texts promotes the study of text mining which focuses on mining biological knowledge in various unstructured documents. Meanwhile, most biological text mining efforts are based on identifying biological terms such as gene and protein names. Therefore, how to identify biological terms effectively from text has become one of the important problems in bioinformatics. Conditional random fields (CRFs), an important machine learning algorithm, are graphical models for modeling the probability of labels given the observations. They have traditionally been trained with using a set of observation and label pairs. Here we use CRFs in a class of temporal learning algorithms, reinforcement learning. Consequently the labels are actions that update the environment and affect the next observation. As a result, from the view of reinforcement learning, CRFs provide a way to model joint actions in a decentralized Markov decision process, which define how agents can communicate with each other to choose the optimal joint action. We use GENIA corpus to carry on training and testing the proposed approach. The result showed the system could find out biological terms from texts effectively. We get average precision rate=90.8%, average recall rate=90.6%, and average F1 rate=90.6% on six classes of biological terms. The results are pretty better than many other biological named entity recognition systems.


2020 ◽  
Vol 39 (2) ◽  
pp. 2015-2025
Author(s):  
Orlando Ramos-Flores ◽  
David Pinto ◽  
Manuel Montes-y-Gómez ◽  
Andrés Vázquez

This work presents an experimental study on the task of Named Entity Recognition (NER) for a narrow domain in Spanish language. This study considers two approaches commonly used in this kind of problem, namely, a Conditional Random Fields (CRF) model and Recurrent Neural Network (RNN). For the latter, we employed a bidirectional Long Short-Term Memory with ELMO’s pre-trained word embeddings for Spanish. The comparison between the probabilistic model and the deep learning model was carried out in two collections, the Spanish dataset from CoNLL-2002 considering four classes under the IOB tagging schema, and a Mexican Spanish news dataset with seventeen classes under IOBES schema. The paper presents an analysis about the scalability, robustness, and common errors of both models. This analysis indicates in general that the BiLSTM-ELMo model is more suitable than the CRF model when there is “enough” training data, and also that it is more scalable, as its performance was not significantly affected in the incremental experiments (by adding one class at a time). On the other hand, results indicate that the CRF model is more adequate for scenarios having small training datasets and many classes.


Entropy ◽  
2020 ◽  
Vol 22 (2) ◽  
pp. 252 ◽  
Author(s):  
Min Zhang ◽  
Guohua Geng ◽  
Jing Chen

Increasingly, popular online museums have significantly changed the way people acquire cultural knowledge. These online museums have been generating abundant amounts of cultural relics data. In recent years, researchers have used deep learning models that can automatically extract complex features and have rich representation capabilities to implement named-entity recognition (NER). However, the lack of labeled data in the field of cultural relics makes it difficult for deep learning models that rely on labeled data to achieve excellent performance. To address this problem, this paper proposes a semi-supervised deep learning model named SCRNER (Semi-supervised model for Cultural Relics’ Named Entity Recognition) that utilizes the bidirectional long short-term memory (BiLSTM) and conditional random fields (CRF) model trained by seldom labeled data and abundant unlabeled data to attain an effective performance. To satisfy the semi-supervised sample selection, we propose a repeat-labeled (relabeled) strategy to select samples of high confidence to enlarge the training set iteratively. In addition, we use embeddings from language model (ELMo) representations to dynamically acquire word representations as the input of the model to solve the problem of the blurred boundaries of cultural objects and Chinese characteristics of texts in the field of cultural relics. Experimental results demonstrate that our proposed model, trained on limited labeled data, achieves an effective performance in the task of named entity recognition of cultural relics.


Metabolites ◽  
2021 ◽  
Vol 11 (10) ◽  
pp. 678
Author(s):  
Jonathan Dekermanjian ◽  
Wladimir Labeikovsky ◽  
Debashis Ghosh ◽  
Katerina Kechris

The bottleneck for taking full advantage of metabolomics data is often the availability, awareness, and usability of analysis tools. Software tools specifically designed for metabolomics data are being developed at an increasing rate, with hundreds of available tools already in the literature. Many of these tools are open-source and freely available but are very diverse with respect to language, data formats, and stages in the metabolomics pipeline. To help mitigate the challenges of meeting the increasing demand for guidance in choosing analytical tools and coordinating the adoption of best practices for reproducibility, we have designed and built the MSCAT (Metabolomics Software CATalog) database of metabolomics software tools that can be sustainably and continuously updated. This database provides a survey of the landscape of available tools and can assist researchers in their selection of data analysis workflows for metabolomics studies according to their specific needs. We used machine learning (ML) methodology for the purpose of semi-automating the identification of metabolomics software tool names within abstracts. MSCAT searches the literature to find new software tools by implementing a Named Entity Recognition (NER) model based on a neural network model at the sentence level composed of a character-level convolutional neural network (CNN) combined with a bidirectional long-short-term memory (LSTM) layer and a conditional random fields (CRF) layer. The list of potential new tools (and their associated publication) is then forwarded to the database maintainer for the curation of the database entry corresponding to the tool. The end-user interface allows for filtering of tools by multiple characteristics as well as plotting of the aggregate tool data to monitor the metabolomics software landscape.


2017 ◽  
Vol 22 (S3) ◽  
pp. 5195-5206 ◽  
Author(s):  
Shengli Song ◽  
Nan Zhang ◽  
Haitao Huang

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nícia Rosário-Ferreira ◽  
Victor Guimarães ◽  
Vítor S. Costa ◽  
Irina S. Moreira

Abstract Background Blood cancers (BCs) are responsible for over 720 K yearly deaths worldwide. Their prevalence and mortality-rate uphold the relevance of research related to BCs. Despite the availability of different resources establishing Disease-Disease Associations (DDAs), the knowledge is scattered and not accessible in a straightforward way to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the DDAs retrieved were compared to the DisGeNET resource for qualitative and quantitative comparison. Results We obtained the DDAs via co-mention using our SicknessMiner or gene- or variant-disease similarity on DisGeNET. SicknessMiner was able to retrieve around 92% of the DisGeNET results and nearly 15% of the SicknessMiner results were specific to our pipeline. Conclusions SicknessMiner is a valuable tool to extract disease-disease relationship from RAW input corpus.


2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Abbas Akkasi ◽  
Ekrem Varoğlu ◽  
Nazife Dimililer

Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization where raw text is segmented into tokens. This study proposes an enhanced rule based tokenizer, ChemTok, which utilizes rules extracted mainly from the train data set. The main novelty of ChemTok is the use of the extracted rules in order to merge the tokens split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperforms all classifiers trained on the output of the other two tokenizers in terms of classification performance, and the number of incorrectly segmented entities.


Sign in / Sign up

Export Citation Format

Share Document