Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network

2020 ◽  
Author(s):  
Rakesh David ◽  
Rhys-Joshua D. Menezes ◽  
Jan De Klerk ◽  
Ian R. Castleden ◽  
Cornelia M. Hooper ◽  
...  

Abstract: With the increasing diversity and scale of molecular data, there has been growing appreciation for the application of machine learning and statistical methodologies to gain new biological insights. An important step towards this aim is Relation Extraction, which determines whether an interaction exists between two or more biological entities in a published study. Here, we employed a natural language processing method (CBOW) and a deep recurrent neural network (bi-directional LSTM) to predict relations between biological entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system extracted relevant text, and the classifier predicted interactions between protein name, subcellular localisation and experimental methodology. It obtained final precision, recall, accuracy and F1 scores of 0.951, 0.828, 0.893 and 0.884 respectively. The classifier was subsequently tested on a similar problem in crop species (CropPAL) and demonstrated comparable accuracy (0.897). Consequently, our approach can be used to extract protein functional features from unstructured text in the literature with high accuracy. The developed system will improve dissemination of protein functional data to the scientific community and unlock the potential of big data text analytics for generating new hypotheses from diverse datasets.
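As a rough consistency check on the scores quoted above, F1 is by definition the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the helper name `f1_score` is a hypothetical choice for illustration, and this is not the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.951, 0.828)
print(round(f1, 3))
```

The result is about 0.885, close to the reported 0.884. The abstract does not say how the scores were aggregated, so the small gap may reflect rounding or averaging F1 over individual runs.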

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Rakesh David ◽  
Rhys-Joshua D. Menezes ◽  
Jan De Klerk ◽  
Ian R. Castleden ◽  
Cornelia M. Hooper ◽  
...  

Abstract: The increased diversity and scale of published biological data has led to a growing appreciation for the application of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods: the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were used to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4% respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
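To illustrate the continuous bag of words (CBOW) idea mentioned above: CBOW learns word embeddings by predicting a target word from the words surrounding it. The sketch below only shows how (context, target) training pairs could be built from a tokenised sentence. The function name `cbow_pairs`, the window size and the example sentence are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        context = left + right
        if context:                            # skip single-word sentences
            pairs.append((context, target))
    return pairs


# Illustrative sentence only; not drawn from the SUBA corpus.
sentence = "the protein localises to the nucleus".split()
pairs = cbow_pairs(sentence, window=1)
for context, target in pairs:
    print(context, "->", target)
```

In a full pipeline the context words would be mapped to vectors and combined to predict the target, and the learned embeddings would then feed a sequence model such as a bi-LSTM.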


2020 ◽  
Author(s):  
Ramachandro Majji

BACKGROUND: Cancer is one of the deadliest diseases worldwide, and patients are most likely to survive when it is detected at a very early stage. Early detection is essential because the chance of survival in the final stage is limited. The symptoms of cancer can be severe, so all symptoms should be studied carefully before diagnosis.

OBJECTIVE: To propose an automatic prediction system for classifying cancer as malignant or benign.

METHODS: This paper introduces a novel strategy for cancer classification based on a JayaAnt lion optimization-based deep recurrent neural network (JayaALO-based DeepRNN). The developed model follows four steps: data normalization, data transformation, feature dimension reduction, and classification. The first step is data normalization, whose goal is to eliminate data redundancy and to avoid storing the same information in several places in a relational database. Next, data transformation is carried out using a log transformation, which makes patterns in the data more interpretable, helps satisfy modelling assumptions, and reduces skew. Non-negative matrix factorization is then employed to reduce the feature dimension. Finally, the proposed JayaALO-based DeepRNN classifies cancer based on the reduced-dimension features.

RESULTS: The proposed JayaALO-based DeepRNN showed improved results, with a maximal accuracy of 95.97%, a maximal sensitivity of 95.95%, and a maximal specificity of 96.96%.

CONCLUSIONS: The output of the proposed JayaALO-based DeepRNN can be used for cancer classification.
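The log transformation step described in the methods can be illustrated with a minimal sketch showing how taking logarithms reduces right skew. The helper names, the use of `log1p` (chosen here so that zero values are handled) and the sample values are assumptions for illustration; the abstract does not specify the exact transform or data.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Apply log(1 + x) to each value (assumed variant of the log transform)."""
    return [math.log1p(v) for v in values]


# Made-up, right-skewed feature values for illustration only.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 150]
logged = log_transform(raw)

print("skew before:", round(skewness(raw), 2))
print("skew after: ", round(skewness(logged), 2))
```

On this sample the skewness after the transform is lower than before, which is the effect the methods attribute to the log transformation.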

