Recognition of Disease Genetic Information from Unstructured Text Data Based on BiLSTM-CRF for Molecular Mechanisms

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Lejun Gong ◽  
Xingxing Zhang ◽  
Tianyin Chen ◽  
Li Zhang

Recognizing disease-relevant entities is an important task in mining unstructured text data from the biomedical literature to acquire biomedical knowledge. Autism spectrum disorder (ASD) is a neurological and developmental disorder characterized by deficits in communication and social interaction and by repetitive behaviour. However, the disease remains poorly understood to date. This study uses a computational machine learning approach to identify disease-associated entities in a text collection, with the aim of exploring molecular mechanisms related to ASD. Entities related to disease are extracted from the autism-related biomedical literature using a deep learning model that combines bidirectional long short-term memory (BiLSTM) with a conditional random field (CRF). Compared with previous work, the approach is promising for identifying entities related to disease. The proposed approach, which covers five types of molecular entities, is evaluated on the GENIA corpus and obtains an F-score of 76.81%. After removing repeated molecular entities, the work extracted 9146 proteins, 145 RNAs, 7680 DNAs, 1058 cell-types, and 981 cell-lines from the autism biomedical literature. Finally, we perform GO and KEGG analyses of the test dataset. This study could serve as a reference for further studies on the etiology of disease on the basis of molecular mechanisms and provide a way to explore disease genetic information.
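An F-score for entity recognition, such as the one reported above, is commonly computed over whole entity spans rather than individual tokens. The following is a minimal sketch of that calculation, assuming BIO-style tags and exact-match scoring; the abstract does not state the tagging scheme or matching criterion, and the names `bio_to_spans` and `entity_f1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        if tag.startswith(("B-", "I-")):
            # Lenient choice: a stray I- tag opens a new entity.
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["B-protein", "I-protein", "O", "B-DNA", "O"]
    pred = ["B-protein", "I-protein", "O", "B-RNA", "O"]
    print(entity_f1(gold, pred))
```

Under exact matching, a prediction with the right boundaries but the wrong type (DNA versus RNA in the toy example) counts as both a false positive and a false negative, which is why type confusions between the five entity classes weigh on the score.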

Author(s):  
Logeswari Shanmugam ◽  
Premalatha K.

Biomedical literature is the primary repository of biomedical knowledge, and PubMed is the most comprehensive database for collecting, organizing and analyzing this textual knowledge. The high dimensionality of natural language text makes text data noisy and sparse in the vector space. Hence, data preprocessing and feature selection are important steps in text processing. Ontologies select the meaningful terms semantically associated with the concepts in a document to reduce the dimensionality of the original text. In this chapter, semantic-based indexing approaches with cognitive search are proposed; they use a domain ontology to extract relevant information from big and diverse data sets for users.


2021 ◽  
Author(s):  
Helena Balabin ◽  
Charles Tapley Hoyt ◽  
Colin Birkenbihl ◽  
Benjamin M. Gyori ◽  
John A. Bachman ◽  
...  

The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models (KGEMs). However, representations based on a single modality are inherently limited. To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs. This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA) consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against two baseline models trained on either one of the modalities (i.e., text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.083. Additionally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Finally, the source code and pre-trained STonKGs models are available at https://github.com/stonkgs/stonkgs and https://huggingface.co/stonkgs/stonkgs-150k.


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
AJ Venkatakrishnan ◽  
Arjun Puranik ◽  
Akash Anand ◽  
David Zemmour ◽  
Xiang Yao ◽  
...  

The COVID-19 pandemic demands assimilation of all biomedical knowledge to decode mechanisms of pathogenesis. Despite the recent renaissance in neural networks, a platform for the real-time synthesis of the exponentially growing biomedical literature and deep omics insights is unavailable. Here, we present the nferX platform for dynamic inference from over 45 quadrillion possible conceptual associations from unstructured text, and triangulation with insights from single-cell RNA-sequencing, bulk RNA-seq and proteomics from diverse tissue types. A hypothesis-free profiling of ACE2 suggests tongue keratinocytes, olfactory epithelial cells, airway club cells and respiratory ciliated cells as potential reservoirs of the SARS-CoV-2 receptor. We find the gut as the putative hotspot of COVID-19, where a maturation correlated transcriptional signature is shared in small intestine enterocytes among coronavirus receptors (ACE2, DPP4, ANPEP). A holistic data science platform triangulating insights from structured and unstructured data holds potential for accelerating the generation of impactful biological insights and hypotheses.


2018 ◽  
Author(s):  
John M Giorgi ◽  
Gary D Bader

Motivation: The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER) such as genes/proteins, diseases, and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. An alternative to GSCs are silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER. Results: We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller, but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target data sets with a small number of labels (approximately 6000 or fewer). Availability and implementation: Source code for the LSTM-CRF is available at https://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available at https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.


Genes ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 782
Author(s):  
Veronica Tisato ◽  
Juliana A. Silva ◽  
Giovanna Longo ◽  
Ines Gallo ◽  
Ajay V. Singh ◽  
...  

Autism spectrum disorder (ASD) is a complex neurodevelopmental condition affecting behavior and communication, presenting with extremely different clinical phenotypes and features. ASD etiology is composite and multifaceted, with several causes and risk factors responsible for different individual disease pathophysiological processes and clinical phenotypes. From a genetic and epigenetic perspective, several candidate genes have been reported as potentially linked to ASD, which can be detected in about 10–25% of patients. Folate gene polymorphisms have been previously associated with other psychiatric and neurodegenerative diseases, mainly focused on gene variants in the DHFR gene (5q14.1; rs70991108, 19bp ins/del), MTHFR gene (1p36.22; rs1801133, C677T and rs1801131, A1298C), and CBS gene (21q22.3; rs876657421, 844ins68). Of note, their roles have been scarcely investigated from a sex/gender viewpoint, though ASD is characterized by a strong sex gap in onset-risk and progression. The aim of the present review is to point out the molecular mechanisms related to intracellular folate recycling, which in turn affect the remethylation and transsulfuration pathways and have potential effects on ASD. The brain epigenome during fetal life necessarily reflects the sex-dependent imprint of genome-environment interactions, whose effects are difficult to decipher. Here we focus on the DHFR, MTHFR and CBS gene-triad by dissecting their roles in a sex-oriented view, primarily to bring new perspectives in ASD epigenetics.


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 411
Author(s):  
Yunkai Zhang ◽  
Yinghong Tian ◽  
Pingyi Wu ◽  
Dongfan Chen

The recognition of stereotyped action is one of the core diagnostic criteria of Autism Spectrum Disorder (ASD). However, it mainly relies on parent interviews and clinical observations, which leads to a long diagnosis cycle and prevents ASD children from receiving timely treatment. To speed up the recognition of stereotyped actions, a method based on skeleton data and Long Short-Term Memory (LSTM) is proposed in this paper. In the first stage of our method, the OpenPose algorithm is used to obtain the initial skeleton data from video of ASD children, and four denoising methods are proposed to eliminate noise in the initial skeleton data. In the second stage, we track multiple ASD children in the same scene by matching the distance between current skeletons and previous skeletons. In the last stage, an LSTM-based neural network is proposed to classify the ASD children’s actions. The experiments show that our proposed method is effective for ASD children’s action recognition. Compared to previous traditional schemes, our scheme has higher accuracy and is almost non-invasive for ASD children.
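The tracking stage described above, which matches current skeletons to previous ones by distance, can be illustrated with a small sketch. This is an assumption-laden illustration, not the authors' algorithm: it uses mean Euclidean joint distance and greedy nearest-first assignment, and the names `skeleton_distance`, `match_skeletons` and the `max_distance` threshold are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Skeleton = List[Tuple[float, float]]  # one (x, y) point per joint


def skeleton_distance(a: Skeleton, b: Skeleton) -> float:
    """Mean Euclidean distance between corresponding joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def match_skeletons(
    previous: Dict[int, Skeleton],
    current: List[Skeleton],
    max_distance: float = 50.0,  # assumed gating threshold, in pixels
) -> Dict[int, int]:
    """Map each current skeleton index to a track id.

    Pairs are assigned greedily, closest first. Skeletons with no
    previous skeleton within max_distance start a new track.
    """
    pairs = sorted(
        (skeleton_distance(prev, cur), track_id, idx)
        for track_id, prev in previous.items()
        for idx, cur in enumerate(current)
    )
    assigned: Dict[int, int] = {}
    used_tracks = set()
    for dist, track_id, idx in pairs:
        if dist > max_distance:
            break  # all remaining pairs are even farther apart
        if track_id in used_tracks or idx in assigned:
            continue
        assigned[idx] = track_id
        used_tracks.add(track_id)
    next_id = max(previous, default=-1) + 1
    for idx in range(len(current)):
        if idx not in assigned:
            assigned[idx] = next_id
            next_id += 1
    return assigned


if __name__ == "__main__":
    prev = {0: [(0.0, 0.0), (0.0, 10.0)], 1: [(100.0, 0.0), (100.0, 10.0)]}
    cur = [[(101.0, 0.0), (101.0, 10.0)], [(1.0, 0.0), (1.0, 10.0)]]
    print(match_skeletons(prev, cur))
```

Greedy assignment is simple and adequate when children are well separated in the frame; an optimal assignment method could be substituted when skeletons frequently overlap.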


Toxics ◽  
2021 ◽  
Vol 9 (5) ◽  
pp. 97
Author(s):  
Tristan Furnary ◽  
Rolando Garcia-Milian ◽  
Zeyan Liew ◽  
Shannon Whirledge ◽  
Vasilis Vasiliou

Recent epidemiological studies suggest that prenatal exposure to acetaminophen (APAP) is associated with increased risk of Autism Spectrum Disorder (ASD), a neurodevelopmental disorder affecting 1 in 59 children in the US. Maternal and prenatal exposure to pesticides from food and environmental sources has also been implicated in affecting fetal neurodevelopment. However, the underlying mechanisms for ASD are so far unknown, likely with complex and multifactorial etiology. The aim of this study was to explore the potential effects of APAP and pesticide exposure on development with regard to the etiology of ASD by highlighting common genes and biological pathways. Genes associated with APAP, pesticides, and ASD through human research were retrieved from molecular and biomedical literature databases. The interaction network of overlapping genetic associations was subjected to network topology analysis and functional annotation of the resulting clusters. These genes were over-represented in pathways and biological processes (FDR p < 0.05) related to apoptosis, metabolism of reactive oxygen species (ROS), and carbohydrate metabolism. Since these three biological processes are frequently implicated in ASD, our findings support the hypothesis that cell death processes and specific metabolic pathways, both of which appear to be targeted by APAP and pesticide exposure, may be involved in the etiology of ASD. This novel exposures-gene-disease database mining might inspire future work on understanding the biological underpinnings of various ASD risk factors.


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Yuen Gao ◽  
Natalia Duque-Wilckens ◽  
Mohammad B. Aljazi ◽  
Yan Wu ◽  
Adam J. Moeser ◽  
...  

Autism spectrum disorder (ASD) is a neurodevelopmental disease associated with various gene mutations. Recent genetic and clinical studies report that mutations of the epigenetic gene ASH1L are highly associated with human ASD and intellectual disability (ID). However, the causality and underlying molecular mechanisms linking ASH1L mutations to genesis of ASD/ID remain undetermined. Here we show loss of ASH1L in the developing mouse brain is sufficient to cause multiple developmental defects, core autistic-like behaviors, and impaired cognitive memory. Gene expression analyses uncover critical roles of ASH1L in regulating gene expression during neural cell development. Thus, our study establishes an ASD/ID mouse model revealing the critical function of an epigenetic factor ASH1L in normal brain development, a causality between Ash1L mutations and ASD/ID-like behaviors in mice, and potential molecular mechanisms linking Ash1L mutations to brain functional abnormalities.


Author(s):  
B. Premjith ◽  
K. P. Soman

Morphological synthesis is one of the main components of Machine Translation (MT) frameworks, especially when either or both of the source and target languages are morphologically rich. Morphological synthesis is the process of combining two words or two morphemes according to the Sandhi rules of the morphologically rich language. Malayalam and Tamil are two languages in India which are morphologically abundant as well as agglutinative. Morphological synthesis of a word in these two languages is challenging for the following reasons: (1) Abundance in morphology; (2) Complex Sandhi rules; (3) The possibility in Malayalam to form words by combining words that belong to different syntactic categories (for example, noun and verb); and (4) The construction of a sentence by combining multiple words. We formulated the task of the morphological generation of nouns and verbs of Malayalam and Tamil as a character-to-character sequence tagging problem. In this article, we used deep learning architectures like Recurrent Neural Network (RNN), Long Short-Term Memory Networks (LSTM), Gated Recurrent Unit (GRU), and their stacked and bidirectional versions for the implementation of morphological synthesis at the character level. In addition, we investigated the performance of combining the aforementioned deep learning architectures with the Conditional Random Field (CRF) in the morphological synthesis of nouns and verbs in Malayalam and Tamil. We observed that the addition of CRF to the Bidirectional LSTM/GRU architecture achieved more than 99% accuracy in the morphological synthesis of Malayalam and Tamil nouns and verbs.


2021 ◽  
pp. 155005942110549
Author(s):  
Thanga Aarthy Manoharan ◽  
Menaka Radhakrishnan

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by impairment in sensory modulation. These sensory modulation deficits ultimately lead to difficulties in adaptive behavior and intellectual functioning. The purpose of this study was to use electroencephalography (EEG) to observe changes in the nervous system in response to auditory/visual and audio-only stimuli in children with autism and typically developing (TD) children. In this study, 20 children with ASD and 20 TD children were considered to investigate the difference in neural dynamics. The neural dynamics can be understood through non-linear analysis of the EEG signal. To reveal the underlying nonlinear EEG dynamics, recurrence quantification analysis (RQA) is applied. RQA measures were analyzed under various parameter changes in the RQA computations. The cosine distance metric was considered because of its capability in information retrieval, and the other distance metric parameters were compared to identify the best biomarker. Each computational combination of the RQA measure and the responding channel was analyzed and discussed. To classify ASD and TD, the resulting RQA features were fed to the designed BiLSTM (bi-long short-term memory) network. The classification accuracy was tested channel-wise for each combination. The T3 and T5 channels, with FAN (fixed amount of nearest neighbors) as the neighborhood selection and cosine as the distance metric, are considered the best-suited combination to discriminate between ASD and TD, with a classification accuracy of 91.86%.
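RQA measures are derived from a recurrence matrix, and the combination named above (cosine distance with FAN neighborhood selection) determines how that matrix is built. The sketch below shows one plausible construction under stated assumptions: a time-delay embedding with hypothetical parameters `dim`, `delay` and `k`, none of which are given in the abstract, and function names chosen only for illustration.

```python
import math
from typing import List


def embed(signal: List[float], dim: int, delay: int) -> List[List[float]]:
    """Time-delay embedding of a 1-D signal into dim-dimensional states."""
    n = len(signal) - (dim - 1) * delay
    return [[signal[i + j * delay] for j in range(dim)] for i in range(n)]


def cosine_distance(u: List[float], v: List[float]) -> float:
    """1 minus cosine similarity; zero-norm vectors get distance 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def recurrence_matrix_fan(
    signal: List[float], dim: int = 2, delay: int = 1, k: int = 3
) -> List[List[int]]:
    """Recurrence matrix using a fixed amount of nearest neighbors (FAN).

    Row i marks the k states closest to state i under cosine distance,
    so every row has exactly k recurrence points (self excluded).
    """
    states = embed(signal, dim, delay)
    n = len(states)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted(
            (cosine_distance(states[i], states[j]), j)
            for j in range(n)
            if j != i
        )
        for _, j in neighbors[:k]:
            matrix[i][j] = 1
    return matrix


if __name__ == "__main__":
    toy_signal = [math.sin(0.5 * t) for t in range(20)]  # stand-in for EEG
    for row in recurrence_matrix_fan(toy_signal)[:3]:
        print(row)
```

Because FAN fixes the number of neighbors per row, the recurrence rate is constant by construction and the matrix is generally not symmetric; measures based on diagonal and vertical line structures are the ones that remain informative in this setting.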

