CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision

2019 ◽  
Vol 36 (1) ◽  
pp. 264-271 ◽  
Author(s):  
Alexander Junge ◽  
Lars Juhl Jensen

Abstract
Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence.
Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease–gene and tissue–gene associations as well as in identifying physical and functional protein–protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.
Availability and implementation: CoCoScore is available at: https://github.com/JungeAlexander/cocoscore.
Supplementary information: Supplementary data are available at Bioinformatics online.
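The distant-supervision labelling scheme described in the abstract, in which a sentence-level co-mention is labelled positive or negative according to the pair's presence or absence in a gold standard, can be sketched as follows. This is a minimal illustration with hypothetical entity pairs and sentences, not CoCoScore's actual code or data.

```python
# Sketch of distant-supervision labelling: a sentence co-mentioning two
# entities is labelled positive if the pair appears in a gold-standard set of
# associations, negative otherwise. All names and data here are illustrative.

def label_comentions(comentions, gold_standard):
    """comentions: iterable of (entity_a, entity_b, sentence) triples.
    gold_standard: set of frozenset({entity_a, entity_b}) pairs."""
    labelled = []
    for a, b, sentence in comentions:
        label = 1 if frozenset((a, b)) in gold_standard else 0
        labelled.append((sentence, label))
    return labelled

gold = {frozenset(("BRCA1", "breast cancer"))}
comentions = [
    ("BRCA1", "breast cancer", "BRCA1 mutations predispose to breast cancer."),
    ("TP53", "influenza", "TP53 was mentioned alongside influenza."),
]
labelled = label_comentions(comentions, gold)
```

The labelled sentences can then be fed to any sentence-level classifier; no manual annotation is needed beyond the gold-standard pair list.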


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i490-i498
Author(s):  
Leon Weber ◽  
Kirsten Thobe ◽  
Oscar Arturo Migueles Lozano ◽  
Jana Wolf ◽  
Ulf Leser

Abstract
Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance.
Results: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA.
Availability and implementation: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article.
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Neha Warikoo ◽  
Yung-Chun Chang ◽  
Wen-Lian Hsu

Abstract
Motivation: Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities, known as the Bio-Entity Relation Extraction (BRE) task, has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, LBERT, a Lexically aware Transformer-based Bidirectional Encoder Representation model, which explores both local and global context representations for sentence-level classification tasks.
Results: This article presents one of the most exhaustive BRE studies ever conducted, covering five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein–protein interaction (PPI), drug–drug interaction and protein–bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relations for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with the bi-directionally learned global context.
Availability and implementation: https://github.com/warikoone/LBERT.
Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Peng Su ◽  
Gang Li ◽  
Cathy Wu ◽  
K. Vijay-Shanker

Abstract
Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require large amounts of annotated training data, while often only small labeled datasets are available for many natural language processing tasks in the biomedical literature. Building large datasets for deep learning is expensive, since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained by distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
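The abstract mentions applying heuristics to remove incorrect distantly supervised annotations before training. The paper does not spell out its heuristics here, so the gap-based rule below is purely an assumed example of what such a filter could look like.

```python
def filter_distant_instances(instances, max_token_gap=10):
    """Toy noise-reduction heuristic for distantly supervised data: keep a
    sentence only if its two entity mentions are at most max_token_gap tokens
    apart, on the assumption that far-apart co-mentions are more likely to be
    incorrect annotations. instances: iterable of
    (tokens, entity_a_index, entity_b_index, label) tuples."""
    return [inst for inst in instances
            if abs(inst[1] - inst[2]) <= max_token_gap]

instances = [
    ("GeneA directly inhibits DiseaseB".split(), 0, 3, 1),  # close pair: kept
    (["token"] * 40, 1, 30, 1),                             # far apart: dropped
]
kept = filter_distant_instances(instances)
```

The filtered set can then be combined with the manually annotated data, e.g. by pre-training on the distant data and fine-tuning on the manual set.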


Author(s):  
Yujin Yuan ◽  
Liyuan Liu ◽  
Siliang Tang ◽  
Zhongfei Zhang ◽  
Yueting Zhuang ◽  
...  

Distant supervision leverages knowledge bases to automatically label instances, allowing us to train a relation extractor without human annotations. However, the generated training data typically contain massive noise and may yield poor performance with vanilla supervised learning. In this paper, we propose to conduct multi-instance learning with a novel Cross-relation Cross-bag Selective Attention (C2SA), which leads to noise-robust training of distantly supervised relation extractors. Specifically, we employ sentence-level selective attention to reduce the effect of noisy or mismatched sentences, while the correlation among relations is captured to improve the quality of the attention weights. Moreover, instead of treating all entity pairs equally, we pay more attention to entity pairs of higher quality, again via a selective attention mechanism. Experiments with two types of relation extractors demonstrate the superiority of the proposed approach over the state of the art, while further ablation studies verify our intuitions and demonstrate the effectiveness of the two proposed techniques.
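The sentence-level selective attention the abstract builds on can be sketched as follows: each sentence in a bag is scored against a relation query, the scores are softmaxed, and the bag representation is the attention-weighted sum, so noisy or mismatched sentences receive low weight. This is a simplified toy sketch with made-up vectors, not the paper's C2SA architecture.

```python
import math

def selective_attention(sentence_vecs, relation_query):
    # Score each sentence embedding in the bag against the relation query
    # (dot product), softmax the scores into attention weights, and return
    # the weighted bag representation plus the weights themselves.
    scores = [sum(s_i * q_i for s_i, q_i in zip(s, relation_query))
              for s in sentence_vecs]
    m = max(scores)                              # for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vecs[0])
    bag = [sum(w * s[d] for w, s in zip(weights, sentence_vecs))
           for d in range(dim)]
    return bag, weights

bag, weights = selective_attention(
    [[1.0, 0.0], [0.0, 1.0]],  # two sentence embeddings in one bag
    [1.0, 0.0],                # relation query vector
)
```

The first sentence aligns with the query and thus dominates the bag representation; C2SA extends this idea across relations and across bags.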


As data grows day by day, data mining is increasingly giving way to big data techniques. Within data mining, text mining is the process of deriving structured, high-quality information from text documents; it helps businesses discover valuable knowledge. Sentiment analysis, one application of text mining, determines the emotional tone of a text and is a major task in natural language processing. The objective of this paper is to categorize documents at the sentence level and the review level, and to apply classification techniques to a dataset of electronic product reviews. An ensemble of classification techniques is applied to the dataset; the techniques are then compared on various parameters to find the best one, and on that basis suggestions are given to the company for improving the product.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sijie Li ◽  
Ziqi Guo ◽  
Jacob B. Ioffe ◽  
Yunfei Hu ◽  
Yi Zhen ◽  
...  

Abstract
Autism is a spectrum disorder with wide variation in type and severity of symptoms. Understanding gene–phenotype associations is vital to unravel the disease mechanisms and advance its diagnosis and treatment. To date, several databases have stored a large portion of gene–phenotype associations, which are mainly obtained from genetic experiments. However, a large proportion of gene–phenotype associations are still buried in the autism-related literature, and there are limited resources to investigate autism-associated gene–phenotype associations. Given the abundance of the autism-related literature, we were thus motivated to develop Autism_genepheno, a text mining pipeline to identify sentence-level mentions of autism-associated genes and phenotypes in the literature through natural language processing methods. We have generated a comprehensive database of gene–phenotype associations in the last five years' autism-related literature that can be easily updated as new literature becomes available. We have evaluated our pipeline through several different approaches, and we are able to rank and select top autism-associated genes through their unique and wide spectrum of phenotypic profiles, which could provide a unique resource for the diagnosis and treatment of autism. The data resources and the Autism_genepheno pipeline are available at: https://github.com/maiziezhoulab/Autism_genepheno.



2020 ◽  
Vol 33 (5) ◽  
pp. 1357-1380
Author(s):  
Yilu Zhou ◽  
Yuan Xue

Purpose
Strategic alliances among organizations are some of the central drivers of innovation and economic growth. However, the discovery of alliances has relied on pure manual search and has limited scope. This paper proposes a text-mining framework, ACRank, that automatically extracts alliances from news articles. ACRank aims to provide human analysts with higher coverage of strategic alliances than existing databases, yet maintain a reasonable extraction precision. It has the potential to discover alliances involving less well-known companies, a situation often neglected by commercial databases.
Design/methodology/approach
The proposed framework is a systematic process of alliance extraction and validation using natural language processing techniques and alliance domain knowledge. The process integrates news article search, entity extraction, and syntactic and semantic linguistic parsing techniques. In particular, the Alliance Discovery Template (ADT) component identifies a number of linguistic templates expanded from expert domain knowledge and extracts potential alliances at sentence level. Alliance Confidence Ranking (ACRank) further validates each unique alliance based on multiple features at document level. The framework is designed to deal with extremely skewed, noisy data from news articles.
Findings
Evaluating the performance of ACRank on a gold-standard dataset of IBM alliances (2006–2008) showed that sentence-level ADT-based extraction achieved 78.1% recall and 44.7% precision and eliminated over 99% of the noise in news articles. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. Further comparison with the Thomson Reuters SDC database showed that SDC covered less than 20% of total alliances, while ACRank covered 67%. When applying ACRank to Dow 30 company news articles, ACRank is estimated to achieve a recall between 0.48 and 0.95, and only 15% of the alliances appeared in SDC.
Originality/value
The research framework proposed in this paper indicates a promising direction for building a comprehensive alliance database using automatic approaches. It adds value to academic studies and business analyses that require in-depth knowledge of strategic alliances. It also encourages other innovative studies that use text mining and data analytics to study business relations.
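Template-based sentence-level extraction of the kind the ADT component performs can be illustrated with a couple of hand-written patterns. The regular expressions and the example sentence below are invented for illustration; the paper's actual template set is derived from expert domain knowledge and is far richer.

```python
import re

# Toy alliance-discovery templates in the spirit of ADT: hand-written
# linguistic patterns matched at sentence level. Patterns are illustrative.
TEMPLATES = [
    re.compile(r"(?P<a>[A-Z]\w+) (?:announced|formed|signed) (?:a|an) "
               r"(?:strategic )?(?:alliance|partnership) with (?P<b>[A-Z]\w+)"),
    re.compile(r"(?P<a>[A-Z]\w+) and (?P<b>[A-Z]\w+) (?:entered into|formed) "
               r"a joint venture"),
]

def extract_alliances(sentence):
    """Return (company_a, company_b) pairs matched by any template."""
    pairs = []
    for pattern in TEMPLATES:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("a"), match.group("b")))
    return pairs

pairs = extract_alliances("IBM announced a strategic alliance with Lenovo.")
```

Sentence-level matches like these are high-recall but noisy, which is why a document-level ranking stage such as ACRank is needed to restore precision.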


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Yang Xiang ◽  
Yaoyun Zhang ◽  
Xiaolong Wang ◽  
Yang Qin ◽  
Wenying Han

Distant supervision (DS) automatically annotates free text with relation mentions from existing knowledge bases (KBs), providing a way to alleviate the problem of insufficient training data for relation extraction in natural language processing (NLP). However, the heuristic annotation process does not guarantee the correctness of the generated labels, prompting a hot research issue: how to efficiently make use of the noisy training data. In this paper, we model two types of biases to reduce noise: (1) bias-dist, to model the relative distance between points (instances) and classes (relation centers); (2) bias-reward, to model the possibility of each heuristically generated label being incorrect. Based on these biases, we propose three noise-tolerant models: MIML-dist, MIML-dist-classify, and MIML-reward, building on top of a state-of-the-art distantly supervised learning algorithm. Experimental evaluations compared with three landmark methods on the KBP dataset validate the effectiveness of the proposed methods.
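The bias-dist idea, weighting each heuristically labelled instance by its proximity to its relation center so that outlying (likely mislabelled) instances contribute less, can be sketched as follows. The inverse-distance weighting and the toy vectors are assumptions for illustration, not the paper's MIML formulation.

```python
import math

def bias_dist_weights(instances, center):
    # Weight each instance by inverse distance to the relation center and
    # normalize, so instances far from the center (likely label noise)
    # receive lower weight during training.
    dists = [math.dist(x, center) for x in instances]
    raw = [1.0 / (1.0 + d) for d in dists]
    total = sum(raw)
    return [r / total for r in raw]

weights = bias_dist_weights([[0.0, 0.0], [3.0, 4.0]], center=[0.0, 0.0])
```

A noise-aware trainer could use such weights to discount suspect instances while still letting them contribute some signal.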

