Pangaea: A modular and extensible collection of tools for mining context dependent gene relationships from the biomedical literature

Abstract. Motivation: Pangaea is a scalable and extensible command-line interface (CLI) software that integrates gene-relationship detection features to extract context-dependent structured gene-gene and gene-term relationships from the biomedical literature. It provides computational methods to identify biological relationships between a collection of genes and can be used to search and extract different types of contextual relationships amongst genes. Results: We implemented a CLI-based software for downloading PubMed articles and extracting gene relationships from abstracts using natural language processing methods. In terms of scalability, the software was designed to support the retrieval and processing of millions of articles whilst minimising memory requirements and optimising for parallel processing on multiple CPU cores. To allow extensibility, the tool permits the use of contextual custom-made models for the text processing parts, and the output is serialised as JSON objects to allow flexible post-processing workflows. Availability: The software is available online at https://github.com/ss-lab-cancerunit/pangaea

Automation of solving planimetry problems written in Ukrainian

PROBLEMS IN PROGRAMMING ◽

10.15407/pp2020.04.071 ◽

2020 ◽

pp. 071-080

Author(s):

O.P. Zhezherun ◽

O.R. Smysh

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Processing ◽

Comprehensive Analysis ◽

Text Representation ◽

Natural Languages ◽

Different Types ◽

Mathematical Problems ◽

Further Development

The article focuses on developing a software solution for solving planimetry problems written in Ukrainian. We discuss trends and available capabilities in Ukrainian natural language processing. We present a comprehensive analysis of the different ways a problem can be described, which reveals regularities in the formulation and structure of the text representation of problems. We also demonstrate similarities in how problems are written not only in Ukrainian but also in Belarusian, English, and Russian. The final result of the paper is a system that uses a morphosyntactic analyzer to process a problem's text and provide the answer to it. Ukrainian natural language processing is growing rapidly and showing impressive results, and significant new possibilities have opened up with the recent development of a gold-standard annotated corpus for the Ukrainian language. The created architecture is flexible, which makes it possible to add new geometric figures and their properties, as well as additional logic, to the program. With a little reformatting, the developed system can be used with other natural languages, such as English, Belarusian, or Russian, because the text-processing algorithm is universal owing to the globally accepted conventions for presenting such mathematical problems. Further development of the system is therefore possible.

Advances in Computational Linguistics and Text Processing Frameworks

Advances in Computer and Electrical Engineering - Handbook of Research on Engineering Innovations and Technology Management in Organizations ◽

10.4018/978-1-7998-2772-6.ch012 ◽

2020 ◽

pp. 217-244

Author(s):

Ayush Srivastav ◽

Hera Khan ◽

Amit Kumar Mishra

Keyword(s):

Neural Networks ◽

Natural Language Processing ◽

Natural Language ◽

Computational Linguistics ◽

Language Processing ◽

Text Processing ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Part Of Speech

The chapter provides an account of the major methodologies and advances in the field of Natural Language Processing. The most popular models that have been used over time for Natural Language Processing are discussed along with their applications to specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It provides insight into text preprocessing and its methodologies, such as Stemming and Lemmatization and Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, the chapter elaborates on the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models, that are used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.
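
As an illustration of the first preprocessing steps the abstract names (regex-based tokenization and stop-word removal), here is a minimal sketch. It is not taken from the chapter: the regular expression, the tiny stop-word list, and the function names are assumptions chosen for this example only.

```python
import re

# Illustrative stop-word list (an assumption); real pipelines use larger, curated lists.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


sentence = "The chapter begins with the fundamental concepts of regex and tokenization."
tokens = tokenize(sentence)
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Stemming, lemmatization, part-of-speech tagging, and named entity recognition would normally follow these steps and typically rely on a dedicated NLP library rather than hand-written rules.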

Machine Learning in Natural Language Processing

Handbook of Research on Machine Learning Applications and Trends ◽

10.4018/978-1-60566-766-9.ch014 ◽

2010 ◽

pp. 302-324

Author(s):

Marina Sokolova ◽

Stan Szpakowicz

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Processing ◽

Word Sense Disambiguation ◽

Machine Learning Techniques ◽

Word Sense ◽

Part Of Speech ◽

Applications Of Machine Learning

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition and word-sense disambiguation. People usually solve such problems without difficulty or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems. It is, however, the easy availability of large collections of texts that has made machine learning a method of choice for processing volumes of data well above the human capacity. One of the main purposes of text processing is information extraction and knowledge extraction of all kinds from such large text collections. Machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and helped build applications with serious deployment potential.

Promises of text processing: natural language processing meets AI

Drug Discovery Today ◽

10.1016/s1359-6446(02)02457-1 ◽

2002 ◽

Vol 7 (19) ◽

pp. 992-993 ◽

Cited By ~ 3

Author(s):

Jeffrey T Chang ◽

Russ B Altman

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Processing

Using distant supervision to augment manually annotated data for relation extraction

10.1101/626226 ◽

2019 ◽

Author(s):

Peng Su ◽

Gang Li ◽

Cathy Wu ◽

K. Vijay-Shanker

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Relation Extraction ◽

Biomedical Literature ◽

Training Data ◽

Distant Supervision ◽

Large Size ◽

Domain Expertise

Abstract. Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require a large amount of annotated training data, while often only small labeled datasets are available for many natural language processing tasks in the biomedical literature. Building large datasets for deep learning is expensive, since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
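
The general idea of distant supervision can be sketched as follows: a sentence that mentions two entities is labeled automatically by looking the pair up in an existing knowledge base. This is a generic illustration, not the authors' pipeline. The knowledge base contents, the gene-disease setting, and the function name are hypothetical, and the noise-removal heuristics mentioned in the abstract are not shown because the abstract does not specify them.

```python
# Hypothetical knowledge base of (gene, disease) pairs assumed to be related.
KNOWN_PAIRS = {("BRCA1", "breast cancer"), ("TP53", "lung cancer")}


def distant_labels(sentence: str, genes: list[str], diseases: list[str]):
    """Label each co-occurring (gene, disease) pair: 1 if the pair is in the
    knowledge base, 0 otherwise. Labels produced this way are noisy, because
    co-occurrence does not guarantee the sentence expresses the relation."""
    examples = []
    for gene in genes:
        for disease in diseases:
            if gene in sentence and disease in sentence:
                label = 1 if (gene, disease) in KNOWN_PAIRS else 0
                examples.append((sentence, gene, disease, label))
    return examples


sentence = "Mutations in BRCA1 are associated with breast cancer and lung cancer."
examples = distant_labels(sentence, ["BRCA1"], ["breast cancer", "lung cancer"])
for _, gene, disease, label in examples:
    print(gene, disease, label)
```

Examples labeled this way could then be combined with a smaller manually annotated set, which is the augmentation setting the abstract describes.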

Natural Language Processing methods for Document Matching

International Journal for Modern Trends in Science and Technology - RTT2020 ◽

10.46501/ijmtst061271 ◽

2020 ◽

Vol 6 (12) ◽

pp. 379-383

Author(s):

Maitri Patel and Dr Hemant D Vasava

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Original Work ◽

Educational Systems ◽

Processing Methods ◽

Similar Test ◽

Subject Examination ◽

Different Types ◽

Higher Educational

In today's rapidly moving and growing world, data, information, and knowledge of any kind can be found on the Internet. This is useful for the academic world as well, but it has also made plagiarism widespread. Plagiarism degrades the originality of work, and fraudulently using someone's original work without acknowledging them is becoming common. Sometimes teachers or professors cannot identify the plagiarised material. Higher educational systems therefore now use different types of tools for comparison. We propose matching a number of documents, such as students' assignments, against each other to find out whether students copied each other's work. We also propose comparing the ideal answer sheet for a particular subject examination with the students' similar test sheets. The idea is to compare documents and rank them on the basis of similarity. Both approaches are of one kind: comparing documents. Many methods are already used to identify plagiarism, so we can compare them and develop them further if needed.
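
The abstract does not name a specific similarity measure, so the following is only a minimal sketch of the "compare and rank by similarity" idea, assuming bag-of-words term counts and cosine similarity. The function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text: str) -> Counter:
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pairs(documents: dict[str, str]) -> list[tuple[str, str, float]]:
    """Score every pair of documents and sort from most to least similar."""
    vectors = {name: term_counts(text) for name, text in documents.items()}
    scored = [
        (x, y, cosine_similarity(vectors[x], vectors[y]))
        for x, y in combinations(vectors, 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


documents = {
    "student_a": "the cat sat on the mat",
    "student_b": "the cat sat on the mat",
    "student_c": "quantum field theory",
}
ranked = rank_pairs(documents)
for first, second, score in ranked:
    print(first, second, round(score, 3))
```

The same `rank_pairs` function could compare an ideal answer sheet with student test sheets by including the answer sheet as one of the documents and keeping only the pairs that involve it.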

BiDI: Using Machine Learning to collect and facilitate remote access to biomedical databases (Preprint)

10.2196/preprints.22976 ◽

2020 ◽

Author(s):

Eduardo Rosado ◽

Miguel Garcia-Remesal Sr ◽

Sergio Paraiso-Medina Sr ◽

Alejandro Pazos Sr ◽

Victor Maojo Sr

Keyword(s):

Deep Learning ◽

Text Processing ◽

Biomedical Literature ◽

Remote Access ◽

Pubmed Central ◽

Processing Pipeline ◽

Learning Techniques ◽

Different Types ◽

Automatic Text ◽

Biomedical Databases

BACKGROUND Currently, existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases. OBJECTIVE To address this issue we developed BiDI (Biomedical Database Inventory), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a path to access them in a seamless manner. METHODS We designed an ensemble of Deep Learning methods to extract database mentions. To train the system we annotated a set of 1,242 articles that included mentions of database publications. This dataset was used along with transfer learning techniques to train an ensemble of deep learning NLP models on the task of database publication detection. RESULTS The system obtained an F1-score of 0.929 on database detection, showing high precision and recall values. Applying this model to the PubMed and PubMed Central databases we identified over 10,000 unique databases. The ensemble also extracts the web links to the reported databases, discarding the irrelevant links. For the extraction of web links the model achieved a cross-validated F1-score of 0.908. We show two use cases, related to “omics” and the COVID-19 pandemic. CONCLUSIONS BiDI enables access to biomedical resources over the Internet and facilitates data-driven research and other scientific initiatives. The repository is available at (http://gib.fi.upm.es/bidi/) and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (biomedical and others).
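
For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The counts are hypothetical and are not the study's data, since the abstract reports only the final scores.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct detections, 2 spurious, 2 missed.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(round(example_f1, 3))
```

Because the harmonic mean is pulled towards the lower of its two inputs, a high F1-score requires both precision and recall to be high.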

Measuring the Evolution of a Scientific Field through Citation Frames

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00028 ◽

2018 ◽

Vol 6 ◽

pp. 391-406 ◽

Cited By ~ 19

Author(s):

David Jurgens ◽

Srijan Kumar ◽

Raine Hoover ◽

Dan McFarland ◽

Dan Jurafsky

Keyword(s):

Natural Language Processing ◽

Citation Analysis ◽

Language Processing ◽

State Of The Art ◽

Citation Count ◽

Scientific Field ◽

Discourse Structure ◽

Small Scale ◽

Behavioral Study ◽

Different Types

Citations have long been used to characterize the state of a scientific field and to identify influential works. However, writers use citations for different purposes, and this varied purpose influences uptake by future scholars. Unfortunately, our understanding of how scholars use and frame citations has been limited to small-scale manual citation analysis of individual papers. We perform the largest behavioral study of citations to date, analyzing how scientific works frame their contributions through different types of citations and how this framing affects the field as a whole. We introduce a new dataset of nearly 2,000 citations annotated for their function, and use it to develop a state-of-the-art classifier and label the papers of an entire field: Natural Language Processing. We then show how differences in framing affect scientific uptake and reveal the evolution of the publication venues and the field as a whole. We demonstrate that authors are sensitive to discourse structure and publication venue when citing, and that how a paper frames its work through citations is predictive of the citation count it will receive. Finally, we use changes in citation framing to show that the field of NLP is undergoing a significant increase in consensus.

Pronoun Resolution in Unrestricted Text

Nordic Journal of Linguistics ◽

10.1017/s0332586500001748 ◽

1988 ◽

Vol 11 (1-2) ◽

pp. 47-68 ◽

Cited By ~ 8

Author(s):

Kari Fraurud

Keyword(s):

Natural Language Processing ◽

Language Processing ◽

Empirical Data ◽

Anaphora Resolution ◽

Theoretical Understanding ◽

Pronoun Resolution ◽

Anaphoric Pronouns ◽

Different Types ◽

Resolution Algorithm ◽

Swedish Text

Quantitative and qualitative studies of referential relations in unrestricted natural text are necessary both for a better theoretical understanding of referential processes, and for the development of empirically well-founded algorithms for anaphora resolution in the framework of natural language processing (NLP) systems. The aim of the study reported in this paper was to provide preliminary empirical data on anaphoric pronouns in Swedish. The relation between the pronoun and its antecedent was studied for 600 pronouns in three different types of unrestricted written Swedish text, and a simple pronoun resolution algorithm was tested on the sample.

Aspiring to Unintended Consequences of Natural Language Processing: A Review of Recent Developments in Clinical and Consumer-Generated Text Processing

Yearbook of Medical Informatics ◽

10.15265/iy-2016-017 ◽

2016 ◽

Vol 25 (01) ◽

pp. 224-233 ◽

Cited By ~ 11

Author(s):

N. Elhadad ◽

D. Demner-Fushman

Keyword(s):

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Text Analysis ◽

Disease Modeling ◽

Text Processing ◽

Healthcare Quality ◽

Unintended Consequences ◽

Health Related

Summary Objectives: This paper reviews work over the past two years in Natural Language Processing (NLP) applied to clinical and consumer-generated texts. Methods: We included any application or methodological publication that leverages text to facilitate healthcare and address the health-related needs of consumers and populations. Results: Many important developments in clinical text processing, both foundational and task-oriented, were addressed in community-wide evaluations and discussed in corresponding special issues that are referenced in this review. These focused issues and in-depth reviews of several other active research areas, such as pharmacovigilance and summarization, allowed us to discuss in greater depth disease modeling and predictive analytics using clinical texts, and text analysis in social media for healthcare quality assessment, trends towards online interventions based on rapid analysis of health-related posts, and consumer health question answering, among other issues. Conclusions: Our analysis shows that although clinical NLP continues to advance towards practical applications and more NLP methods are used in large-scale live health information applications, more needs to be done to make NLP use in clinical applications a routine widespread reality. Progress in clinical NLP is mirrored by developments in social media text analysis: the research is moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to the standard healthcare quality evaluation tools.
