PAPER2VEC AND CITE2VEC METHODS FOR ANALYZING COLLECTIONS OF SCIENTIFIC PUBLICATIONS

Author(s):  
N. I. Tikhonov

Collections of scientific publications are growing rapidly, and scientists have access to portals containing a large number of documents. Such a large amount of data is difficult to investigate. Document visualization methods are used to reduce labor costs, to search for relevant and similar documents, to evaluate the scientific contribution of particular publications, and to reveal hidden links between documents. These methods can be based on various models of document representation. In recent years, word embedding methods for natural language processing have become extremely popular, and following them, methods for obtaining vector representations of whole documents have begun to appear. Although many document analysis systems already exist, new methods can offer new insights into collections, scale better to large collections of documents, or uncover new relationships between documents. This article discusses two methods, Paper2vec and Cite2vec, that derive vector representations of documents from citation information. The text gives a brief description of both methods, describes experiments with them, including visualization of their results, and discusses the problems that arise.
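To make the citation-based embedding idea concrete, here is a minimal sketch, not the authors' implementation: Paper2vec-style document vectors can be learned by treating random walks over the citation graph as "sentences" for Word2vec (a DeepWalk-style simplification). The toy graph, walk parameters, and vector size below are assumptions.

```python
# A minimal sketch: random walks over a citation graph fed to Word2Vec.
import random
from gensim.models import Word2Vec

# Hypothetical toy citation graph: paper id -> ids of papers it cites.
citations = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": ["p4"],
    "p4": ["p1"],
}

def random_walks(graph, walks_per_node=10, walk_length=8):
    """Generate short random walks; each walk acts as one 'sentence'."""
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_length - 1):
                neighbours = graph.get(node, [])
                if not neighbours:
                    break
                node = random.choice(neighbours)
                walk.append(node)
            walks.append(walk)
    return walks

# Skip-gram embeddings over the walks; vector_size is an assumption.
model = Word2Vec(random_walks(citations), vector_size=64,
                 window=4, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("p1"))  # papers near p1 in embedding space
```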

2021 · Vol 19 (3) · pp. 61-69
Author(s):  
N. I. Tikhonov

Visualizations are used to better understand collections of scientific publications, and various methods of analyzing text collections can be used to build them. This article discusses two methods, Paper2vec and Cite2vec, that obtain vector representations of documents using citation information. To demonstrate how these techniques work and to give an example of their application, visualizations were developed and are described in this paper.
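As a sketch of the visualization step, and assuming document vectors like those learned above, the embeddings can be projected to two dimensions with t-SNE and plotted; the placeholder vectors and t-SNE settings are illustrative, not the paper's actual setup.

```python
# A minimal sketch: project document vectors to 2-D and scatter-plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 64))  # placeholder document embeddings
coords = TSNE(n_components=2, perplexity=15,
              random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("Document embeddings (t-SNE projection)")
plt.show()
```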


Author(s):  
Nidheesh Melethadathil ◽  
Jaap Heringa ◽  
Bipin Nair ◽  
Shyam Diwakar

With the rapid growth in the number of scientific publications in domains such as neuroscience and medicine, visually interlinking documents in online databases such as PubMed, with the purpose of indicating the context of query results, can improve the multi-disciplinary relevance of search results. Translational medicine and systems biology rely on studies relating basic sciences to applications, often spanning multiple disciplinary domains. This paper focuses on the design and development of a new scientific document visualization platform, which allows inferring translational aspects in the biosciences within published articles using machine learning and natural language processing (NLP) methods. From online databases, this software platform effectively extracted relationship connections between multiple sub-domains within neuroscience, derived from abstracts related to a user query. In our current implementation, the document visualization platform employs two clustering algorithms, namely Suffix Tree Clustering (STC) and LINGO. Clustering quality was improved by mapping top-ranked cluster labels derived from the UMLS Metathesaurus using a scoring function. To avoid non-clustered documents, an iterative scheme called auto-clustering was developed, which allowed documents left uncategorized during the initial grouping process to be mapped to relevant clusters. The efficacy of this document clustering and visualization platform was evaluated by expert-based validation of clustering results obtained with unique search terms. Compared to normal clustering, auto-clustering demonstrated better efficacy by generating larger numbers of unique and relevant cluster labels. Using this implementation, a Parkinson's disease systems theory model was developed, and studies based on user queries related to neuroscience and oncology are showcased as applications.
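For illustration only: the platform itself uses Suffix Tree Clustering and LINGO (both available in the Carrot2 project), but the basic step of grouping abstracts by content can be sketched with TF-IDF and k-means as an analogous stand-in; the toy abstracts and cluster count are assumptions.

```python
# An analogous illustration, not the STC/LINGO algorithms the paper uses:
# group abstracts by lexical content with TF-IDF features and k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [  # toy data standing in for retrieved PubMed abstracts
    "dopamine neurons in parkinson disease models",
    "deep brain stimulation outcomes in parkinson patients",
    "tumor suppressor genes in oncology screening",
    "chemotherapy response prediction in cancer cohorts",
]
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: a neuroscience group and an oncology group
```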


2017
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be easily used for proteins with low sequence similarity.
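A minimal sketch of the Mol2vec encoding, assuming RDKit and Gensim: Morgan substructure identifiers play the role of words, a Word2vec model supplies their vectors, and a compound vector is the sum over its substructures. The toy corpus, radius, and vector size are assumptions; the real corpus covers all available chemical matter.

```python
# A minimal sketch of the Mol2vec idea, not the released pre-trained model.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from gensim.models import Word2Vec

def substructure_ids(smiles, radius=1):
    """Morgan identifiers for every atom environment up to `radius`."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprint(mol, radius)
    return [str(i) for i in fp.GetNonzeroElements()]

# Toy corpus: each compound is a 'sentence' of substructure identifiers.
corpus = [substructure_ids(s) for s in ["CCO", "CCN", "c1ccccc1O"]]
model = Word2Vec(corpus, vector_size=32, min_count=1, sg=1, epochs=50)

def mol2vec(smiles):
    """Compound vector = sum of its substructure vectors."""
    return np.sum([model.wv[i] for i in substructure_ids(smiles)
                   if i in model.wv], axis=0)

print(mol2vec("CCO")[:5])  # first few dimensions of the compound vector
```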


2021 · pp. 105971232098304
Author(s):  
R Alexander Bentley ◽  
Joshua Borycz ◽  
Simon Carrignon ◽  
Damian J Ruck ◽  
Michael J O’Brien

The explosion of online knowledge has made knowledge, paradoxically, difficult to find. A web or journal search might retrieve thousands of articles, ranked in a manner that is biased by, for example, popularity or eigenvalue centrality rather than by informed relevance to the complex query. With hundreds of thousands of articles published each year, the dense, tangled thicket of knowledge grows even more entwined. Although natural language processing and new methods of generating knowledge graphs can extract increasingly high-level interpretations from research articles, the results are inevitably biased toward recent, popular, and/or prestigious sources. This is a result of the inherent nature of human social-learning processes. To preserve and even rediscover lost scientific ideas, we employ the theory that scientific progress is punctuated by means of inspired, revolutionary ideas at the origin of new paradigms. Using a brief case example, we suggest how phylogenetic inference might be used to rediscover potentially useful lost discoveries, as a way in which machines could help drive revolutionary science.


2021 · Vol 16 (1) · pp. 11
Author(s):  
Klaus Rechert ◽  
Jurek Oberhauser ◽  
Rafael Gieschke

Software, and in particular source code, has become an important component of scientific publications and is henceforth subject to research data management. Maintaining source code such that it remains a usable and valuable scientific contribution is a huge task, and not all code contributions can be actively maintained forever. Eventually, there will be a significant backlog of legacy source code. In this article we analyse the requirements for applying the concept of long-term reusability to source code. We use a simple case study to identify gaps, and we provide a technical infrastructure, based on emulation, to support automated builds of historic software from source code.
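As an illustrative sketch only: the article's infrastructure is emulator-based, but the idea of an automated, reproducible build of historic source code can be approximated by driving a pinned container image from a script; the image name and build command below are assumptions.

```python
# A minimal sketch: run a historic build toolchain in a pinned, isolated
# environment. A container stands in for the emulator-based setup.
import subprocess
from pathlib import Path

def build_legacy(src_dir: Path, image: str = "debian:7", cmd: str = "make"):
    """Build legacy source inside a fixed environment; returns the result."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{src_dir.resolve()}:/src", "-w", "/src",
         image, "sh", "-c", cmd],
        capture_output=True, text=True, check=False)

result = build_legacy(Path("./legacy-project"))  # hypothetical project dir
print(result.returncode, result.stdout[-200:] if result.stdout else "")
```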


2021 · Vol 102 · pp. 02001
Author(s):  
Anja Wilhelm ◽  
Wolfgang Ziegler

The primary focus of technical communication (TC) in the past decade has been the system-assisted generation and utilization of standardized, structured, and classified content for dynamic output solutions. Nowadays, machine learning (ML) approaches offer a new opportunity to integrate unstructured data into existing knowledge bases without the need to manually organize information into topic-based content enriched with semantic metadata. To make the field of artificial intelligence (AI) more accessible to technical writers and content managers, cloud-based machine learning as a service (MLaaS) solutions provide a starting point for domain-specific ML modelling while relieving the modelling process of extensive coding, data processing, and storage demands. Information architects can therefore focus on information extraction tasks and on prospects for including pre-existing knowledge from other systems in the ML modelling process. In this paper, the capability and performance of a cloud-based ML service, IBM Watson, are analysed to assess its value for semantic context analysis. The ML model is based on a supervised learning method and features deep learning (DL) and natural language processing (NLP) techniques. The subject of the analysis is a corpus of scientific publications on the 2019 coronavirus disease (COVID-19). The analysis focuses on information extraction regarding preventive measures and the effects of the pandemic on healthcare workers.
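A minimal sketch of calling a cloud MLaaS endpoint for text analysis, assuming the ibm-watson Python SDK and the Natural Language Understanding service; the API key, service URL, version date, and feature choices are placeholders, and the paper's own Watson configuration may differ.

```python
# A minimal sketch using the ibm-watson SDK; credentials are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, KeywordsOptions)

nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlu.set_service_url("YOUR_SERVICE_URL")

abstract = ("Healthcare workers reported an increased workload and the use "
            "of protective measures during the pandemic.")
result = nlu.analyze(
    text=abstract,
    features=Features(entities=EntitiesOptions(limit=10),
                      keywords=KeywordsOptions(limit=10))).get_result()
print([k["text"] for k in result["keywords"]])  # extracted key terms
```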


Vector representations of language have been shown to be useful in a number of natural language processing tasks. In this paper, we investigate the effectiveness of word vector representations for the problem of sentiment analysis. In particular, we target three sub-tasks, namely sentiment word extraction, detection of the polarity of sentiment words, and text sentiment prediction. We investigate the effectiveness of vector representations over different text data and evaluate the quality of domain-dependent vectors. Vector representations are used to compute various vector-based features, and systematic experiments are conducted to demonstrate their effectiveness. Using simple vector-based features can achieve better results for APP text sentiment analysis.
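One common vector-based feature of the kind evaluated here, shown as a hedged sketch: represent a text by the average of its word vectors and feed it to a linear classifier. The toy reviews, labels, and model sizes are assumptions, not the paper's data.

```python
# A minimal sketch: averaged word vectors as text features for sentiment.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

texts = [["great", "app", "love", "it"],
         ["terrible", "crashes", "often"],
         ["love", "the", "interface"],
         ["awful", "terrible", "update"]]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

w2v = Word2Vec(texts, vector_size=16, min_count=1, epochs=100)

def avg_vector(tokens):
    """Text feature = mean of the word vectors of its tokens."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.vstack([avg_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # predicted sentiment for the toy texts
```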


2020 · Vol 34 (08) · pp. 13369-13381
Author(s):  
Shivashankar Subramanian ◽  
Ioana Baldini ◽  
Sushma Ravichandran ◽  
Dmitriy A. Katz-Rogozhnikov ◽  
Karthikeyan Natesan Ramamurthy ◽  
...  

More than 200 generic drugs approved by the U.S. Food and Drug Administration for non-cancer indications have shown promise for treating cancer. Due to their long history of safe patient use, low cost, and widespread availability, repurposing of these drugs represents a major opportunity to rapidly improve outcomes for cancer patients and reduce healthcare costs. In many cases, there is already evidence of efficacy for cancer, but trying to manually extract such evidence from the scientific literature is intractable. In this emerging applications paper, we introduce a system to automate non-cancer generic drug evidence extraction from PubMed abstracts. Our primary contribution is to define the natural language processing pipeline required to obtain such evidence, comprising the following modules: querying, filtering, cancer type entity extraction, therapeutic association classification, and study type classification. Using the subject matter expertise on our team, we create our own datasets for these specialized domain-specific tasks. We obtain promising performance in each of the modules by utilizing modern language processing techniques and plan to treat them as baseline approaches for future improvement of individual components.
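A minimal sketch of the pipeline's first module (querying), assuming Biopython's Entrez interface to PubMed; the query term, email address, and result limit are placeholders, and the downstream modules (filtering, cancer type entity extraction, association and study type classification) are only indicated in comments.

```python
# A minimal sketch of the querying module via Biopython's Entrez API.
from Bio import Entrez

Entrez.email = "you@example.org"  # required by NCBI; placeholder address

def query_pubmed(term, retmax=100):
    """Return PubMed IDs matching a drug-plus-cancer evidence query."""
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

ids = query_pubmed('"metformin" AND "cancer"')  # hypothetical generic drug
# Downstream modules would fetch abstracts (Entrez.efetch), filter them,
# extract cancer-type entities, and classify therapeutic associations.
print(len(ids), ids[:5])
```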


2022 · Vol 3 (1) · pp. 1-16
Author(s):  
Haoran Ding ◽  
Xiao Luo

Searching, reading, and finding information in massive medical text collections is challenging. A typical biomedical search engine cannot feasibly navigate each article to find critical information or keyphrases, and few tools provide a visualization of the phrases relevant to a query. There is therefore a need to extract the keyphrases from each document for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks, and their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether these self-attentions can be utilized to extract keyphrases from a document in an unsupervised manner and to identify relevancy between phrases in order to construct a query relevancy phrase graph that visualizes the search corpus phrases by their relevancy and importance. Comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset. This unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
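A minimal sketch of the core mechanism, assuming the Hugging Face transformers library: run a single unsupervised forward pass through BERT with attention outputs enabled and score each token by the attention it receives. The averaging heuristic below is an illustrative simplification of the paper's method.

```python
# A minimal sketch: score tokens by received self-attention in BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_attentions=True)

text = "Chloroquine was evaluated for antiviral activity against coronavirus."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# attentions: one (batch, heads, seq, seq) tensor per layer.
att = torch.stack(outputs.attentions).mean(dim=(0, 2))[0]  # (seq, seq)
received = att.sum(dim=0)  # total attention each token position receives

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
ranked = sorted(zip(tokens, received.tolist()), key=lambda x: -x[1])
print(ranked[:5])  # candidate keyphrase tokens (special tokens included)
```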

