PAPER2VEC AND CITE2VEC METHODS FOR ANALYZING COLLECTIONS OF SCIENTIFIC PUBLICATIONS

Author(s):  
N. I. Tikhonov

Collections of scientific publications are growing rapidly, and scientists have access to portals containing a large number of documents. Such a large amount of data is difficult to investigate. Document visualization methods are used to reduce labor costs, to search for relevant and similar documents, to evaluate the scientific contribution of particular publications, and to reveal hidden links between documents. These methods can be based on various models of document representation. In recent years, word embedding methods for natural language processing have become extremely popular, and following them, methods for obtaining vector representations of whole documents have begun to appear. Although many document analysis systems already exist, new methods can offer new insights into collections, scale better to large collections of documents, or uncover new relationships between documents. This article discusses two methods, Paper2vec and Cite2vec, that derive vector representations of documents from citation information. The text gives a brief description of both methods, describes experiments with them, including visualization of their results, and discusses the problems that arise.
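To make the citation-based embedding idea concrete, here is a minimal sketch, not the authors' implementation: Paper2vec-style document vectors can be learned by treating random walks over the citation graph as "sentences" for Word2vec (a DeepWalk-style simplification). The toy graph, walk parameters, and vector size below are assumptions.

```python
# A minimal sketch: random walks over a citation graph fed to Word2Vec.
import random
from gensim.models import Word2Vec

# Hypothetical toy citation graph: paper id -> ids of papers it cites.
citations = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": ["p4"],
    "p4": ["p1"],
}

def random_walks(graph, walks_per_node=10, walk_length=8):
    """Generate short random walks; each walk acts as one 'sentence'."""
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_length - 1):
                neighbours = graph.get(node, [])
                if not neighbours:
                    break
                node = random.choice(neighbours)
                walk.append(node)
            walks.append(walk)
    return walks

# Skip-gram embeddings over the walks; vector_size is an assumption.
model = Word2Vec(random_walks(citations), vector_size=64,
                 window=4, min_count=1, sg=1, epochs=20)
print(model.wv.most_similar("p1"))  # papers near p1 in embedding space
```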

2021 · Vol 19 (3) · pp. 61-69
Author(s):  
N. I. Tikhonov

Visualizations are used to better understand collections of scientific publications, and various methods of analyzing text collections can be used to build them. This article discusses two methods, Paper2vec and Cite2vec, that obtain vector representations of documents using citation information. To demonstrate how these techniques work and to give an example of their application, visualizations were developed and are described in this paper.
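As a sketch of the visualization step, and assuming document vectors like those learned above, the embeddings can be projected to two dimensions with t-SNE and plotted; the placeholder vectors and t-SNE settings are illustrative, not the paper's actual setup.

```python
# A minimal sketch: project document vectors to 2-D and scatter-plot them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 64))  # placeholder document embeddings
coords = TSNE(n_components=2, perplexity=15,
              random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("Document embeddings (t-SNE projection)")
plt.show()
```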


Author(s):  
Nidheesh Melethadathil ◽  
Jaap Heringa ◽  
Bipin Nair ◽  
Shyam Diwakar

With the rapid growth in the number of scientific publications in domains such as neuroscience and medicine, visually interlinking documents in online databases such as PubMed, with the purpose of indicating the context of query results, can improve the multi-disciplinary relevance of search results. Translational medicine and systems biology rely on studies relating basic sciences to applications, often spanning multiple disciplinary domains. This paper focuses on the design and development of a new scientific document visualization platform, which allows inferring translational aspects in the biosciences within published articles using machine learning and natural language processing (NLP) methods. From online databases, this software platform effectively extracted relationship connections between multiple sub-domains within neuroscience, derived from abstracts related to a user query. In our current implementation, the document visualization platform employs two clustering algorithms, namely Suffix Tree Clustering (STC) and LINGO. Clustering quality was improved by mapping top-ranked cluster labels derived from the UMLS Metathesaurus using a scoring function. To avoid non-clustered documents, an iterative scheme called auto-clustering was developed, which allowed documents left uncategorized during the initial grouping process to be mapped to relevant clusters. The efficacy of this document clustering and visualization platform was evaluated by expert-based validation of clustering results obtained with unique search terms. Compared to normal clustering, auto-clustering demonstrated better efficacy by generating larger numbers of unique and relevant cluster labels. Using this implementation, a Parkinson's disease systems theory model was developed, and studies based on user queries related to neuroscience and oncology are showcased as applications.
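For illustration only: the platform itself uses Suffix Tree Clustering and LINGO (both available in the Carrot2 project), but the basic step of grouping abstracts by content can be sketched with TF-IDF and k-means as an analogous stand-in; the toy abstracts and cluster count are assumptions.

```python
# An analogous illustration, not the STC/LINGO algorithms the paper uses:
# group abstracts by lexical content with TF-IDF features and k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [  # toy data standing in for retrieved PubMed abstracts
    "dopamine neurons in parkinson disease models",
    "deep brain stimulation outcomes in parkinson patients",
    "tumor suppressor genes in oncology screening",
    "chemotherapy response prediction in cancer cohorts",
]
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: a neuroscience group and an oncology group
```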


2017
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to Word2vec models, where vectors of closely related words lie in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing the vectors of their individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which applies the same Word2vec concept to protein sequences, resulting in a proteochemometric approach that is alignment-independent and can thus also be easily used for proteins with low sequence similarity.
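A minimal sketch of the Mol2vec encoding, assuming RDKit and Gensim: Morgan substructure identifiers play the role of words, a Word2vec model supplies their vectors, and a compound vector is the sum over its substructures. The toy corpus, radius, and vector size are assumptions; the real corpus covers all available chemical matter.

```python
# A minimal sketch of the Mol2vec idea, not the released pre-trained model.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from gensim.models import Word2Vec

def substructure_ids(smiles, radius=1):
    """Morgan identifiers for every atom environment up to `radius`."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprint(mol, radius)
    return [str(i) for i in fp.GetNonzeroElements()]

# Toy corpus: each compound is a 'sentence' of substructure identifiers.
corpus = [substructure_ids(s) for s in ["CCO", "CCN", "c1ccccc1O"]]
model = Word2Vec(corpus, vector_size=32, min_count=1, sg=1, epochs=50)

def mol2vec(smiles):
    """Compound vector = sum of its substructure vectors."""
    return np.sum([model.wv[i] for i in substructure_ids(smiles)
                   if i in model.wv], axis=0)

print(mol2vec("CCO")[:5])  # first few dimensions of the compound vector
```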


2021 · pp. 105971232098304
Author(s):  
R Alexander Bentley ◽  
Joshua Borycz ◽  
Simon Carrignon ◽  
Damian J Ruck ◽  
Michael J O’Brien

The explosion of online knowledge has made knowledge, paradoxically, difficult to find. A web or journal search might retrieve thousands of articles, ranked in a manner that is biased by, for example, popularity or eigenvalue centrality rather than by informed relevance to the complex query. With hundreds of thousands of articles published each year, the dense, tangled thicket of knowledge grows even more entwined. Although natural language processing and new methods of generating knowledge graphs can extract increasingly high-level interpretations from research articles, the results are inevitably biased toward recent, popular, and/or prestigious sources. This is a result of the inherent nature of human social-learning processes. To preserve and even rediscover lost scientific ideas, we employ the theory that scientific progress is punctuated by means of inspired, revolutionary ideas at the origin of new paradigms. Using a brief case example, we suggest how phylogenetic inference might be used to rediscover potentially useful lost discoveries, as a way in which machines could help drive revolutionary science.


2021 · Vol 16 (1) · pp. 11
Author(s):  
Klaus Rechert ◽  
Jurek Oberhauser ◽  
Rafael Gieschke

Software, and in particular source code, has become an important component of scientific publications and is henceforth subject to research data management. Maintaining source code such that it remains a usable and valuable scientific contribution is a huge task, and not all code contributions can be actively maintained forever. Eventually, there will be a significant backlog of legacy source code. In this article we analyse the requirements for applying the concept of long-term reusability to source code. We use a simple case study to identify gaps, and we provide a technical infrastructure, based on emulation, to support automated builds of historic software from source code.
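As an illustrative sketch only: the article's infrastructure is emulator-based, but the idea of an automated, reproducible build of historic source code can be approximated by driving a pinned container image from a script; the image name and build command below are assumptions.

```python
# A minimal sketch: run a historic build toolchain in a pinned, isolated
# environment. A container stands in for the emulator-based setup.
import subprocess
from pathlib import Path

def build_legacy(src_dir: Path, image: str = "debian:7", cmd: str = "make"):
    """Build legacy source inside a fixed environment; returns the result."""
    return subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{src_dir.resolve()}:/src", "-w", "/src",
         image, "sh", "-c", cmd],
        capture_output=True, text=True, check=False)

result = build_legacy(Path("./legacy-project"))  # hypothetical project dir
print(result.returncode, result.stdout[-200:] if result.stdout else "")
```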


2021 · Vol 102 · pp. 02001
Author(s):  
Anja Wilhelm ◽  
Wolfgang Ziegler

The primary focus of technical communication (TC) in the past decade has been the system-assisted generation and utilization of standardized, structured, and classified content for dynamic output solutions. Nowadays, machine learning (ML) approaches offer a new opportunity to integrate unstructured data into existing knowledge bases without the need to manually organize information into topic-based content enriched with semantic metadata. To make the field of artificial intelligence (AI) more accessible to technical writers and content managers, cloud-based machine learning as a service (MLaaS) solutions provide a starting point for domain-specific ML modelling while relieving the modelling process of extensive coding, data processing, and storage demands. Information architects can therefore focus on information extraction tasks and on prospects for including pre-existing knowledge from other systems in the ML modelling process. In this paper, the capability and performance of a cloud-based ML service, IBM Watson, are analysed to assess its value for semantic context analysis. The ML model is based on a supervised learning method and features deep learning (DL) and natural language processing (NLP) techniques. The subject of the analysis is a corpus of scientific publications on the 2019 coronavirus disease (COVID-19). The analysis focuses on information extraction regarding preventive measures and the effects of the pandemic on healthcare workers.
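A minimal sketch of calling a cloud MLaaS endpoint for text analysis, assuming the ibm-watson Python SDK and the Natural Language Understanding service; the API key, service URL, version date, and feature choices are placeholders, and the paper's own Watson configuration may differ.

```python
# A minimal sketch using the ibm-watson SDK; credentials are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import (
    Features, EntitiesOptions, KeywordsOptions)

nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlu.set_service_url("YOUR_SERVICE_URL")

abstract = ("Healthcare workers reported an increased workload and the use "
            "of protective measures during the pandemic.")
result = nlu.analyze(
    text=abstract,
    features=Features(entities=EntitiesOptions(limit=10),
                      keywords=KeywordsOptions(limit=10))).get_result()
print([k["text"] for k in result["keywords"]])  # extracted key terms
```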


Vector representations of language have been shown to be useful in a number of natural language processing tasks. In this paper, we investigate the effectiveness of word vector representations for the problem of sentiment analysis. In particular, we target three sub-tasks, namely sentiment word extraction, detection of the polarity of sentiment words, and text sentiment prediction. We investigate the effectiveness of vector representations over different text data and evaluate the quality of domain-dependent vectors. Vector representations are used to compute various vector-based features, and systematic experiments are conducted to demonstrate their effectiveness. Using simple vector-based features can achieve better results for APP text sentiment analysis.
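One common vector-based feature of the kind evaluated here, shown as a hedged sketch: represent a text by the average of its word vectors and feed it to a linear classifier. The toy reviews, labels, and model sizes are assumptions, not the paper's data.

```python
# A minimal sketch: averaged word vectors as text features for sentiment.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

texts = [["great", "app", "love", "it"],
         ["terrible", "crashes", "often"],
         ["love", "the", "interface"],
         ["awful", "terrible", "update"]]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

w2v = Word2Vec(texts, vector_size=16, min_count=1, epochs=100)

def avg_vector(tokens):
    """Text feature = mean of the word vectors of its tokens."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.vstack([avg_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))  # predicted sentiment for the toy texts
```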


2020 · Vol 34 (08) · pp. 13369-13381
Author(s):  
Shivashankar Subramanian ◽  
Ioana Baldini ◽  
Sushma Ravichandran ◽  
Dmitriy A. Katz-Rogozhnikov ◽  
Karthikeyan Natesan Ramamurthy ◽  
...  

More than 200 generic drugs approved by the U.S. Food and Drug Administration for non-cancer indications have shown promise for treating cancer. Due to their long history of safe patient use, low cost, and widespread availability, repurposing of these drugs represents a major opportunity to rapidly improve outcomes for cancer patients and reduce healthcare costs. In many cases, there is already evidence of efficacy for cancer, but trying to manually extract such evidence from the scientific literature is intractable. In this emerging applications paper, we introduce a system to automate non-cancer generic drug evidence extraction from PubMed abstracts. Our primary contribution is to define the natural language processing pipeline required to obtain such evidence, comprising the following modules: querying, filtering, cancer type entity extraction, therapeutic association classification, and study type classification. Using the subject matter expertise on our team, we create our own datasets for these specialized domain-specific tasks. We obtain promising performance in each of the modules by utilizing modern language processing techniques and plan to treat them as baseline approaches for future improvement of individual components.
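A minimal sketch of the pipeline's first module (querying), assuming Biopython's Entrez interface to PubMed; the query term, email address, and result limit are placeholders, and the downstream modules (filtering, cancer type entity extraction, association and study type classification) are only indicated in comments.

```python
# A minimal sketch of the querying module via Biopython's Entrez API.
from Bio import Entrez

Entrez.email = "you@example.org"  # required by NCBI; placeholder address

def query_pubmed(term, retmax=100):
    """Return PubMed IDs matching a drug-plus-cancer evidence query."""
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

ids = query_pubmed('"metformin" AND "cancer"')  # hypothetical generic drug
# Downstream modules would fetch abstracts (Entrez.efetch), filter them,
# extract cancer-type entities, and classify therapeutic associations.
print(len(ids), ids[:5])
```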


2022 · Vol 3 (1) · pp. 1-16
Author(s):  
Haoran Ding ◽  
Xiao Luo

Searching, reading, and finding information in massive medical text collections is challenging. A typical biomedical search engine cannot feasibly navigate each article to find critical information or keyphrases, and few tools provide a visualization of the phrases relevant to a query. There is therefore a need to extract the keyphrases from each document for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks, and their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether these self-attentions can be utilized to extract keyphrases from a document in an unsupervised manner and to identify relevancy between phrases in order to construct a query relevancy phrase graph that visualizes the search corpus phrases by their relevancy and importance. Comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset. This unsupervised keyphrase extraction model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
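A minimal sketch of the core mechanism, assuming the Hugging Face transformers library: run a single unsupervised forward pass through BERT with attention outputs enabled and score each token by the attention it receives. The averaging heuristic below is an illustrative simplification of the paper's method.

```python
# A minimal sketch: score tokens by received self-attention in BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_attentions=True)

text = "Chloroquine was evaluated for antiviral activity against coronavirus."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# attentions: one (batch, heads, seq, seq) tensor per layer.
att = torch.stack(outputs.attentions).mean(dim=(0, 2))[0]  # (seq, seq)
received = att.sum(dim=0)  # total attention each token position receives

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
ranked = sorted(zip(tokens, received.tolist()), key=lambda x: -x[1])
print(ranked[:5])  # candidate keyphrase tokens (special tokens included)
```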

