Measuring novelty in science with word embedding

PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254034
Author(s):  
Sotaro Shibayama ◽  
Deyun Yin ◽  
Kuniko Matsumoto

Novelty is a core value in science, and a reliable measurement of novelty is crucial. This study proposes a new approach to measuring the novelty of scientific articles based on both citation data and text data. The proposed approach considers an article to be novel if it cites a combination of semantically distant references. To this end, we first assign a word embedding, i.e., a vector representation, to each cited reference on the basis of the text information included in the reference. With these vectors, a distance between every pair of references is computed. Finally, the novelty of a focal document is evaluated by summarizing the distances across all pairs of references. The approach draws on limited text information (the titles of references) and a publicly shared word-embedding library, which minimizes data-access requirements and computational cost. We share the code, with which one can compute the novelty score of any document of interest from its reference list alone. We validate the proposed measure through three exercises. First, we confirm that word embeddings can be used to quantify semantic distances between documents by comparison with an established bibliometric distance measure. Second, we confirm the criterion-related validity of the proposed novelty measure against self-reported novelty scores collected from a questionnaire survey. Finally, as novelty is known to be correlated with future citation impact, we confirm that the proposed measure predicts future citations.
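
The pipeline lends itself to a short sketch. The following Python code is an illustration only, not the authors' released code: the choice of pretrained model, the averaging of title word vectors, and the mean as the summary statistic are all assumptions.

import numpy as np
import gensim.downloader as api

# Illustrative sketch: embed each cited reference by averaging pretrained
# word vectors over its title, then summarize pairwise cosine distances.
model = api.load("glove-wiki-gigaword-100")  # assumed model choice

def title_vector(title):
    words = [w for w in title.lower().split() if w in model]
    return np.mean([model[w] for w in words], axis=0) if words else None

def novelty(reference_titles):
    vecs = [v for v in map(title_vector, reference_titles) if v is not None]
    dists = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            cos = np.dot(vecs[i], vecs[j]) / (
                np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
            dists.append(1.0 - cos)  # cosine distance between two references
    return np.mean(dists)  # higher = the paper combines more distant references

print(novelty(["Deep learning for protein folding",
               "Medieval trade routes in the Baltic"]))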

2018 ◽  
Vol 15 (4) ◽  
pp. 29-44 ◽  
Author(s):  
Yi Zhao ◽  
Chong Wang ◽  
Jian Wang ◽  
Keqing He

With the rapid growth of web services on the internet, web service discovery has become a hot topic in services computing. Faced with heterogeneous and unstructured service descriptions, many service clustering approaches have been proposed to promote web service discovery, and many other approaches have leveraged auxiliary features to enhance the classical LDA model and achieve better clustering performance. However, these extended LDA approaches still have limitations in handling data sparsity and noise words. This article proposes a novel web service clustering approach that incorporates LDA with word embedding, leveraging relevant words obtained from word embeddings to improve the performance of web service clustering. Specifically, Word2vec is used to train the word embeddings, and the semantically relevant words of service keywords obtained from these embeddings are then incorporated into the LDA training process. Finally, experiments conducted on a real-world dataset published on ProgrammableWeb show that the authors' proposed approach achieves better clustering performance than several classical approaches.
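
As a rough illustration of this idea, the sketch below expands each service description with its keywords' nearest neighbours from a Word2vec model before LDA training; the toy corpus and the expansion size k are assumptions, not the paper's setup.

from gensim.models import Word2Vec, LdaModel
from gensim.corpora import Dictionary

# Toy service descriptions standing in for ProgrammableWeb data.
docs = [["map", "geocoding", "routing"],
        ["payment", "checkout", "invoice"],
        ["weather", "forecast", "temperature"]]

w2v = Word2Vec(docs, vector_size=50, window=2, min_count=1, epochs=200)

def expand(doc, k=2):
    # Add each keyword's k nearest embedding neighbours to densify
    # the sparse description before topic modeling.
    extra = []
    for word in doc:
        extra += [w for w, _ in w2v.wv.most_similar(word, topn=k)]
    return doc + extra

expanded = [expand(d) for d in docs]
dictionary = Dictionary(expanded)
corpus = [dictionary.doc2bow(d) for d in expanded]
lda = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)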


2021 ◽  
Vol 83 (1) ◽  
pp. 72-79
Author(s):  
O.A. Kan ◽  
N.A. Mazhenov ◽  
K.B. Kopbalina ◽  
G.B. Turebaeva ◽  
...  

The main problem: The article deals with hiding text information in a graphic file. A formula for hiding text information in image pixels is proposed, and a steganography scheme for embedding secret text in random image pixels has been developed. Random bytes are pre-embedded in each row of pixels of the source image, yielding a key image. The text codes are then embedded in random bytes of pixels of a given RGB channel. The characters of the ASCII code table are used to form the secret message. Demo encryption and decryption programs have been developed in the Python 3.5.2 programming language; a graphic file is used as the decryption key.

Purpose: To develop an algorithm for embedding text information in random pixels of an image.

Methods: Among the methods of hiding information in graphic images, the LSB method is widely used: the lower bits of the image bytes responsible for color encoding are replaced by the bits of the secret message. Analysis of methods for hiding information in graphic files and modeling of the algorithms showed an increased level of protection of the hidden information against detection.

Results and their significance: With the proposed steganography scheme and the algorithm for embedding the bytes of a secret message in a graphic file, protection against detection of the hidden information is significantly increased. The advantage of this scheme is that decryption uses a key image in which random bytes are pre-embedded. In addition, all pixel bits of the container image remain available to display color shades. The developed scheme also allows not only the transmission of secret information but also the addition of digital fingerprints or hidden tags to the image.
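
For orientation, here is a minimal Python sketch of the classic LSB technique the article builds on: ASCII text is hidden in the least significant bit of one RGB channel. The key image and the random-pixel selection of the article's own scheme are deliberately omitted.

import numpy as np
from PIL import Image

def hide(image_path, text, out_path, channel=2):  # channel 2 = blue
    img = np.array(Image.open(image_path).convert("RGB"))
    bits = np.array([int(b) for ch in text.encode("ascii")
                     for b in format(ch, "08b")], dtype=np.uint8)
    flat = img[:, :, channel].flatten()
    if bits.size > flat.size:
        raise ValueError("message too long for this image")
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits  # overwrite LSBs
    img[:, :, channel] = flat.reshape(img.shape[:2])
    Image.fromarray(img).save(out_path)  # save lossless, e.g. PNG

def reveal(image_path, n_chars, channel=2):
    img = np.array(Image.open(image_path).convert("RGB"))
    bits = img[:, :, channel].flatten()[:n_chars * 8] & 1
    return bytes(int("".join(map(str, bits[i:i + 8])), 2)
                 for i in range(0, bits.size, 8)).decode("ascii")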


Author(s):  
Miroslav Kubát ◽  
Jan Hůla ◽  
Xinying Chen ◽  
Radek Čech ◽  
Jiří Milička

Abstract: This is a pilot study of the usability of the Context Specificity measure for stylometric purposes. Specifically, the word embedding (Word2vec) approach, based on measuring the lexical context similarity between lemmas, is applied to the analysis of texts belonging to different styles. Three types of Czech texts are investigated: fiction, non-fiction, and journalism. Forty lemmas were observed (10 lemmas each for verbs, nouns, adjectives, and adverbs). The aim of the present study is to introduce the concept of Context Specificity and to test whether this measurement is sensitive to different styles. The results show that the proposed method, Closest Context Specificity (CCS), is a corpus-size-independent method with promising potential for analyzing different styles.
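
Lacking the paper's exact definition, one plausible reading of a closest-context measure can be sketched as follows: for each target lemma, average the cosine similarity to its n nearest neighbours in a style-specific Word2vec space. The formula and the toy corpus are assumptions of this illustration.

from gensim.models import Word2Vec

# Toy corpus standing in for one style-specific corpus (fiction, news, ...).
fiction = [["the", "old", "house", "stood", "silent", "in", "the", "dark"],
           ["she", "walked", "slowly", "through", "the", "quiet", "house"]]

model = Word2Vec(fiction, vector_size=50, window=3, min_count=1, epochs=200)

def closest_context_specificity(model, lemma, n=5):
    # Higher values suggest the lemma keeps a tight, specific lexical context.
    neighbours = model.wv.most_similar(lemma, topn=n)
    return sum(sim for _, sim in neighbours) / n

for lemma in ["house", "slowly"]:
    print(lemma, closest_context_specificity(model, lemma))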


2021 ◽  
Vol 2083 (4) ◽  
pp. 042044
Author(s):  
Zuhua Dai ◽  
Yuanyuan Liu ◽  
Shilong Di ◽  
Qi Fan

Abstract: Aspect-level sentiment analysis belongs to fine-grained sentiment analysis, which has attracted extensive research in academic circles in recent years. For this task, a recurrent neural network (RNN) model is usually used for feature extraction, but such a model cannot effectively capture the structural information of the text. Recent studies have begun to use the graph convolutional network (GCN) to model the syntactic dependency tree of the text to solve this problem. For short text data, the text alone is often insufficient to determine the sentiment polarity of the aspect words, and the knowledge graph has not been effectively used as external knowledge that could enrich the semantic information. To solve the above problems, this paper proposes a graph convolutional network (GCN) model that can process syntactic information, knowledge graphs, and textual semantic information. The model works on a "syntax-knowledge" graph to extract syntactic information and common-sense information at the same time. Compared with the latest models, the proposed model effectively improves the accuracy of aspect-level sentiment classification on two datasets.
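
To make the core operation concrete, here is a toy sketch of a single GCN layer propagating token features over a dependency-tree adjacency matrix; the paper's "syntax-knowledge" graph would add knowledge-graph edges to A, which this illustration omits.

import numpy as np

def gcn_layer(A, X, W):
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))      # degree normalization
    return np.maximum(D_inv @ A_hat @ X @ W, 0)   # ReLU activation

# 4 tokens linked in a chain, as in a short dependency parse.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 8)   # token embeddings (8-dim, illustrative)
W = np.random.randn(8, 8)   # learnable weights in a real model
H = gcn_layer(A, X, W)      # syntax-aware token representations
print(H.shape)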


Author(s):  
Xiang Lisa Li ◽  
Jason Eisner

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.
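
As a sketch of the core mechanism (not the authors' implementation), the following PyTorch module compresses an embedding into a low-dimensional Gaussian code with the reparameterization trick; the dimensions and loss weighting are placeholder assumptions, and the parser-specific objective is omitted.

import torch
import torch.nn as nn

class VIBCompressor(nn.Module):
    def __init__(self, in_dim=1024, code_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, code_dim)
        self.logvar = nn.Linear(in_dim, code_dim)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        # Reparameterization trick: sample a stochastic compressed code.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL divergence to a standard normal prior: the "compression" term.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl

vib = VIBCompressor()
x = torch.randn(32, 1024)   # a batch of contextual word embeddings
z, kl = vib(x)
loss = 1e-3 * kl            # plus a task loss (e.g. parser loss) in practice
print(z.shape, kl.item())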


Author(s):  
Farhad Bin Siddique ◽  
Dario Bertero ◽  
Pascale Fung

We propose a multilingual model to recognize Big Five personality traits from text data in four different languages: English, Spanish, Dutch, and Italian. Our analysis shows that words having a similar semantic meaning in different languages do not necessarily correspond to the same personality traits. Therefore, we propose a personality alignment method, GlobalTrait, which has a mapping for each trait from the source language to the target language (English), such that words that correlate positively with each trait are close together in the multilingual vector space. Using these aligned embeddings for training, we can transfer personality-related training features from high-resource languages such as English to other low-resource languages and obtain better multilingual results than with simple monolingual and unaligned multilingual embeddings. We achieve an average F-score increase (across the three languages other than English) from 65 to 73.4 (+8.4) when comparing our monolingual model to the multilingual model using a CNN with personality-aligned embeddings. We also show relatively good performance on the regression tasks, and better classification results when evaluating our model on a separate Chinese dataset.
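
Cross-lingual mappings of this kind are commonly learned with an orthogonal Procrustes solution; the sketch below shows that standard tool under illustrative data, whereas GlobalTrait learns one such mapping per trait over trait-correlated words.

import numpy as np

def procrustes_align(X_src, Y_tgt):
    # Find the orthogonal W minimizing ||X_src @ W - Y_tgt||_F.
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Rows are embeddings of anchor word pairs (illustrative random data).
X_src = np.random.randn(200, 300)   # e.g. Spanish vectors
Y_tgt = np.random.randn(200, 300)   # corresponding English vectors
W = procrustes_align(X_src, Y_tgt)
aligned = X_src @ W                 # Spanish words mapped into English space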


Author(s):  
Tatsunori B. Hashimoto ◽  
David Alvarez-Melis ◽  
Tommi S. Jaakkola

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with a Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement the recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.
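
The metric-recovery framing admits a small worked example: treat negative log co-occurrence as an estimate of squared semantic distance and recover coordinates by classical multidimensional scaling. The toy counts and the direct use of MDS are simplifications of the paper's random-walk model.

import numpy as np

def classical_mds(D2, dim=2):
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]     # keep the top-dim eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

C = np.array([[100., 50., 5.],
              [50., 100., 10.],
              [5., 10., 100.]])            # toy symmetric co-occurrence counts
D2 = -np.log(C / C.max())                  # squared-distance estimate
coords = classical_mds(D2, dim=2)          # recovered "word vectors"
print(coords)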


2018 ◽  
Vol 6 (3) ◽  
pp. 67-78
Author(s):  
Tian Nie ◽  
Yi Ding ◽  
Chen Zhao ◽  
Youchao Lin ◽  
Takehito Utsuro

The background of this article is the issue of how to give an overview of the knowledge around a given query keyword. In particular, the authors focus on the concerns of those who search for web pages with a given query keyword. The Web search information needs of a given query keyword are collected through search engine suggestions. Given a query keyword, the authors collect up to around 1,000 suggestions, many of which are redundant. They classify redundant search engine suggestions based on a topic model. However, one limitation of the topic-model-based classification of search engine suggestions is that the granularity of the topics, i.e., the clusters of suggestions, is too coarse. In order to overcome this coarse-grained classification, this article further applies the word embedding technique to the web pages used during the training of the topic model, in addition to the text data of the whole Japanese version of Wikipedia. The authors then examine the word-embedding-based similarity between search engine suggestions and further classify the suggestions within a single topic into finer-grained subtopics based on the similarity of their word embeddings. Evaluation results show that the proposed approach performs well in the task of subtopic classification of search engine suggestions.
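
A rough sketch of the subtopic step: embed each suggestion by averaging pretrained word vectors and cluster hierarchically by cosine distance. The pretrained model and the clustering choice are assumptions; the paper trains embeddings on Wikipedia plus the topic model's web pages.

import numpy as np
import gensim.downloader as api
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

model = api.load("glove-wiki-gigaword-100")

def embed(phrase):
    # Average pretrained word vectors over the suggestion's tokens.
    return np.mean([model[w] for w in phrase.lower().split() if w in model],
                   axis=0)

suggests = ["python tutorial", "python install windows", "python download",
            "python snake", "python habitat"]
X = np.vstack([embed(s) for s in suggests])
Z = linkage(pdist(X, metric="cosine"), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # split into two subtopics
for label, s in sorted(zip(labels, suggests)):
    print(label, s)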


2020 ◽  
Vol 10 (19) ◽  
pp. 6893
Author(s):  
Yerai Doval ◽  
Jesús Vilares ◽  
Carlos Gómez-Rodríguez

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts in the form of tweets and other types of non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words, which are artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts on a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining good performance on standard texts. To the best of our knowledge, this is the first explicit approach to dealing with these types of noisy texts at the word embedding level that goes beyond the support for out-of-vocabulary words.
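
A hedged illustration of the idea: insert an artificial token shared by a standard word and its noisy variants into the training stream so that the skipgram objective pulls their vectors together. The variant dictionary and insertion rule here are assumptions of the sketch, not the paper's construction.

from gensim.models import Word2Vec

# Hypothetical normalization dictionary mapping standard words to variants.
variants = {"tomorrow": ["tmrw", "2morrow"], "great": ["gr8"]}

def with_bridges(sentences):
    out = []
    for sent in sentences:
        new = []
        for tok in sent:
            new.append(tok)
            for std, noisy in variants.items():
                if tok == std or tok in noisy:
                    new.append(f"BRIDGE_{std}")  # artificial bridge token
        out.append(new)
    return out

corpus = [["see", "you", "tmrw", "it", "was", "great"],
          ["see", "you", "tomorrow", "gr8", "time"]]
model = Word2Vec(with_bridges(corpus), vector_size=50, min_count=1,
                 sg=1, epochs=300)  # sg=1 selects the skipgram objective
print(model.wv.similarity("tomorrow", "tmrw"))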

