document representation
Recently Published Documents


TOTAL DOCUMENTS: 198 (FIVE YEARS 61)
H-INDEX: 15 (FIVE YEARS 4)

Author(s):  
N. I. Tikhonov

Collections of scientific publications are growing rapidly, and the portals that give scientists access to them hold numbers of documents that are difficult to investigate by hand. Document visualization methods are used to reduce this labor: to search for relevant and similar documents, to evaluate the scientific contribution of particular publications, and to reveal hidden links between documents. Such visualization methods can be built on various models of document representation. In recent years, word embedding methods for natural language processing have become extremely popular, and following them, methods have appeared that produce vector representations of whole documents. Although many document analysis systems already exist, new methods can offer new insights into collections, scale better to large collections, or uncover new relationships between documents. This article discusses two methods, Paper2vec and Cite2vec, that obtain vector representations of documents from citation information. It briefly describes the two methods, reports experiments with them on collections of scientific publications, visualizes the results, and discusses the problems that arise.
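The abstract gives no implementation details, but the general idea of learning document vectors from citation links can be illustrated with a short, hedged sketch. The snippet below is not Paper2vec or Cite2vec itself: it simply treats each paper's reference list as a co-occurrence context and trains a skip-gram model over those contexts with gensim, so that papers cited together end up with similar vectors. All paper IDs and links are invented for illustration.

```python
# A minimal sketch of learning document vectors from citation links
# (illustrative only, not the authors' implementation).
from gensim.models import Word2Vec

# Hypothetical citation data: paper ID -> IDs of the papers it cites.
citations = {
    "P1": ["P2", "P3"],
    "P2": ["P3", "P4"],
    "P3": ["P4"],
    "P4": ["P1"],
}

# Each "sentence" is a citing paper followed by its references,
# so papers that cite or are cited together share contexts.
corpus = [[paper] + refs for paper, refs in citations.items()]

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=200)
print(model.wv.most_similar("P1"))  # papers most related to P1 by citation context
```

The resulting vectors could then be projected to 2D (e.g. with t-SNE) for the kind of collection visualization the article describes.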


Author(s):  
Jeow Li Huan ◽  
Arif Ahmed Sekh ◽  
Chai Quek ◽  
Dilip K. Prasad

Text classification is one of the most widely used tasks in natural language processing. State-of-the-art text classifiers extract features using the vector space model. Recent progress in deep models, notably recurrent neural networks that preserve the positional relationships among words, achieves higher accuracy. To push text classification accuracy even higher, multi-dimensional document representations, such as vector sequences or matrices combined with document sentiment, should be explored. In this paper, we show that documents can be represented as sequences of vectors carrying semantic meaning and classified using a recurrent neural network that recognizes long-range relationships. We show that in this representation, additional sentiment vectors can easily be attached to the word vectors through a fully connected layer to further improve classification accuracy. On the UCI sentiment-labelled dataset, using the sequence of vectors alone achieved an accuracy of 85.6%, better than the 80.7% of a ridge regression classifier, the best among the classical techniques we tested. Adding sentiment information further increases accuracy to 86.3%. On our suicide notes dataset, the best classical technique, the Naïve Bayes Bernoulli classifier, achieves an accuracy of 71.3%, while our classifier, incorporating semantic and sentiment information, exceeds it at 75% accuracy.
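As a rough illustration of the architecture described above, the Keras sketch below feeds a sequence of pre-computed word vectors to an LSTM and concatenates a separate sentiment feature vector before the final dense layers. It is a minimal sketch under assumed dimensions (sequence length, word-vector size, sentiment-vector size), not the authors' implementation.

```python
# Minimal sketch: word-vector sequence -> LSTM, with a sentiment vector
# concatenated before classification. Sizes below are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM, SENT_DIM = 100, 300, 4  # assumed dimensions

word_seq = layers.Input(shape=(MAX_LEN, EMB_DIM), name="word_vectors")
sentiment = layers.Input(shape=(SENT_DIM,), name="sentiment_features")

h = layers.LSTM(128)(word_seq)                  # captures long-range word order
h = layers.concatenate([h, sentiment])          # attach sentiment information
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(1, activation="sigmoid")(h)  # binary class label

model = Model(inputs=[word_seq, sentiment], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```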


2021 ◽  
Vol 445 ◽  
pp. 276-286
Author(s):  
Peng Yan ◽  
Linjing Li ◽  
Miaotianzi Jin ◽  
Daniel Zeng

Author(s):  
Soe Soe Lwin ◽  
Khin Thandar Nwet

There is an enormous amount of information available from different sources and genres, and extracting useful information from such massive data requires an automatic mechanism. Text summarization systems assist with content reduction by keeping the important information and filtering out the unimportant parts of the text. A good document representation is essential in text summarization for retrieving relevant information. Bag-of-words cannot capture syntactic or semantic similarity between words, whereas word embeddings yield a document representation that captures and encodes the semantic relations between words. Therefore, this paper employs a centroid method based on word embedding representations and proposes Myanmar news summarization based on different word embeddings. Myanmar local and international news are summarized with a centroid-based word-embedding summarizer that exploits the effectiveness of the word embedding representation. Experiments were conducted on a Myanmar local and international news dataset using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embeddings performs consistently better than centroid summarization using bag-of-words.
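The centroid-based approach lends itself to a brief sketch: average the word embeddings of the whole document to form a centroid, score each sentence by the cosine similarity of its own averaged embedding to that centroid, and keep the top-scoring sentences. The Python outline below uses a plain dictionary of word vectors standing in for a trained embedding model; it is an illustrative sketch, not the paper's Myanmar-specific pipeline.

```python
# Illustrative centroid-based extractive summarization with word embeddings.
import numpy as np

def embed(words, vectors, dim):
    """Average the embeddings of the words we have vectors for."""
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def summarize(sentences, vectors, dim=100, top_n=2):
    """Pick the top_n sentences closest to the document centroid."""
    tokenized = [s.lower().split() for s in sentences]
    centroid = embed([w for sent in tokenized for w in sent], vectors, dim)

    def score(sent_words):
        v = embed(sent_words, vectors, dim)
        denom = np.linalg.norm(v) * np.linalg.norm(centroid)
        return float(v @ centroid) / denom if denom else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]  # keep original order
```

Swapping the `vectors` dictionary for bag-of-words counts gives the baseline the paper compares against.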


Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3747
Author(s):  
Manar Mohamed Hafez ◽  
Rebeca P. Díaz Redondo ◽  
Ana Fernández Vilas ◽  
Héctor Olivera Pazó

With the exponential increase in available information, it has become imperative to design mechanisms that let users access what matters to them as quickly as possible. Recommendation systems (RS), made possible by the development of information technology, are intelligent systems that address this need: various types of data can be collected on items of interest to users and presented as recommendations. RS also play a very important role in e-commerce, where the purpose of recommending a product is to assign the most appropriate designation to that specific product. The major challenge when recommending products is insufficient information about the products and the categories to which they belong. In this paper, we transform the product data using two methods of document representation: bag-of-words (BOW) and the neural-network-based document embedding known as Doc2Vec. For each document representation method we propose three-criteria recommendation systems (product, package, and health) to foster online grocery shopping, which depend on product characteristics such as composition, packaging, nutrition table, allergens, and so forth. For our evaluation, we conducted a user and expert survey. Finally, we compared the performance of the three criteria for each document representation method, discovering that the neural-network-based method (Doc2Vec) performs better and completely alters the results.
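For the Doc2Vec side of the comparison, the sketch below shows how product texts could be turned into document vectors with gensim and queried by similarity. The product names and descriptions are invented for illustration; the paper's actual product features (composition, packaging, nutrition table, allergens) and its three-criteria systems are not reproduced here.

```python
# Minimal Doc2Vec sketch for product representation (illustrative data).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical product texts built from composition/packaging terms.
products = {
    "oat_milk":   "oat water rapeseed oil calcium carton recyclable",
    "soy_milk":   "soybean water calcium sugar carton recyclable",
    "whole_milk": "cow milk fat lactose plastic bottle",
}

corpus = [TaggedDocument(words=text.split(), tags=[pid])
          for pid, text in products.items()]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

# Products whose vectors are closest to a given product's vector.
print(model.dv.most_similar("oat_milk"))
```

A BOW baseline would replace the learned vectors with term-count vectors and use the same similarity query.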


Author(s):  
Qianqian Xie ◽  
Jimin Huang ◽  
Pan Du ◽  
Min Peng ◽  
Jian-Yun Nie
