Interpretable Document Representations for Fast and Accurate Retrieval of Mathematical Information

Author(s):  
Vít Novotný
Author(s):  
Sebastian Arnold ◽  
Betty van Aken ◽  
Paul Grundmann ◽  
Felix A. Gers ◽  
Alexander Löser

2008 ◽  
Vol 18 (1) ◽  
pp. 123-138 ◽  
Author(s):  
Milos Radovanovic ◽  
Mirjana Ivanovic

Motivated by applying Text Categorization to classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers - Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web-page. Different transformations of input data: stemming, normalization, logtf and idf, together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics - accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.
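
The transformations named in this abstract (logtf, idf, normalization) can be illustrated with a minimal sketch. The abstract does not give the formulas, so the definitions below are assumptions based on common usage: logtf as log(1 + tf), idf as log(N / df), and normalization as scaling each vector to unit L2 length. The function name `bow_vectors` and its flags are hypothetical, and stemming and dimensionality reduction are left out.

```python
import math
from collections import Counter


def bow_vectors(docs, use_logtf=True, use_idf=True, normalize=True):
    """Build bag-of-words vectors with optional logtf, idf and L2 normalization.

    Assumed definitions (not taken from the paper):
      logtf = log(1 + tf), idf = log(N / df), normalization = unit L2 length.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = []
        for term in vocab:
            weight = float(tf[term])
            if use_logtf:
                weight = math.log(1.0 + weight)
            if use_idf:
                weight *= math.log(n_docs / df[term])
            vec.append(weight)
        if normalize:
            norm = math.sqrt(sum(x * x for x in vec))
            if norm > 0:
                vec = [x / norm for x in vec]
        vectors.append(vec)
    return vocab, vectors


# Each flag can be toggled independently, mirroring the idea of studying
# the effect of every individual transformation.
vocab, raw = bow_vectors(["web search results", "web page topics"],
                         use_logtf=False, use_idf=False, normalize=False)
```

Under the assumed idf, a term that occurs in every document (here `web`) gets weight zero, which is one way such a transformation can change what a classifier sees.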


2019 ◽  
Vol 477 ◽  
pp. 15-29 ◽  
Author(s):  
Donghwa Kim ◽  
Deokseong Seo ◽  
Suhyoun Cho ◽  
Pilsung Kang

Author(s):  
Bo Xu ◽  
Hongfei Lin ◽  
Lin Wang ◽  
Yuan Lin ◽  
Kan Xu ◽  
...  

2018 ◽  
Vol 29 (1) ◽  
pp. 1109-1121
Author(s):  
Mohsen Pourvali ◽  
Salvatore Orlando

This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments that mention the salient entities, which in turn belong to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.
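
The abstract does not say how the multiple clustering results are combined. As one possible illustration, the sketch below uses a co-association (evidence accumulation) scheme, a common consensus technique that is assumed here rather than taken from the paper: two documents end up in the same consensus cluster when more than a chosen fraction of the base clusterings group them together. The function names and the `threshold` parameter are hypothetical.

```python
def co_association(clusterings):
    """Fraction of base clusterings that put each pair of items together."""
    n = len(clusterings[0])
    m = len(clusterings)
    co = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / m
    return co


def consensus_clusters(clusterings, threshold=0.5):
    """Link items whose co-association exceeds the threshold, then return
    the connected components as consensus cluster ids."""
    co = co_association(clusterings)
    n = len(co)
    assignment = [-1] * n
    next_id = 0
    for start in range(n):
        if assignment[start] != -1:
            continue
        # Depth-first traversal over the thresholded co-association graph.
        assignment[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if assignment[j] == -1 and co[i][j] > threshold:
                    assignment[j] = next_id
                    stack.append(j)
        next_id += 1
    return assignment


# Three base clusterings of four documents, e.g. one per document
# representation; cluster ids need not agree across clusterings.
base = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
consensus = consensus_clusters(base, threshold=0.5)
```

Because only pairwise agreement is counted, the base clusterings can use different label numberings, which is convenient when each one comes from a different representation.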


2020 ◽  
Vol 34 (05) ◽  
pp. 9024-9031
Author(s):  
Pingjie Tang ◽  
Meng Jiang ◽  
Bryan (Ning) Xia ◽  
Jed W. Pitera ◽  
Jeffrey Welser ◽  
...  

Patent categorization, which is to assign multiple International Patent Classification (IPC) codes to a patent document, relies heavily on expert efforts, as it requires substantial domain knowledge. When formulated as a multi-label text classification (MTC) problem, it poses two challenges for existing models: one is to learn effective document representations from text content; the other is to model the cross-section behavior of the label set. In this work, we propose a label attention model based on a graph convolutional network. It jointly learns the document-word associations and word-word co-occurrences to generate rich semantic embeddings of documents. It employs a non-local attention mechanism to learn label representations in the same space as the document representations for multi-label classification. On a large CIRCA patent database, we evaluate the performance of our model and as many as seven competitive baselines. We find that our model outperforms all of these prior state-of-the-art baselines by a large margin and achieves high performance on P@k and nDCG@k.
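
For readers unfamiliar with the two metrics named at the end of this abstract, here is a minimal sketch of P@k and nDCG@k for a single document in a multi-label setting. It assumes binary relevance (a label is either correct or not) and the usual log2 position discount; the paper's exact evaluation code is not given in the abstract, and the label codes in the example are purely illustrative.

```python
import math


def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k predicted labels that are correct."""
    top_k = ranked_labels[:k]
    return sum(1 for label in top_k if label in relevant) / k


def ndcg_at_k(ranked_labels, relevant, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, label in enumerate(ranked_labels[:k])
        if label in relevant
    )
    # Ideal ranking: all relevant labels placed first, capped at k.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Illustrative labels only: predicted ranking versus the true label set.
ranked = ["G06F", "H04L", "G06N", "A61K"]
truth = {"G06F", "G06N"}
p_at_2 = precision_at_k(ranked, truth, 2)
ndcg_at_3 = ndcg_at_k(ranked, truth, 3)
```

P@k ignores the order within the top k, while nDCG@k rewards placing correct labels earlier, which is why the two are usually reported together.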

