Interpretable Document Representations for Fast and Accurate Retrieval of Mathematical Information

Motivated by applying Text Categorization to classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers - Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web-page. Different transformations of input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics - accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.
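As a rough illustration of the transformations named in this abstract, the sketch below builds bag-of-words vectors with optional logtf, idf, and normalization steps. The formulas are assumptions taken from common textbook definitions (logtf as log(1 + tf), idf as log(N / df), L2 normalization) and may differ from the ones used in the paper; the function name `bow_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bow_vectors(docs, use_logtf=True, use_idf=True, normalize=True):
    """Return one {term: weight} dict per tokenized document.

    Assumed definitions (not confirmed by the abstract):
      logtf = log(1 + tf), idf = log(N / df), normalization = L2.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {}
        for term, tf in Counter(doc).items():
            w = math.log(1 + tf) if use_logtf else float(tf)
            if use_idf:
                w *= math.log(n_docs / df[term])
            weights[term] = w
        if normalize:
            norm = math.sqrt(sum(w * w for w in weights.values()))
            if norm > 0:
                weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


# Toy, already-tokenized documents (hypothetical data).
docs = [["web", "search", "web"], ["search", "engine"], ["topic", "hierarchy"]]
vecs = bow_vectors(docs)
```

Each transformation is a separate switch, so individual effects can be compared by toggling one flag at a time, which mirrors the per-transformation focus the abstract describes.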

Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Information Sciences ◽

10.1016/j.ins.2018.10.006 ◽

2019 ◽

Vol 477 ◽

pp. 15-29 ◽

Cited By ~ 60

Author(s):

Donghwa Kim ◽

Deokseong Seo ◽

Suhyoun Cho ◽

Pilsung Kang

Keyword(s):

Document Classification ◽

Document Representations

Tripartite-Replicated Softmax Model for Document Representations

Lecture Notes in Computer Science - Information Retrieval ◽

10.1007/978-3-319-68699-8_9 ◽

2017 ◽

pp. 109-121

Author(s):

Bo Xu ◽

Hongfei Lin ◽

Lin Wang ◽

Yuan Lin ◽

Kan Xu ◽

...

Keyword(s):

Document Representations

Fusing Various Document Representations for Comparative Text Identification from Product Reviews

10.1007/978-3-030-87571-8_46 ◽

2021 ◽

pp. 531-543

Author(s):

Jing Liu ◽

Xiaoying Wang ◽

Lihua Huang

Keyword(s):

Product Reviews ◽

Document Representations

Document Representations (Inclusive Native and Relational)

10.1007/springerreference_63278 ◽

2011 ◽

Keyword(s):

Document Representations

Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0098 ◽

2018 ◽

Vol 29 (1) ◽

pp. 1109-1121

Author(s):

Mohsen Pourvali ◽

Salvatore Orlando

Keyword(s):

Clustering Algorithms ◽

Ensemble Clustering ◽

British Broadcasting Corporation ◽

Text Documents ◽

Classical Text ◽

Text Corpora ◽

Clustering Quality ◽

Semantic Expansion ◽

Document Representations

Abstract: This paper explores a multi-strategy technique that aims at enriching text documents to improve clustering quality. We use a combination of entity linking and document summarization to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, which in turn belong to a knowledge base such as Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora such as British Broadcasting Corporation (BBC) News.

Supervised Extraction of Usage Patterns in Different Document Representations

Solving Large Scale Learning Tasks. Challenges and Algorithms - Lecture Notes in Computer Science ◽

10.1007/978-3-319-41706-6_19 ◽

2016 ◽

pp. 346-361

Author(s):

Christian Pölitz

Keyword(s):

Usage Patterns ◽

Document Representations

Multi-Label Patent Categorization with Non-Local Attention-Based Graph Convolutional Network

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6435 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9024-9031

Author(s):

Pingjie Tang ◽

Meng Jiang ◽

Bryan (Ning) Xia ◽

Jed W. Pitera ◽

Jeffrey Welser ◽

...

Keyword(s):

Domain Knowledge ◽

High Performance ◽

State Of The Art ◽

Convolutional Network ◽

Attention Model ◽

Patent Classification ◽

International Patent ◽

Non Local ◽

Text Content ◽

Document Representations

Patent categorization, the task of assigning multiple International Patent Classification (IPC) codes to a patent document, relies heavily on expert effort, as it requires substantial domain knowledge. When formulated as a multi-label text classification (MTC) problem, it poses two challenges for existing models: one is to learn effective document representations from text content; the other is to model the cross-section behavior of the label set. In this work, we propose a label attention model based on a graph convolutional network. It jointly learns the document-word associations and word-word co-occurrences to generate rich semantic embeddings of documents. It employs a non-local attention mechanism to learn label representations in the same space as the document representations for multi-label classification. On a large CIRCA patent database, we evaluate the performance of our model and as many as seven competitive baselines. We find that our model outperforms all of these prior state-of-the-art methods by a large margin and achieves high performance on P@k and nDCG@k.
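For readers unfamiliar with the two ranking metrics mentioned at the end of this abstract, here is a minimal sketch of P@k and nDCG@k for a single document's ranked label list. It assumes binary relevance and the standard log2 discount; the paper's exact evaluation protocol is not given in the abstract, and the function names and label codes below are hypothetical.

```python
import math


def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k predicted labels that are truly relevant."""
    return sum(1 for label in ranked_labels[:k] if label in relevant) / k


def ndcg_at_k(ranked_labels, relevant, k):
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 has discount log2(2)
        for rank, label in enumerate(ranked_labels[:k])
        if label in relevant
    )
    # Ideal ranking places all relevant labels first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical label codes ranked by predicted score, and the true label set.
ranked = ["A", "B", "C", "D"]
relevant = {"A", "C"}
p2 = precision_at_k(ranked, relevant, 2)
n3 = ndcg_at_k(ranked, relevant, 3)
```

In a multi-label evaluation these per-document scores would typically be averaged over all test documents.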
