Latent Semantic Analysis Boosted Convolutional Neural Networks for Document Classification

Author(s):  
Eren Gultepe ◽  
Mehran Kamkarhaghighi ◽  
Masoud Makrehchi
2020 ◽  
Vol 18 (3) ◽  
pp. 239-248

A parsimonious convolutional neural network (CNN) for text document classification that replicates the ease of use and high classification performance of linear methods is presented. This new CNN architecture can leverage locally trained latent semantic analysis (LSA) word vectors. The architecture is based on parallel 1D convolutional layers with small window sizes, ranging from 1 to 5 words. To test the efficacy of the new CNN architecture, three balanced text datasets known to perform exceedingly well with linear classifiers were evaluated. Three additional imbalanced datasets were also evaluated to gauge the robustness of the LSA vectors and small window sizes. The new CNN architecture, consisting of 1- to 4-grams coupled with LSA word vectors, exceeded the accuracy of all linear classifiers on the balanced datasets with an average improvement of 0.73%. In four of the six datasets, the LSA word vectors provided maximum classification performance on par with or better than that of word2vec vectors in CNNs. Furthermore, in four of the six datasets, the new CNN architecture provided the highest classification performance. Thus, the new CNN architecture and LSA word vectors could serve as a baseline method for text classification tasks.
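As an illustration only (not the authors' code), the sketch below shows one plausible realization of the described pipeline in Python: word vectors taken from a truncated SVD of a locally built term-document matrix, fed into parallel 1D convolutional branches with window sizes of 1 to 4 words. The embedding dimension, filter count, raw-count weighting, and frozen embedding layer are assumptions, not values from the paper.

```python
# Illustrative sketch only, not the authors' implementation.
# Assumed details: 100-dim LSA vectors, 128 filters per branch,
# raw-count term-document matrix, frozen embedding layer.
import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_word_vectors(train_docs, dim=100):
    """Locally trained LSA word vectors: rows of the term factor
    matrix from a truncated SVD of the term-document matrix."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)      # docs x terms
    svd = TruncatedSVD(n_components=dim).fit(X)
    # components_ is dim x terms; transpose to one vector per term
    return vectorizer.vocabulary_, svd.components_.T

def parallel_window_cnn(vocab_size, embed_dim, seq_len, n_classes,
                        embedding_matrix, windows=(1, 2, 3, 4)):
    """Parallel Conv1D branches, one per small window size,
    max-pooled and concatenated before a softmax classifier."""
    inp = tf.keras.Input(shape=(seq_len,))
    emb = tf.keras.layers.Embedding(
        vocab_size, embed_dim, trainable=False,
        embeddings_initializer=tf.keras.initializers.Constant(
            embedding_matrix))(inp)
    pooled = []
    for w in windows:
        conv = tf.keras.layers.Conv1D(128, w, activation="relu")(emb)
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))
    merged = tf.keras.layers.Concatenate()(pooled)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(merged)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then tokenize each document into an integer sequence indexed by the same vocabulary, pad it to seq_len, and fit the model on the class labels.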


2021 ◽  
Vol 11 (13) ◽  
pp. 6113
Author(s):  
Adam Wawrzyński ◽  
Julian Szymański

To effectively process textual data, many approaches have been proposed for creating text representations. Transforming a text into numbers that computers can process is crucial for downstream tasks such as document classification and document summarization. In our work, we study the quality of text representations produced by statistical methods and compare them to approaches based on neural networks. We describe in detail nine different text representation algorithms and then evaluate them on five diverse datasets: BBCSport, BBC, Ohsumed, 20Newsgroups, and Reuters. The selected statistical models include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TFIDF) weighting, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA). For the second group, based on deep neural networks, Partition-Smooth Inverse Frequency (P-SIF), Doc2Vec-Distributed Bag of Words Paragraph Vector (Doc2Vec-DBoW), Doc2Vec-Distributed Memory Model of Paragraph Vectors (Doc2Vec-DM), Hierarchical Attention Network (HAN), and Longformer were selected. The text representation methods were benchmarked on the document classification task, with the BoW and TFIDF models used as a baseline. Based on the identified weaknesses of the HAN method, an improvement in the form of a Hierarchical Weighted Attention Network (HWAN) was proposed. Incorporating statistical features into HAN latent representations improves results, or provides comparable ones, on four out of the five datasets. The article also presents how the length of the processed text affects the results of the HAN model and the HWAN variants.
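As a hedged illustration of how such a benchmark can be run, the sketch below fits the four statistical representations with a linear classifier on 20Newsgroups, one of the five evaluated datasets. The feature sizes, component counts, and the choice of logistic regression are assumptions, not the paper's exact configuration.

```python
# Illustrative benchmark of the statistical representations
# (BoW, TFIDF, LSA, LDA) on 20Newsgroups; settings are assumed,
# not taken from the paper.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

pipelines = {
    # BoW: raw term counts fed directly to the linear classifier
    "BoW":   make_pipeline(CountVectorizer(max_features=20000),
                           LogisticRegression(max_iter=1000)),
    # TFIDF: reweighted counts, the second baseline representation
    "TFIDF": make_pipeline(TfidfVectorizer(max_features=20000),
                           LogisticRegression(max_iter=1000)),
    # LSA: truncated SVD of the TFIDF matrix (300 assumed components)
    "LSA":   make_pipeline(TfidfVectorizer(max_features=20000),
                           TruncatedSVD(n_components=300),
                           LogisticRegression(max_iter=1000)),
    # LDA: documents represented by topic proportions (100 assumed topics)
    "LDA":   make_pipeline(CountVectorizer(max_features=20000),
                           LatentDirichletAllocation(n_components=100),
                           LogisticRegression(max_iter=1000)),
}

for name, pipe in pipelines.items():
    pipe.fit(train.data, train.target)
    print(name, pipe.score(test.data, test.target))  # test accuracy
```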

