Document classification using convolutional neural networks with small window sizes and latent semantic analysis

2020 ◽  
Vol 18 (3) ◽  
pp. 239-248
Author(s):  
Eren Gultepe ◽  
Mehran Kamkarhaghighi ◽  
Masoud Makrehchi

A parsimonious convolutional neural network (CNN) for text document classification that replicates the ease of use and high classification performance of linear methods is presented. The new CNN architecture can leverage locally trained latent semantic analysis (LSA) word vectors and is based on parallel 1D convolutional layers with small window sizes, ranging from 1 to 5 words. To test its efficacy, three balanced text datasets on which linear classifiers are known to perform exceedingly well were evaluated. Three additional imbalanced datasets were also evaluated to gauge the robustness of the LSA vectors and small window sizes. The new CNN architecture, consisting of 1- to 4-grams coupled with LSA word vectors, exceeded the accuracy of all linear classifiers on the balanced datasets, with an average improvement of 0.73%. In four of the six datasets, the LSA word vectors provided maximum classification performance on par with or better than word2vec vectors in CNNs. Furthermore, in four of the six datasets, the new CNN architecture provided the highest classification performance. Thus, the new CNN architecture and LSA word vectors could serve as a baseline method for text classification tasks.
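The parallel small-window design described above can be sketched in plain NumPy. The dimensions, filter counts, and random stand-in for LSA embeddings below are illustrative assumptions, not the authors' actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 20-word document, 100-dim LSA word vectors,
# 8 filters per window size. All names and values are illustrative.
seq_len, embed_dim, n_filters = 20, 100, 8
doc = rng.normal(size=(seq_len, embed_dim))  # stand-in for an LSA-embedded text

def conv1d_maxpool(doc, filters):
    """Valid 1D convolution over word positions, then max-over-time pooling."""
    win = filters.shape[1] // doc.shape[1]
    n_pos = doc.shape[0] - win + 1
    out = np.empty((n_pos, filters.shape[0]))
    for i in range(n_pos):
        # Each window of `win` consecutive word vectors is flattened and
        # dotted with every filter, as in a Kim-style text CNN branch.
        out[i] = filters @ doc[i:i + win].ravel()
    return np.maximum(out, 0).max(axis=0)  # ReLU + max pooling

# Parallel branches with window sizes 1..4 words, concatenated into one
# fixed-length feature vector for the classification layer.
features = np.concatenate([
    conv1d_maxpool(doc, rng.normal(size=(n_filters, w * embed_dim)))
    for w in (1, 2, 3, 4)
])
print(features.shape)  # (32,) = 4 window sizes x 8 filters each
```

Each branch sees the same embedded document but a different n-gram width, so the concatenated features capture unigram through 4-gram evidence simultaneously.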

Author(s):  
Emmanuel Buabin

The objective is the design of an intelligent recommender-system classification unit using hybrid neural techniques. In particular, a neuroscience-based hybrid neural model by Buabin (2011a) is introduced, explained, and examined for its potential in real-world text document classification on the ModApte version of the Reuters news text corpus. The described neuroscience model (termed Hy-RNC) is fully integrated with a novel boosting algorithm to improve text document classification. Hy-RNC outperforms existing works and opens up an entirely new research field in the area of machine learning. The main contribution of this book chapter is a step-by-step approach to modeling the hybrid system using underlying concepts such as boosting algorithms, recurrent neural networks, and hybrid neural systems. Results from the experiments show impressive performance by the hybrid neural classifier, even with a minimal number of neurons in its constituent structures.


2018 ◽  
Vol 5 (4) ◽  
pp. 1-31 ◽  
Author(s):  
Shalini Puri ◽  
Satya Prakash Singh

In recent years, many information retrieval, character recognition, and feature extraction methodologies for Devanagari, and especially for Hindi, have been proposed for different domain areas. Given the enormous availability of scanned data, and to advance existing Hindi automated systems beyond optical character recognition, a new Hindi printed and handwritten document classification system using a support vector machine and fuzzy logic is introduced. The system first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, the article presents a feasibility study of such systems for Hindi, a survey of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare parameters and to assess the contents, forms, and classifiers used in various existing techniques.


2014 ◽  
Vol 905 ◽  
pp. 528-532
Author(s):  
Hoan Manh Dau ◽  
Ning Xu

Text document classification is the task of analyzing the content of a text document and then deciding (or predicting) which of a given set of groups the document belongs to. There are many classification techniques, such as Naive Bayes, decision trees, k-nearest neighbor (KNN), neural networks, and the support vector machine (SVM). Among these, SVM is considered popular and powerful; in particular, it is well suited to huge, high-dimensional data. Text documents are characterized by very high dimensionality, and the features selected before classification strongly affect the results. The support vector machine is a very effective method in this field. This article studies the support vector machine and applies it to the problem of text document classification. The study shows that the SVM method with features chosen by singular value decomposition (SVD) outperforms other methods, including decision trees.
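The SVD-based feature reduction step the abstract describes can be sketched as follows. The toy term-document matrix and the nearest-centroid classifier standing in for the SVM are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

# Toy term-document matrix (rows = documents, columns = term counts).
# Purely illustrative data; a real pipeline would use TF-IDF weights.
X = np.array([
    [3, 2, 0, 0],   # class 0: "sports" terms dominate
    [2, 3, 1, 0],   # class 0
    [0, 1, 3, 2],   # class 1: "finance" terms dominate
    [0, 0, 2, 3],   # class 1
], dtype=float)
y = np.array([0, 0, 1, 1])

# Truncated SVD: keep the top-k right singular vectors and project the
# documents into that k-dimensional latent space before classifying.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T            # reduced document representations

# Minimal stand-in classifier: nearest class centroid in the latent space
# (a linear SVM would slot in here in the method the abstract describes).
centroids = np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(np.linalg.norm(Z[:, None] - centroids[None], axis=2), axis=1)
print(pred)
```

The point of the projection is that the classifier then operates on k latent features instead of the full (and typically enormous) vocabulary dimension.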


Text mining is the process of extracting useful information from structured or unstructured sources, and feature extraction is one of its vital parts. This paper analyses several feature extraction methods and proposes an enhanced one. The term frequency-inverse document frequency (TF-IDF) method assigns a weight to a term based only on its occurrence; here it is extended to increase the weight of the most important words and decrease the weight of the less important ones. This extended method is called M-TF-IDF. Because it does not consider semantic similarity between terms, latent semantic analysis (LSA) is additionally used for feature extraction and dimensionality reduction. To analyse the performance of the proposed feature extraction methods, two benchmark datasets, Reuters-21578 R8 and 20 Newsgroups, and two real-world datasets, a descriptive-type answer dataset and a crime news dataset, are used.

The proposed method is applied to descriptive-type answer evaluation. Manual evaluation of descriptive answers may lead to discrepancies in marking, which this form of automated evaluation eliminates. The method has been tested on answers written by learners in our department. It allows more accurate assessment and more effective evaluation of the learning process, with benefits such as reduced time and effort, efficient use of resources, a reduced burden on faculty, and increased reliability of results.

The proposed method is also used to analyse news documents about Madurai and its surroundings. Madurai is a sensitive place in the southern part of Tamil Nadu, India, and the documents were collected from The Hindu archives. Each news document is classified as crime-related or not, and the results are used to determine in which month the crime rate is highest; this analysis can help reduce the crime rate in future. The support vector machine (SVM) algorithm is used to classify the datasets. The experimental analysis and results show that the proposed feature extraction methods outperform existing ones.
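A minimal sketch of TF-IDF and the M-TF-IDF idea follows. The abstract does not give the exact M-TF-IDF formula, so the reweighting rule below (boost terms above the mean weight, attenuate the rest), the toy corpus, and the boost factor are all assumptions for illustration:

```python
import math

# Toy corpus; tokenization is naive whitespace splitting for illustration.
docs = [
    "crime report city police crime",
    "city festival temple city",
    "police arrest crime suspect",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(doc):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    weights = {}
    for w in set(doc):
        tf = doc.count(w) / len(doc)
        df = sum(w in d for d in tokenized)
        weights[w] = tf * math.log(N / df)
    return weights

# Hypothetical M-TF-IDF-style step (the exact formula is not given in the
# abstract): scale up terms above the mean weight, scale down the rest.
def m_tf_idf(doc, boost=1.5):
    w = tf_idf(doc)
    mean = sum(w.values()) / len(w)
    return {t: v * boost if v > mean else v / boost for t, v in w.items()}

print(sorted(m_tf_idf(tokenized[0]).items()))
```

The effect is to widen the gap between discriminative terms and background vocabulary before the weights are fed into LSA or the SVM classifier.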

