An Automatic Text Document Classification using Modified Weight and Semantic Method

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.k2123.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 2608-2622

Keyword(s):

Feature Extraction ◽

Text Mining ◽

Crime Rate ◽

Semantic Analysis ◽

Extraction Methods ◽

Support Vector ◽

Text Document ◽

Use Of Resources ◽

Benchmark Datasets ◽

Text Document Classification

Text mining is the process of extracting useful information from structured or unstructured sources, and feature extraction is one of its vital parts. This paper analyses several feature extraction methods and proposes an enhanced method. The Term Frequency-Inverse Document Frequency (TF-IDF) method assigns a weight to a term based only on the occurrence of that term. Here it is extended to increase the weight of the most important words and decrease the weight of the less important words; the extended method is called M-TF-IDF. Because M-TF-IDF does not consider the semantic similarity between terms, Latent Semantic Analysis (LSA) is used for feature extraction and dimensionality reduction. To analyze the performance of the proposed feature extraction methods, two benchmark datasets (Reuters-21578 R8 and 20 Newsgroups) and two real-time datasets (a descriptive-type answer dataset and a crime news dataset) are used. The proposed method is applied to descriptive-type answer evaluation. Manual evaluation of descriptive-type papers may lead to discrepancies in marks, which this type of evaluation eliminates. The method has been tested with answers written by learners of our department. It allows more accurate assessment and more effective evaluation of the learning process, with benefits such as reduced time and effort, efficient use of resources, reduced burden on the faculty and increased reliability of results. The proposed method is also used to analyze news documents about Madurai city and its surroundings. Madurai is a sensitive place in the southern part of Tamil Nadu, India. The documents were collected from the Hindu archives and classified as crime or non-crime. The analysis is also used to check in which month the crime rate is highest, which could help reduce the crime rate in future.
The Support Vector Machine (SVM) classification algorithm is used to classify the datasets. The experimental analysis and results show that the proposed feature extraction methods outperform the existing feature extraction methods.
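The abstract above does not give the M-TF-IDF formula, so the following is only a minimal sketch of the plain TF-IDF baseline it starts from, written with the Python standard library. The function name `tf_idf`, the relative-frequency form of TF, the unsmoothed logarithmic IDF and the toy documents are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Relative term frequency times unsmoothed log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "police report theft in city".split(),
    "city council opens new park".split(),
    "theft suspect arrested by police".split(),
]
weights = tf_idf(docs)
```

In this toy corpus a term found in a single document, such as "report", receives a higher weight than "city", which occurs in two of the three documents. A modified scheme such as the one the abstract describes would adjust these weights further.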

A Structural Analysis Based Feature Extraction Method for OCR System For Myanmar Printed Document Images

International Journal of Computer Vision and Image Processing ◽

10.4018/ijcvip.2012010102 ◽

2012 ◽

Vol 2 (1) ◽

pp. 16-41 ◽

Cited By ~ 1

Author(s):

Htwe Pa Pa Win ◽

Phyo Thu Thu Khine ◽

Khin Nwe Ni Tun

Keyword(s):

Feature Extraction ◽

Structural Analysis ◽

Character Recognition ◽

Optical Character Recognition ◽

Extraction Method ◽

Recognition Performance ◽

Extraction Methods ◽

Support Vector ◽

Svm Classifier ◽

Feature Extraction Method

This paper proposes a new feature extraction method for off-line recognition of Myanmar printed documents. One of the most important factors in achieving high recognition performance in an Optical Character Recognition (OCR) system is the selection of the feature extraction method. Existing OCR systems use various feature extraction methods because of the diversity of the scripts' natures. One major contribution of this work is the design of logically rigorous coding-based features. To show the effectiveness of the proposed method, the paper assumes that the documents have been successfully segmented into characters and extracts features from these isolated Myanmar characters using structural analysis of the Myanmar script. Experiments were carried out using the Support Vector Machine (SVM) classifier, and the results are compared with the previously proposed feature extraction method.

Design of Text Categorization System Based on SVM

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.532-533.1191 ◽

2012 ◽

Vol 532-533 ◽

pp. 1191-1195 ◽

Cited By ~ 1

Author(s):

Zhen Yan Liu ◽

Wei Ping Wang ◽

Yong Wang

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Text Categorization ◽

Feature Selection Method ◽

Extraction Methods ◽

Support Vector ◽

Text Representation ◽

Text Feature ◽

Categorization System ◽

Classifier Training

This paper introduces the design of a text categorization system based on Support Vector Machine (SVM). It analyzes the high-dimensional characteristic of text data, which is the reason why SVM is suitable for text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training and text classification. The core of the system is classifier training, but text representation directly influences the accuracy of the classifier and the performance of the system. The text feature vector space can be built by different kinds of feature selection and feature extraction methods. No research indicates which method is best, so many feature selection and feature extraction methods are implemented in this system. For a specific classification task, every feature selection method and every feature extraction method is tested, and then the set of best methods is adopted.

Headnote Prediction Using Machine Learning

The International Arab Journal of Information Technology ◽

10.34028/iajit/18/5/7 ◽

2021 ◽

Vol 18 (5) ◽

Author(s):

Sarmad Mahar ◽

Sahar Zafar ◽

Kamran Nishat

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Active Learning ◽

Text Classification ◽

Extraction Methods ◽

Text Summarization ◽

Training Data ◽

Second Step ◽

Support Vector ◽

Classification Algorithms

Headnotes are the precise explanation and summary of legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts: the first part gives the topic discussed in the judgment, and the second part contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divide this task into a two-step process. In the first step, we predict the law points used in the judgment by using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this task, we created a databank by extracting data from different law sources in Pakistan, and we labelled training data generated from Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without a stemmer. Using active learning, our system can continuously improve its accuracy as users of the system provide more labelled examples.
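The abstract above reports its best result with tri-gram features and no stemmer but does not say how the tri-grams are built. As a hedged illustration, the sketch below assumes word-level tri-grams over lowercased, whitespace-split tokens; the function name `word_ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Return counts of word n-grams; tokens are lowercased but not stemmed."""
    tokens = text.lower().split()
    # Zip n shifted copies of the token list to form sliding windows of length n.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sentence; each distinct tri-gram becomes one feature with its count.
features = word_ngrams(
    "The appeal was dismissed and the appeal was dismissed with costs"
)
```

Because no stemming is applied, inflected forms such as "dismissed" and "dismiss" would remain distinct features under this assumption.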

Superpixel-Based Minimum Noise Fraction Feature Extraction for Classification of Hyperspectral Images

Traitement du signal ◽

10.18280/ts.370514 ◽

2020 ◽

Vol 37 (5) ◽

pp. 812-822

Author(s):

Behnam Asghari Beirami ◽

Mehdi Mokhtarzade

Keyword(s):

Feature Extraction ◽

Extraction Methods ◽

Hyperspectral Images ◽

Support Vector ◽

Minimum Noise Fraction ◽

Vector Machines ◽

Noise Covariance ◽

Noise Fraction ◽

Minimum Noise

In this paper, a novel feature extraction technique called SuperMNF is proposed, which is an extension of the minimum noise fraction (MNF) transformation. In SuperMNF, each superpixel has its own transformation matrix, and the MNF transformation is performed on each superpixel individually. The basic idea behind SuperMNF is that each superpixel has its own signal and noise covariance matrices, which differ from those of adjacent superpixels. The extracted features, which carry spatial-spectral content in a lower dimension, are classified by a maximum likelihood classifier and support vector machines. Experiments conducted on two real hyperspectral images, Indian Pines and Pavia University, demonstrate the efficiency of SuperMNF, which yielded more promising results than some other feature extraction methods (MNF, PCA, SuperPCA, KPCA, and MMP).

Hindi Text Document Classification System Using SVM and Fuzzy

International Journal of Rough Sets and Data Analysis ◽

10.4018/ijrsda.2018100101 ◽

2018 ◽

Vol 5 (4) ◽

pp. 1-31 ◽

Cited By ~ 8

Author(s):

Shalini Puri ◽

Satya Prakash Singh

Keyword(s):

Classification System ◽

Character Recognition ◽

Optical Character Recognition ◽

Document Classification ◽

Data Availability ◽

Support Vector ◽

Handwritten Documents ◽

Text Document ◽

Survey Report ◽

Text Document Classification

In recent years, many information retrieval, character recognition, and feature extraction methodologies for Devanagari, and especially for Hindi, have been proposed for different domain areas. Given the enormous availability of scanned data, and to advance existing Hindi automated systems beyond optical character recognition, a new Hindi printed and handwritten document classification system using support vector machine and fuzzy logic is introduced. The system first pre-processes and then classifies textual imaged documents into predefined categories. With this concept, this article presents a feasibility study of such systems and their relevance for Hindi, a survey report of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare many parameters and to estimate the contents, forms and classifiers used in various existing techniques.

Comparison of Text Mining Feature Extraction Methods Using Moderated vs Non-Moderated Blogs

Proceedings of the 9th International Conference on Digital Public Health - DPH2019 ◽

10.1145/3357729.3357740 ◽

2019 ◽

Author(s):

Abu Saleh Md Tayeen ◽

Saleem Masadeh ◽

Abderrahmen Mtibaa ◽

Satyajayant Misra ◽

Moumita Choudhury

Keyword(s):

Feature Extraction ◽

Text Mining ◽

Extraction Methods

Text Document Classification Using Support Vector Machine with Feature Selection Using Singular Value Decomposition

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.905.528 ◽

2014 ◽

Vol 905 ◽

pp. 528-532

Author(s):

Hoan Manh Dau ◽

Ning Xu

Keyword(s):

Support Vector Machine ◽

Singular Value Decomposition ◽

Decision Tree ◽

Document Classification ◽

Singular Value ◽

Multidimensional Data ◽

Support Vector ◽

Text Document ◽

Text Document Classification ◽

Value Decomposition

Text document classification is the task of analyzing the content of a text document and then deciding (or predicting) which of the given groups the document belongs to. There are many classification techniques, such as Naive Bayes, decision tree, k-nearest neighbor (KNN), neural network, and Support Vector Machine (SVM). Among these techniques, SVM is considered a popular and powerful one, and it is especially suitable for classifying huge, multidimensional data. Text documents have a very large number of dimensions, and selecting features before classifying affects the classification results. Support Vector Machine is a very effective method in this field. This article studies Support Vector Machine and applies it to the problem of text document classification. The study shows that the Support Vector Machine method with features chosen by singular value decomposition (SVD) performs better than other methods and decision tree.

A novel feature fusion based on the evolutionary features for protein fold recognition using support vector machines

10.1101/845727 ◽

2019 ◽

Author(s):

Mohammad Saleh Refahi ◽

A. Mir ◽

Jalal A. Nasiri

Keyword(s):

Feature Extraction ◽

Information Gain ◽

Fold Recognition ◽

Dimensional Structure ◽

Support Vector ◽

Protein Fold ◽

Protein Fold Recognition ◽

Evolutionary Features ◽

Cross Covariance ◽

Benchmark Datasets

Protein fold recognition plays a crucial role in discovering the three-dimensional structure of proteins and protein functions. Several approaches have been employed for the prediction of protein folds. Some of these approaches are based on extracting features from protein sequences and using a strong classifier. Feature extraction techniques generally utilize syntactical-based, evolutionary-based and physicochemical-based information to extract features. In recent years, finding an efficient technique for integrating discriminative features has received increasing attention. In this study, we integrate the Auto-Cross-Covariance (ACC) and Separated dimer (SD) evolutionary feature extraction methods. The resulting features are scored by Information gain (IG) to define and select several discriminative features. On three benchmark datasets, DD, RDD and EDD, the results of the support vector machine (SVM) show more than 6% improvement in accuracy.
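The abstract above scores features by information gain without giving details. The sketch below shows the textbook definition for a single discrete feature, IG = H(labels) - H(labels | feature), with entropy measured in bits. It assumes features have already been discretized, which the abstract does not state; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """H(labels) minus the expected entropy of labels within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: one feature separates the classes, the other does not.
labels = ["a", "a", "b", "b"]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]
```

Ranking features by this score and keeping the top-scoring ones is one common way to perform the selection step; the number of features kept would be a separate choice.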

Multi-Stage Feature Extraction and Classification for Ship-Radiated Noise

Sensors ◽

10.3390/s22010112 ◽

2021 ◽

Vol 22 (1) ◽

pp. 112

Author(s):

Hamada Esmaiel ◽

Dongri Xie ◽

Zeyad A. H. Qasem ◽

Haixin Sun ◽

Jie Qi ◽

...

Keyword(s):

Feature Extraction ◽

Recognition Rate ◽

Extraction Methods ◽

Permutation Entropy ◽

Support Vector ◽

Intrinsic Mode Functions ◽

Radiated Noise ◽

Passive Sonar ◽

Multi Stage ◽

Mode Decomposition

Due to the complexity and unique features of the hydroacoustic channel, ship-radiated noise (SRN) detected using a passive sonar tends to be distorted. SRN feature extraction has been proposed to improve the detected passive sonar signal. Unfortunately, the current methods used in SRN feature extraction have many shortcomings. Considering this, in this paper we propose a new multi-stage feature extraction approach to enhance current SRN feature extraction, based on enhanced variational mode decomposition (EVMD), weighted permutation entropy (WPE), local tangent space alignment (LTSA), and particle swarm optimization-based support vector machine (PSO-SVM). In the proposed method, we first enhance the decomposition operation of the conventional VMD by decomposing the SRN signal into a finite group of intrinsic mode functions (IMFs) and then calculate the WPE of each IMF. Then, the high-dimensional features obtained are reduced to two-dimensional ones by using the LTSA method. Finally, the feature vectors are fed into the PSO-SVM multi-class classifier to classify different types of SRN samples. The simulation and experimental results demonstrate that the recognition rate of the proposed method exceeds that of conventional SRN feature extraction methods, reaching up to 96.6667%.
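One stage of the pipeline above computes the weighted permutation entropy of each IMF. The sketch below follows the common definition in which each ordinal pattern is weighted by the variance of its embedding window. The embedding order, the delay, the lack of normalization and the function name are assumptions, since the abstract does not specify them, and the decomposition step that would produce the IMFs is not shown.

```python
import math
from collections import defaultdict


def weighted_permutation_entropy(signal, order=3, delay=1):
    """Weighted permutation entropy in bits, using variance-weighted ordinal patterns."""
    pattern_weight = defaultdict(float)
    span = (order - 1) * delay
    for start in range(len(signal) - span):
        window = [signal[start + k * delay] for k in range(order)]
        mean = sum(window) / order
        # Weight of this window: its (population) variance.
        weight = sum((x - mean) ** 2 for x in window) / order
        # Ordinal pattern: window indices sorted by their values.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        pattern_weight[pattern] += weight
    total = sum(pattern_weight.values())
    if total == 0:
        return 0.0  # constant (or too short) signal: treated as zero entropy
    return -sum(
        (w / total) * math.log2(w / total)
        for w in pattern_weight.values()
        if w > 0
    )


# Hypothetical inputs: a ramp has one ordinal pattern, a zigzag alternates between two.
ramp_entropy = weighted_permutation_entropy(list(range(20)))
zigzag_entropy = weighted_permutation_entropy([0, 1] * 5)
```

Under these assumptions, one such value per IMF would form the feature vector that is later reduced in dimension and passed to the classifier.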

Text Document Classification based on Least Square Support Vector Machines with Singular Value Decomposition

International Journal of Computer Applications ◽

10.5120/3312-4540 ◽

2011 ◽

Vol 27 (7) ◽

pp. 21-26 ◽

Cited By ~ 17

Author(s):

M.Ramakrishna Murty ◽

J.V.R Murthy ◽

Prasad Reddy P.V.G.D

Keyword(s):

Support Vector Machines ◽

Singular Value Decomposition ◽

Document Classification ◽

Singular Value ◽

Least Square ◽

Support Vector ◽

Text Document ◽

Vector Machines ◽

Text Document Classification ◽

Value Decomposition
