An Automatic Text Document Classification using Modified Weight and Semantic Method

Text mining is the process of extracting useful information from structured or unstructured sources. Feature extraction is one of its vital parts. This paper analyses several feature extraction methods and proposes an enhanced method. The Term Frequency-Inverse Document Frequency (TF-IDF) method assigns a weight to a term based only on the term's occurrence. Here it is extended to increase the weight of the most important words and decrease the weight of less important words; the extended method is called M-TF-IDF. Because M-TF-IDF does not consider the semantic similarity between terms, Latent Semantic Analysis (LSA) is used for feature extraction and dimensionality reduction. The performance of the proposed feature extraction methods is analysed on two benchmark datasets, Reuters-21578-R8 and 20 Newsgroups, and two real-time datasets, a descriptive-type answer dataset and a crime news dataset. The proposed method is applied to descriptive-type answer evaluation. Manual evaluation of descriptive-type papers may lead to discrepancies in the marks, which this type of evaluation eliminates. The method has been tested with answers written by learners of our department. It allows more accurate assessment and more effective evaluation of the learning process, with benefits such as reduced time and effort, efficient use of resources, reduced burden on the faculty and increased reliability of results. The proposed method is also used to analyse news documents about Madurai city and its surroundings. Madurai is a sensitive place in the southern part of Tamil Nadu, India. The documents were collected from The Hindu archives and classified as crime or non-crime. The analysis is also used to identify the month in which the crime rate is highest, which can help reduce the crime rate in future.
The Support Vector Machine (SVM) classification algorithm is used to classify the datasets. The experimental analysis and results show that the proposed feature extraction methods outperform the existing feature extraction methods.
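
As a point of reference for the baseline weighting mentioned in this abstract, the following is a minimal sketch of standard TF-IDF in plain Python. It is not the authors' M-TF-IDF, whose modification the abstract does not specify. The function name `tf_idf`, the toy documents, and the choice of relative term frequency with an unsmoothed natural-log IDF are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "police report crime in city".split(),
    "city council opens new park".split(),
    "crime rate falls in city".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document ("city" here) receives weight zero under this variant, while a term unique to one document ("police") receives the highest weight in that document.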

Author(s):  
Htwe Pa Pa Win ◽  
Phyo Thu Thu Khine ◽  
Khin Nwe Ni Tun

This paper proposes a new feature extraction method for off-line recognition of Myanmar printed documents. One of the most important factors in achieving high recognition performance in an Optical Character Recognition (OCR) system is the selection of the feature extraction method. Existing OCR systems use various feature extraction methods because of the diverse natures of the scripts. One major contribution of this work is the design of logically rigorous coding-based features. To show the effectiveness of the proposed method, the paper assumes the documents have been successfully segmented into characters and extracts features from these isolated Myanmar characters using structural analysis of the Myanmar script. Experiments were carried out using a Support Vector Machine (SVM) classifier, and the results are compared with the previously proposed feature extraction method.


2012 ◽  
Vol 532-533 ◽  
pp. 1191-1195 ◽  
Author(s):  
Zhen Yan Liu ◽  
Wei Ping Wang ◽  
Yong Wang

This paper introduces the design of a text categorization system based on the Support Vector Machine (SVM). It analyzes the high-dimensional characteristics of text data and explains why SVM is suitable for text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training and text classification. The core of the system is classifier training, but text representation directly influences the accuracy of the classifier and the performance of the system. The text feature vector space can be built by different kinds of feature selection and feature extraction methods. No research indicates which method is best, so many feature selection and feature extraction methods are implemented in this system. For a specific classification task, every feature selection method and every feature extraction method is tested, and then the set of best methods is adopted.


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write them. Headnotes help the reader quickly determine the issue discussed in a case. A headnote comprises two parts: the first gives the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divide the task into a two-step process. In the first step, we predict the law points used in the judgment using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan and labelled training data based on Pakistan law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without a stemmer. Using active learning, our system can continuously improve its accuracy as users provide more labelled examples.
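
To make the "tri-gram, without stemmer" feature setting concrete, here is a small sketch of tri-gram extraction in plain Python. It assumes word-level tri-grams and whitespace tokenization with lowercasing; the abstract does not say whether the tri-grams are word or character based, and the function name `word_ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Return the word n-grams of a text, with no stemming applied."""
    tokens = text.lower().split()
    # A text shorter than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sentence standing in for a judgment excerpt.
sentence = "the court held that the appeal was dismissed"
features = Counter(word_ngrams(sentence))
```

The resulting counts could then serve as a bag-of-tri-grams input to a linear classifier. Because no stemmer is applied, inflected forms such as "dismissed" and "dismisses" would remain distinct features.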


2020 ◽  
Vol 37 (5) ◽  
pp. 812-822
Author(s):  
Behnam Asghari Beirami ◽  
Mehdi Mokhtarzade

In this paper, a novel feature extraction technique called SuperMNF is proposed, which is an extension of the minimum noise fraction (MNF) transformation. In SuperMNF, each superpixel has its own transformation matrix, and the MNF transformation is performed on each superpixel individually. The basic idea behind SuperMNF is that each superpixel has its own signal and noise covariance matrices, which differ from those of the adjacent superpixels. The extracted features, which carry spatial-spectral content in a lower dimension, are classified by a maximum likelihood classifier and support vector machines. Experiments conducted on two real hyperspectral images, Indian Pines and Pavia University, demonstrate the efficiency of SuperMNF, since it yielded more promising results than some other feature extraction methods (MNF, PCA, SuperPCA, KPCA, and MMP).


2018 ◽  
Vol 5 (4) ◽  
pp. 1-31 ◽  
Author(s):  
Shalini Puri ◽  
Satya Prakash Singh

In recent years, many information retrieval, character recognition, and feature extraction methodologies for Devanagari, and especially for Hindi, have been proposed for different domain areas. Given the enormous availability of scanned data, and to advance existing Hindi automated systems beyond optical character recognition, a new idea for a Hindi printed and handwritten document classification system using a support vector machine and fuzzy logic is introduced. The system first pre-processes textual imaged documents and then classifies them into predefined categories. With this concept, this article presents a feasibility study of such systems with relevance to Hindi, a survey report of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges found in printed and handwritten documents. Technical reviews are provided and graphically represented to compare many parameters and to estimate the contents, forms and classifiers used in various existing techniques.


Author(s):  
Abu Saleh Md Tayeen ◽  
Saleem Masadeh ◽  
Abderrahmen Mtibaa ◽  
Satyajayant Misra ◽  
Moumita Choudhury

2014 ◽  
Vol 905 ◽  
pp. 528-532
Author(s):  
Hoan Manh Dau ◽  
Ning Xu

Text document classification is the task of analysing the content of a text document and then deciding (or predicting) which of a given set of groups the document belongs to. There are many classification techniques, such as Naive Bayes, decision trees, k-Nearest Neighbor (KNN), neural networks and the Support Vector Machine (SVM). Among these techniques, SVM is considered a popular and powerful one, and it is especially suitable for classifying huge, multidimensional data. Text documents have a very large number of dimensions, and selecting features before classifying affects the classification results. The Support Vector Machine is a very effective method in this field. This article studies the Support Vector Machine and applies it to the problem of text document classification. The study shows that the Support Vector Machine with features chosen by singular value decomposition (SVD) performs better than other methods and decision trees.


2019 ◽  
Author(s):  
Mohammad Saleh Refahi ◽  
A. Mir ◽  
Jalal A. Nasiri

Protein fold recognition plays a crucial role in discovering the three-dimensional structure of proteins and protein functions. Several approaches have been employed for the prediction of protein folds. Some of these approaches are based on extracting features from protein sequences and using a strong classifier. Feature extraction techniques generally utilize syntactical-based, evolutionary-based and physicochemical-based information to extract features. In recent years, finding an efficient technique for integrating discriminative features has received increasing attention. In this study, we integrate the Auto-Cross-Covariance (ACC) and Separated dimer (SD) evolutionary feature extraction methods. The resulting features are scored by Information gain (IG) to define and select several discriminative features. On three benchmark datasets, DD, RDD and EDD, the support vector machine (SVM) results show more than 6% improvement in accuracy.
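
The information gain scoring step can be illustrated with a short sketch. It assumes each feature has already been discretized into a small number of values; the abstract does not describe how the continuous ACC and SD features are binned. The function names and toy data are hypothetical. The score is the usual IG = H(Y) - H(Y | X), in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    n = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: two fold classes, two candidate features.
labels = ["a", "a", "b", "b"]
informative = [0, 0, 1, 1]    # separates the classes perfectly
uninformative = [0, 1, 0, 1]  # independent of the class
```

Ranking features by this score and keeping the top few is one common way to select discriminative features before training a classifier. Here `informative` scores 1 bit and `uninformative` scores 0.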


Sensors ◽  
2021 ◽  
Vol 22 (1) ◽  
pp. 112
Author(s):  
Hamada Esmaiel ◽  
Dongri Xie ◽  
Zeyad A. H. Qasem ◽  
Haixin Sun ◽  
Jie Qi ◽  
...  

Due to the complexity and unique features of the hydroacoustic channel, ship-radiated noise (SRN) detected using passive sonar tends to be distorted. SRN feature extraction has been proposed to improve the detected passive sonar signal. Unfortunately, the current methods used in SRN feature extraction have many shortcomings. Considering this, in this paper we propose a new multi-stage feature extraction approach to enhance current SRN feature extraction, based on enhanced variational mode decomposition (EVMD), weighted permutation entropy (WPE), local tangent space alignment (LTSA), and a particle swarm optimization-based support vector machine (PSO-SVM). In the proposed method, we first enhance the decomposition operation of the conventional VMD by decomposing the SRN signal into a finite group of intrinsic mode functions (IMFs) and then calculate the WPE of each IMF. Then, the high-dimensional features obtained are reduced to two-dimensional ones using the LTSA method. Finally, the feature vectors are fed into the PSO-SVM multi-class classifier to classify different types of SRN samples. The simulation and experimental results demonstrate that the recognition rate of the proposed method exceeds that of conventional SRN feature extraction methods, reaching up to 96.6667%.
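
The WPE step can be sketched in a few lines of plain Python. This follows the commonly used definition of weighted permutation entropy, in which each ordinal pattern is weighted by the variance of the window that produced it. The embedding order, delay, normalization by ln(order!), and the function name are assumptions, since the abstract does not give the authors' parameter choices. In the pipeline described above it would be applied to each IMF in turn.

```python
import math
from collections import defaultdict


def weighted_permutation_entropy(signal, order=3, delay=1, normalize=True):
    """Weighted permutation entropy of a 1-D sequence of numbers."""
    span = (order - 1) * delay
    pattern_weight = defaultdict(float)
    for start in range(len(signal) - span):
        window = [signal[start + k * delay] for k in range(order)]
        mean = sum(window) / order
        # Weight of this window: its (population) variance.
        weight = sum((x - mean) ** 2 for x in window) / order
        # Ordinal pattern: indices of the window sorted by value.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        pattern_weight[pattern] += weight
    total = sum(pattern_weight.values())
    if total == 0:
        return 0.0  # constant signal: no variance, no information
    entropy = -sum(
        (w / total) * math.log(w / total)
        for w in pattern_weight.values()
        if w > 0
    )
    if normalize:
        entropy /= math.log(math.factorial(order))
    return entropy


# Hypothetical test signals.
ramp = [1, 2, 3, 4, 5, 6]          # a single ordinal pattern
alternating = [0, 1, 0, 1, 0, 1]   # two equally weighted patterns
```

A monotone ramp contains only one ordinal pattern and so has zero entropy, while a more irregular signal spreads weight over more patterns and scores closer to 1 when normalized.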

