Comparison and Improvements of Feature Extraction Methods for Text Categorization

This paper introduces the design of a text categorization system based on the Support Vector Machine (SVM). It analyzes the high-dimensional nature of text data and explains why SVM is suitable for text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training, and text classification. Classifier training is the core of the system, but text representation directly influences the accuracy of the classifier and the performance of the system. The text feature vector space can be built with different feature selection and feature extraction methods. Because no research indicates which method is best, the system implements many feature selection and feature extraction methods. For a specific classification task, every feature selection method and every feature extraction method is tested, and the best-performing set of methods is adopted.

Headnote Prediction Using Machine Learning

The International Arab Journal of Information Technology ◽

10.34028/iajit/18/5/7 ◽

2021 ◽

Vol 18 (5) ◽

Author(s):

Sarmad Mahar ◽

Sahar Zafar ◽

Kamran Nishat

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Active Learning ◽

Text Classification ◽

Extraction Methods ◽

Text Summarization ◽

Training Data ◽

Second Step ◽

Support Vector ◽

Classification Algorithms

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write them. Headnotes help the reader quickly determine the issue discussed in a case. A headnote comprises two parts: the first states the topic discussed in the judgment, and the second summarizes the judgment. In this thesis, we design, develop, and evaluate headnote prediction using machine learning, without human involvement. We divide the task into a two-step process. In the first step, we predict the law points used in the judgment with text classification algorithms. In the second step, we generate a summary of the judgment with text summarization techniques. To this end, we created a databank by extracting data from different law sources in Pakistan and labelled training data derived from Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without a stemmer. With active learning, the system can continuously improve its accuracy as its users provide more labelled examples.
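As a rough illustration of the "tri-gram, no stemmer" feature setting mentioned in this abstract, the sketch below counts word n-grams up to length three from raw tokens. It is not the authors' implementation: the function name `word_ngrams`, the tokenization rule, and the choice to include uni-grams and bi-grams alongside tri-grams are all assumptions made for the example.

```python
import re
from collections import Counter


def word_ngrams(text, n_max=3):
    """Count word n-grams for n = 1..n_max.

    Tokens are lowercased but deliberately not stemmed (assumed setup).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sentence, used only to show the shape of the features.
features = word_ngrams("The court dismissed the appeal")
```

Counts of this kind would normally be turned into a sparse vector per judgment before being passed to a linear classifier.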

Applied-Information Technology with Distributed Text Feature Extraction Method Based on MapReduce

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.1046.444 ◽

2014 ◽

Vol 1046 ◽

pp. 444-448 ◽

Cited By ~ 1

Author(s):

Lu Chen ◽

Tao Zhang ◽

Yuan Yuan Ma ◽

Cheng Zhou

Keyword(s):

Information Technology ◽

Feature Extraction ◽

Text Classification ◽

Extraction Method ◽

Text Processing ◽

Rapid Development ◽

Internet Technology ◽

Feature Extraction Method ◽

Computing Model ◽

Text Feature

With the rapid development of Internet and information technology, large volumes of document data have emerged, and text classification techniques for handling massive amounts of data are becoming increasingly important. This paper presents a distributed text feature extraction method based on the MapReduce distributed computing model. The method addresses the text size limits and inadequate performance encountered in mass text processing, and offers a new way of thinking for research on text feature extraction methods.
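The abstract does not say which statistics the method computes, so the following is only a minimal single-process sketch of the MapReduce pattern applied to one common text statistic, document frequency. The function names (`map_phase`, `shuffle`, `reduce_phase`, `document_frequency`) and the sample documents are hypothetical; a real deployment would run the map and reduce steps on a cluster framework rather than in one Python process.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit (term, 1) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, 1


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, values):
    """Reduce step: sum the counts to get the term's document frequency."""
    return term, sum(values)


def document_frequency(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, values) for term, values in shuffle(pairs).items())


# Hypothetical two-document corpus.
docs = {"d1": "svm text svm", "d2": "text cluster"}
df = document_frequency(docs)
```

Because each document is mapped independently, the map step can be split across machines, which is what removes the single-machine limit on corpus size.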

Efficient text feature extraction by integrating the average linkage and K-medoids clustering

Modern Physics Letters B ◽

10.1142/s0217984921501517 ◽

2021 ◽

pp. 2150151

Author(s):

Dasong Sun

Keyword(s):

Feature Extraction ◽

Text Classification ◽

Experimental Results ◽

The Other ◽

Central Feature ◽

Number Of Clusters ◽

Average Linkage ◽

Text Feature

Clustering feature words not only reduces the dimension of feature subsets but also eliminates feature redundancy. However, for a feature set with very large dimensions, it is difficult to accurately estimate the value of K for the traditional K-medoids algorithm. Moreover, the clustering results of the average linkage (AL) algorithm cannot be divided again, and the AL algorithm cannot be used directly for text classification. To overcome the limitations of AL and K-medoids, in this paper we combine the two algorithms so that they complement each other. In particular, to serve the purpose of text classification, we improve the AL algorithm and propose the [Formula: see text] testing statistics to obtain the approximate number of clusters. Finally, the central feature words are preserved, and the other feature words are deleted. The experimental results show that the new algorithm largely eliminates feature redundancy. Compared with traditional TF-IDF algorithms, the new algorithm improves text classification performance.
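The final step described here, keeping the central feature word of each cluster and dropping the rest, can be sketched as a medoid selection. The example assumes the clusters are already given and does not reproduce the improved AL algorithm or the testing statistic; the word vectors, the Euclidean distance, and the names `medoid` and `euclid` are illustrative assumptions.

```python
def medoid(cluster, dist):
    """Return the member with the smallest total distance to all members."""
    return min(cluster, key=lambda w: sum(dist(w, other) for other in cluster))


# Hypothetical 2-D word vectors, used only for this example.
vectors = {
    "car": (1.0, 0.0), "auto": (0.9, 0.1), "truck": (0.7, 0.3),
    "bank": (0.0, 1.0), "loan": (0.1, 0.9), "credit": (0.3, 0.7),
}


def euclid(a, b):
    """Euclidean distance between the vectors of two words."""
    return sum((x - y) ** 2 for x, y in zip(vectors[a], vectors[b])) ** 0.5


# Assume an earlier clustering step produced these two clusters.
clusters = [["car", "auto", "truck"], ["bank", "loan", "credit"]]
kept = [medoid(cluster, euclid) for cluster in clusters]
```

Here six feature words are reduced to two representatives, one per cluster.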

Research on Digital Forensics Based on Uyghur Web Text Classification

Cyber Warfare and Terrorism ◽

10.4018/978-1-7998-2466-4.ch093 ◽

2020 ◽

pp. 1586-1597

Author(s):

Yasen Aizezi ◽

Anwar Jamal ◽

Ruxianguli Abudurexiti ◽

Mutalipu Muming

Keyword(s):

Mutual Information ◽

Text Classification ◽

Text Categorization ◽

Digital Forensics ◽

Feature Space ◽

Experimental Result ◽

Support Vector ◽

Web Documents ◽

Normalized Mutual Information ◽

Plain Text

This paper discusses the use of mutual information (MI) and Support Vector Machines (SVMs) for Uyghur Web text classification and for the digital forensics process of web text categorization: automatic classification and identification, and the conversion and pretreatment of plain text based on the encoding features of existing Uyghur Web documents. It also introduces the preparatory work for Uyghur Web text encoding. Focusing on filtering non-Uyghur characters and stop words from web texts, we put forward a Multi-feature Space Normalized Mutual Information (M-FNMI) algorithm, which replaces the MI between a single feature and a category with the MI between a combination of input features and a category so as to extract more accurate feature words. Finally, we classify the features with the SVM algorithm. The experimental results show that this scheme achieves high classification precision and can provide criteria for digital forensics with a specific purpose.
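For orientation, the sketch below computes the standard mutual information between a single term and a single category, which is the baseline quantity the abstract says M-FNMI replaces. It does not implement M-FNMI itself. The function name, the binary presence/absence formulation, and the toy documents are assumptions for the example.

```python
import math


def mutual_information(docs, term, category):
    """MI in bits between 'term is present' and 'document is in category'.

    docs is a list of (set_of_terms, label) pairs.
    """
    n = len(docs)
    mi = 0.0
    for t_val in (True, False):
        for c_val in (True, False):
            joint = sum(
                1 for terms, label in docs
                if (term in terms) == t_val and (label == category) == c_val
            ) / n
            p_t = sum(1 for terms, _ in docs if (term in terms) == t_val) / n
            p_c = sum(1 for _, label in docs if (label == category) == c_val) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_t * p_c))
    return mi


# Hypothetical labelled documents.
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"bank", "loan"}, "finance"),
    ({"bank", "team"}, "finance"),
]
mi_goal = mutual_information(docs, "goal", "sport")
mi_team = mutual_information(docs, "team", "sport")
```

A term that occurs only in one category ("goal") scores high, while a term spread evenly across categories ("team") scores zero, so ranking by this score keeps the discriminative words.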

Research on Digital Forensics Based on Uyghur Web Text Classification

Digital Forensics and Forensic Investigations ◽

10.4018/978-1-7998-3025-2.ch032 ◽

2020 ◽

pp. 485-496

Author(s):

Yasen Aizezi ◽

Anwar Jamal ◽

Ruxianguli Abudurexiti ◽

Mutalipu Muming

Keyword(s):

Mutual Information ◽

Text Classification ◽

Text Categorization ◽

Digital Forensics ◽

Feature Space ◽

Experimental Result ◽

Support Vector ◽

Web Documents ◽

Normalized Mutual Information ◽

Plain Text

This paper discusses the use of mutual information (MI) and Support Vector Machines (SVMs) for Uyghur Web text classification and for the digital forensics process of web text categorization: automatic classification and identification, and the conversion and pretreatment of plain text based on the encoding features of existing Uyghur Web documents. It also introduces the preparatory work for Uyghur Web text encoding. Focusing on filtering non-Uyghur characters and stop words from web texts, we put forward a Multi-feature Space Normalized Mutual Information (M-FNMI) algorithm, which replaces the MI between a single feature and a category with the MI between a combination of input features and a category so as to extract more accurate feature words. Finally, we classify the features with the SVM algorithm. The experimental results show that this scheme achieves high classification precision and can provide criteria for digital forensics with a specific purpose.

FEATURE EXTRACTION BASED ON DIRECT CALCULATION OF MUTUAL INFORMATION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001407005892 ◽

2007 ◽

Vol 21 (07) ◽

pp. 1213-1231 ◽

Cited By ~ 9

Author(s):

NOJUN KWAK

Keyword(s):

Feature Extraction ◽

Mutual Information ◽

Direct Calculation ◽

Extraction Methods ◽

Descent Method ◽

Gradient Descent Method ◽

Probability Density Estimation ◽

Classification Problems ◽

Feature Extraction Method ◽

Window Method

In many pattern recognition problems, it is desirable to reduce the number of input features by extracting the important features related to the problem. By focusing only on the problem-relevant features, the dimension of the feature space can be greatly reduced, which can result in better generalization performance with less computational complexity. In this paper, we propose a feature extraction method for handling classification problems. The proposed algorithm searches for a set of linear combinations of the original features whose mutual information with the output class is maximized. The mutual information between the extracted features and the output class is calculated using probability density estimation based on the Parzen window method. A greedy algorithm using the gradient descent method is used to determine the new features. The computational load is proportional to the square of the number of samples. The proposed method was applied to several classification problems, where it performed better than or comparably to conventional feature extraction methods.
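The Parzen window estimate that underlies the mutual information calculation can be sketched in one dimension as follows. This is only the density estimation step with a Gaussian kernel, not the paper's feature extraction algorithm; the kernel choice, the window width `h`, and the sample values are assumptions for the example.

```python
import math


def parzen_density(x, samples, h):
    """Gaussian-kernel Parzen window estimate of a 1-D density at x."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


# Hypothetical samples; the density is evaluated at every sample point,
# so the loop below performs len(samples) ** 2 kernel evaluations.
samples = [-1.0, 0.0, 1.0]
densities = [parzen_density(s, samples, h=0.5) for s in samples]
```

Evaluating the estimate at each of the n samples requires n kernel evaluations per point, which is consistent with the quadratic computational load stated in the abstract.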

An Improved Method of Short Text Feature Extraction Based on Words Co-Occurrence

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.519-520.842 ◽

2014 ◽

Vol 519-520 ◽

pp. 842-845 ◽

Cited By ~ 1

Author(s):

Li Hong Wang

Keyword(s):

Feature Extraction ◽

Chinese Text ◽

Traditional Method ◽

Low Frequency ◽

Text Clustering ◽

Improved Method ◽

Short Text ◽

Text Feature ◽

Short Text Clustering

In Chinese text clustering, short text differs greatly from traditional long text, principally in the low frequency of words. As a result, traditional text feature extraction and weight calculation methods are not directly suitable for short text clustering. To solve the problem of clustering drift in short text segments, this paper proposes a feature extraction method that improves the weight calculation based on word co-occurrence. Experiments show that the method achieves better performance in Chinese short-text clustering than the traditional TF-IDF method.
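To see why low word frequency is a problem for the TF-IDF baseline, consider the sketch below. It implements plain TF-IDF only, not the co-occurrence-based weighting the paper proposes, and it uses hypothetical English tokens rather than segmented Chinese text; the function name `tf_idf` and the sample texts are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Plain TF-IDF. docs is a list of token lists; returns one dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return weights


# Hypothetical short texts in which every word occurs exactly once.
short_texts = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "refund"],
    ["hotel", "deal", "refund"],
]
weights = tf_idf(short_texts)
```

Since each word appears once, the term-frequency factor is the same for every word in a text, and the weights differ only through the inverse document frequency. That loss of within-document signal is the gap a co-occurrence-based weight is meant to fill.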

Survey Paper on Feature Extraction Methods in Text Categorization

International Journal of Computer Applications ◽

10.5120/ijca2017914145 ◽

2017 ◽

Vol 166 (11) ◽

pp. 11-17 ◽

Cited By ~ 4

Author(s):

Dixa Saxena ◽

S. K. ◽

K. N.

Keyword(s):

Feature Extraction ◽

Text Categorization ◽

Extraction Methods ◽

Survey Paper

Web Text Categorization Based on Statistical Merging Algorithm in Big Data Environment

International Journal of Ambient Computing and Intelligence ◽

10.4018/ijaci.2019070102 ◽

2019 ◽

Vol 10 (3) ◽

pp. 17-32 ◽

Cited By ~ 14

Author(s):

Rujuan Wang ◽

Gang Wang

Keyword(s):

Complex Network ◽

Text Classification ◽

Text Categorization ◽

Large Scale ◽

Feature Selection Method ◽

Point Of View ◽

Classification Method ◽

Data Sampling ◽

Modern Information Technology ◽

Text Feature

In the field of modern information technology, how to find the information users really need quickly, accurately, and comprehensively has become a focus of research. In this article, a feature selection method based on a complex network is proposed for the structure and content characteristics of large-scale web text information. The preprocessed web text is converted into a complex network in which the nodes correspond to the entries in the text and the edges correspond to the links between those entries; the degree of the nodes and the aggregation system are then used. Second, text classification is studied from the point of view of data sampling, and a text classification method based on density statistics is proposed. This method uses not only the density information of the text feature set in the classification process but also statistical merging criteria to obtain the difference information of each feature, and it has a better classification effect for large text collections.
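The network construction described here can be sketched as follows, under one loudly stated assumption: that two entries are "linked" when they occur in the same sentence. The abstract does not define the link, so this rule, the function names, and the sample sentences are hypothetical, and only node degree is used for ranking.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(sentences):
    """Undirected graph: terms are nodes, co-occurrence in a sentence is an edge."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        terms = set(sentence.lower().split())
        for a, b in combinations(sorted(terms), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def top_terms_by_degree(neighbours, k):
    """Rank terms by node degree, breaking ties alphabetically."""
    ranked = sorted(neighbours, key=lambda t: (-len(neighbours[t]), t))
    return ranked[:k]


# Hypothetical preprocessed sentences.
sentences = ["network text feature", "text classification", "text density"]
graph = build_term_network(sentences)
selected = top_terms_by_degree(graph, 2)
```

Terms connected to many other terms act as hubs of the network, which is one plausible reason for using node degree as a feature selection signal.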
