A Comparative Study on Feature Selection of Text Categorization for Hidden Markov Models

Author(s):  
Kwan Yi ◽  
Jamshid Beheshti

In document representation for digitized text, feature selection refers to choosing the terms that represent a document and distinguish it from other documents. This study examines different feature selection methods for HMM learning models to explore how they affect model performance, evaluated in the context of a text categorization task.

2012 ◽  
Vol 532-533 ◽  
pp. 1191-1195 ◽  
Author(s):  
Zhen Yan Liu ◽  
Wei Ping Wang ◽  
Yong Wang

This paper introduces the design of a text categorization system based on the Support Vector Machine (SVM). It analyzes the high-dimensional character of text data, which is the reason SVM is suitable for text categorization. The system is constructed according to its data flow and consists of three subsystems: text representation, classifier training, and text classification. The core of the system is classifier training, but text representation directly influences the accuracy of the classifier and the performance of the system. The text feature vector space can be built with different feature selection and feature extraction methods. No research indicates which method is best, so many feature selection and feature extraction methods are implemented in this system. For a specific classification task, every feature selection method and every feature extraction method is tested, and the best-performing set of methods is adopted.
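The "test every method, keep the best" step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two selectors (`by_document_frequency`, `by_class_skew`), the two-class restriction, and the count-overlap classifier are hypothetical stand-ins chosen to keep the example dependency-free. The classifier in particular is not an SVM and is not the system described in the abstract.

```python
from collections import Counter, defaultdict

def by_document_frequency(docs, labels, k):
    """Keep the k terms that occur in the most documents (labels unused)."""
    df = Counter(t for tokens in docs for t in set(tokens))
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {t for t, _ in ranked[:k]}

def by_class_skew(docs, labels, k):
    """Keep the k terms whose document counts differ most between two classes."""
    per_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        per_class[label].update(set(tokens))
    a, b = sorted(per_class)  # assumption: exactly two classes
    vocab = set(per_class[a]) | set(per_class[b])
    skew = {t: abs(per_class[a][t] - per_class[b][t]) for t in vocab}
    ranked = sorted(skew.items(), key=lambda kv: (-kv[1], kv[0]))
    return {t for t, _ in ranked[:k]}

def train(docs, labels, features):
    """Stand-in classifier (not an SVM): per-class counts of selected terms."""
    model = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        model[label].update(t for t in tokens if t in features)
    return model

def predict(model, tokens):
    """Pick the class whose training counts overlap most with the document."""
    return max(sorted(model), key=lambda c: sum(model[c][t] for t in tokens))

def pick_best_selector(selectors, train_set, valid_set, k):
    """Score every candidate selector on held-out data; return the best one."""
    (tr_docs, tr_labels), (va_docs, va_labels) = train_set, valid_set
    scores = {}
    for name, selector in selectors.items():
        model = train(tr_docs, tr_labels, selector(tr_docs, tr_labels, k))
        hits = sum(predict(model, d) == y for d, y in zip(va_docs, va_labels))
        scores[name] = hits / len(va_docs)
    return max(sorted(scores), key=scores.get), scores

# Hypothetical toy data, for illustration only.
train_set = (
    [["the", "match", "goal"], ["the", "team", "goal"], ["the", "team", "match"],
     ["the", "vote", "law"], ["the", "senate", "law"]],
    ["sport", "sport", "sport", "politics", "politics"],
)
valid_set = ([["the", "goal"], ["the", "law"]], ["sport", "politics"])
selectors = {"document_frequency": by_document_frequency, "class_skew": by_class_skew}

best, scores = pick_best_selector(selectors, train_set, valid_set, k=2)
print(best, scores)
```

On this toy data the frequency-based selector keeps the uninformative word "the", so the class-aware selector scores higher on the held-out documents. In a real system the stand-in classifier would be replaced by the trained SVM and the selectors by the methods under test.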


Entropy ◽  
2019 ◽  
Vol 21 (6) ◽  
pp. 602 ◽  
Author(s):  
Jaesung Lee ◽  
Jaegyun Park ◽  
Hae-Cheon Kim ◽  
Dae-Won Kim

Multi-label feature selection is an important task for text categorization. This is because it enables learning algorithms to focus on essential features that foreshadow relevant categories, thereby improving the accuracy of text categorization. Recent studies have considered the hybridization of evolutionary feature wrappers and filters to enhance the evolutionary search process. However, the relative effectiveness of feature subset searches of evolutionary and feature filter operators has not been considered. This results in degenerated final feature subsets. In this paper, we propose a novel hybridization approach based on competition between the operators. This enables the proposed algorithm to apply each operator selectively and modify the feature subset according to its relative effectiveness, unlike conventional methods. The experimental results on 16 text datasets verify that the proposed method is superior to conventional methods.


2015 ◽  
Vol 42 (4) ◽  
pp. 1941-1949 ◽  
Author(s):  
Roberto H.W. Pinheiro ◽  
George D.C. Cavalcanti ◽  
Tsang Ing Ren

Author(s):  
Kwan Yi ◽  
Jamshid Beheshti

The hidden Markov model (HMM) has been successfully used for speech recognition, part-of-speech tagging, and pattern recognition. In this study, we apply the HMM to automatically categorize digital documents into a standard library classification scheme. In the proposed framework, an HMM-based system is viewed as a model to generate a list of words and each document is seen as. . .


2020 ◽  
Vol 163 (3) ◽  
pp. 1267-1285 ◽  
Author(s):  
Jens Kiesel ◽  
Philipp Stanzel ◽  
Harald Kling ◽  
Nicola Fohrer ◽  
Sonja C. Jähnig ◽  
...  

The assessment of climate change and its impact relies on the ensemble of models available and/or sub-selected. However, an assessment of the validity of simulated climate change impacts is not straightforward because historical data is commonly used for bias-adjustment, to select ensemble members, or to define a baseline against which impacts are compared, and there are naturally no observations to evaluate future projections. We hypothesize that historical streamflow observations contain valuable information to investigate practices for the selection of model ensembles. The Danube River at Vienna is used as a case study, with EURO-CORDEX climate simulations driving the COSERO hydrological model. For each selection method, we compare the observed to the simulated streamflow shift from the reference period (1960–1989) to the evaluation period (1990–2014). Comparison against no selection shows that an informed selection of ensemble members improves the quantification of climate change impacts. However, the selection method matters: model selection based on hindcasted climate or streamflow alone is misleading, while methods that maintain the diversity and information content of the full ensemble are favorable. Prior to carrying out climate impact assessments, we propose splitting the long-term historical data and using it to test climate model performance, sub-selection methods, and their agreement in reproducing the indicator of interest, which further provides the expected benchmark for near- and far-future impact assessments. This test is well suited to multi-basin experiments aimed at a better understanding of uncertainty propagation and more universal recommendations regarding uncertainty reduction in hydrological impact studies.
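The comparison of observed and simulated shifts between the two periods can be sketched as follows. This is a minimal illustration, not the study's procedure: the function names, the use of a relative change in mean flow as the indicator, and the constant toy series are all assumptions made for the example.

```python
from statistics import mean

def relative_shift(series, ref=(1960, 1989), ev=(1990, 2014)):
    """Relative change in mean flow from the reference to the evaluation period.

    `series` maps year -> flow value (hypothetical annual means).
    """
    ref_mean = mean(v for y, v in series.items() if ref[0] <= y <= ref[1])
    ev_mean = mean(v for y, v in series.items() if ev[0] <= y <= ev[1])
    return (ev_mean - ref_mean) / ref_mean

def ensemble_shift_error(observed, members):
    """Absolute gap between the observed shift and the ensemble-mean simulated shift."""
    simulated = mean(relative_shift(m) for m in members)
    return abs(simulated - relative_shift(observed))

def flat_series(ref_value, ev_value):
    """Hypothetical toy series: constant flow within each period."""
    return {y: (ref_value if y <= 1989 else ev_value) for y in range(1960, 2015)}

# Toy data, for illustration only (not from the study).
observed = flat_series(100.0, 110.0)
member_a = flat_series(100.0, 120.0)
member_b = flat_series(100.0, 100.0)
member_c = flat_series(100.0, 90.0)

full_error = ensemble_shift_error(observed, [member_a, member_b, member_c])
subset_error = ensemble_shift_error(observed, [member_a, member_b])
print(full_error, subset_error)
```

Each candidate sub-selection would be scored this way against the held-back evaluation period, so that selection methods can be ranked by how closely they reproduce the observed shift.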


Author(s):  
I. GALIANO ◽  
E. SANCHIS ◽  
F. CASACUBERTA ◽  
I. TORRES

The design of current acoustic-phonetic decoders for a specific language involves the selection of an adequate set of sublexical units and a choice of the mathematical framework for modelling the corresponding units. In this work, the baseline chosen for continuous Spanish speech consists of 23 sublexical units that roughly correspond to the 24 Spanish phonemes. The selection of this baseline was based on language phonetic criteria and some experiments with an available speech corpus. Two types of models were chosen for this work: conventional Hidden Markov Models and Inferred Stochastic Regular Grammars. With these two choices we could compare classical Hidden Markov modelling, where the structure of a unit-model is deductively supplied, with Grammatical Inference modelling, where the baseforms of model-units are automatically generated from training samples. The best speaker-independent phone recognition rate was 64% for the first type of modelling and 66% for the second type.


2014 ◽  
Vol 988 ◽  
pp. 511-516 ◽  
Author(s):  
Jin Tao Shi ◽  
Hui Liang Liu ◽  
Yuan Xu ◽  
Jun Feng Yan ◽  
Jian Feng Xu

Machine learning is an important approach in research on Chinese text sentiment categorization, and text feature selection is critical to classification performance. Classical feature selection methods work well across the global set of categories but miss many representative feature words of each individual category. This paper presents an improved information gain method that integrates word frequency and the sentiment degree of feature words into the traditional information gain method. Experiments show that a classifier using this method achieves better classification performance.
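For orientation, the traditional information gain score that the abstract takes as its starting point can be sketched as below. This shows only the classical formulation, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), computed on term presence. It does not include the word-frequency or sentiment-degree terms of the improved method, whose details are not given here. The function names and toy corpus are assumptions for the example.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """Classical information gain of a term, based on its presence in a document."""
    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (with_term if term in tokens else without_term)[label] += 1
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = (
        (n_with / n) * entropy(list(with_term.values()))
        + (n_without / n) * entropy(list(without_term.values()))
    )
    return h_class - h_conditional

# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"good", "film"}, {"good", "plot"}, {"bad", "film"}, {"bad", "plot"}]
labels = ["pos", "pos", "neg", "neg"]

for term in ("good", "film"):
    print(term, information_gain(docs, labels, term))
```

In this toy corpus "good" separates the two classes perfectly and receives the maximum score of 1 bit, while "film" appears equally in both classes and scores 0.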

