Efficient multi-cluster feature selection on text data

2019 ◽  
Vol 40 (8) ◽  
pp. 1583-1598
Author(s):  
Ananya Gupta ◽  
Shahin Ara Begum
2018 ◽  
Vol 07 (01) ◽  
pp. 1750015
Author(s):  
Bingqing Lin ◽  
Zhen Pang ◽  
Qihua Wang

This paper is concerned with variable screening when highly correlated variables exist in high-dimensional linear models. We propose a novel cluster feature selection (CFS) procedure that combines the elastic net with linear correlation variable screening to obtain the benefits of both methods. When calculating the correlation between the predictors and the response, we consider groups of highly correlated predictors instead of individual predictors, in contrast to the usual linear correlation variable screening. Within each correlated group, we apply the elastic net to select variables and estimate their parameters. This avoids the drawback of LASSO [R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996) 268–288], which can mistakenly eliminate truly relevant variables when they are highly correlated. After applying the CFS procedure, the maximum absolute correlation coefficient between clusters becomes smaller, and any common model selection method, such as sure independence screening (SIS) [J. Fan and J. Lv, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B 70 (2008) 849–911] or LASSO, can be applied to improve the results. Extensive numerical examples, including pure simulation examples and semi-real examples, are conducted to demonstrate the good performance of our procedure.
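
The group-wise screening idea in this abstract can be illustrated with a minimal sketch. It is not the authors' CFS implementation: the abstract does not say how correlated groups are formed or how a group is scored, so the single-linkage grouping rule, the 0.8 threshold, the max-correlation group score, and the function names `correlated_groups` and `screen_groups` are all assumptions made for illustration. The elastic net step within each group is omitted.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlated_groups(columns, threshold=0.8):
    """Group predictors linked by |correlation| > threshold.

    Assumption: single-linkage grouping via union-find; the paper may
    form its clusters differently.
    """
    p = len(columns)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(columns[i], columns[j])) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def screen_groups(columns, y, groups, keep):
    """Rank groups by the largest |correlation| of any member with y.

    Assumption: the max over members is a hypothetical group score.
    """
    scored = [(max(abs(pearson(columns[j], y)) for j in g), g) for g in groups]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [g for _, g in scored[:keep]]


# Toy data: x0 and x1 are nearly collinear and both drive y.
rng = random.Random(0)
n = 200
x0 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [a + 0.1 * rng.gauss(0, 1) for a in x0]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [a + b + 0.5 * rng.gauss(0, 1) for a, b in zip(x0, x1)]

columns = [x0, x1, x2, x3]
groups = correlated_groups(columns, threshold=0.8)
selected = screen_groups(columns, y, groups, keep=1)
print(groups, selected)
```

Scoring a group rather than each predictor keeps the two collinear predictors together, so neither is dropped in favour of the other before a within-group selector is applied.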


Author(s):  
Brahim Ait Benali ◽  
Soukaina Mihi ◽  
Ismail El Bazi ◽  
Nabil Laachfoubi

Many features can be extracted from the massive volume of data of different types now available on social media. The growing demand for multimedia applications has been an essential factor in this regard, particularly in the case of text data. Using the full feature set for each task can be time-consuming and can also hurt performance. Because the number of features is large, it is challenging to find a subset that is useful for a given task. In this paper, we employ a feature selection approach based on a genetic algorithm to identify an optimized feature set. The best combination of features from this set is then used to identify and classify Arabic named entities (NEs) with a support vector machine. Experimental results show that our system reaches state-of-the-art performance for Arabic NER on social media and significantly outperforms previous systems.
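
As a rough illustration of genetic-algorithm feature selection, the sketch below evolves binary feature masks. It is not the authors' system: the abstract gives no operators or parameters, so tournament selection, one-point crossover, bit-flip mutation, elitism, and every numeric setting here are assumptions. In particular, `toy_fitness` is a hypothetical stand-in for the validation score of a classifier trained on the selected features.

```python
import random

N_FEATURES = 12
INFORMATIVE = {0, 3, 5}  # hypothetical: the features that actually help


def toy_fitness(mask):
    """Stand-in for a classifier's validation score (assumption).

    Rewards informative features and penalises the size of the subset.
    """
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


def genetic_select(fitness, n_features, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Evolve binary masks; returns (best_mask, best_initial_fitness)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(m) for m in population)

    for _ in range(generations):
        # Elitism: carry the current best mask over unchanged.
        children = [max(population, key=fitness)[:]]
        while len(children) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            a = max(rng.sample(population, 3), key=fitness)
            b = max(rng.sample(population, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children

    return max(population, key=fitness), initial_best


best_mask, initial_best = genetic_select(toy_fitness, N_FEATURES)
print(best_mask, toy_fitness(best_mask))
```

In a real pipeline the fitness call would train and evaluate the classifier on the masked features, which is usually the dominant cost, so caching fitness values per mask is a common design choice.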


2019 ◽  
Vol 22 (1) ◽  
pp. 44-48
Author(s):  
Syukriyanto Latif

The purpose of this research is to determine the dimension reduction parameter value used in feature selection that improves accuracy and reduces computation time. The system uses text mining, which extracts text data to find information from a set of documents. Word weighting and the term reduction technique Term Frequency Thresholding are used in the feature selection process, while the Naive Bayes algorithm is used in the classification process. Journal abstracts are categorized into three classes: Data Mining (DM), Intelligent Transport System (ITS), and Multimedia (MM). The test and training data together total 150 documents. The best classification results are obtained when the dimension reduction parameter value is 30%. Under that condition, the average accuracy is 87.33% with a computation time of 4 minutes 12 seconds.
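
The pipeline described here, frequency-based term reduction followed by Naive Bayes, can be sketched as follows. Several points are assumptions: the abstract does not say whether the 30% parameter is the share of terms kept or removed (this sketch treats it as the share kept), the classifier is a multinomial Naive Bayes with add-one smoothing, and the six toy documents are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def select_terms(docs, keep_fraction):
    """Keep the most frequent terms.

    Assumption: keep_fraction is the share of the vocabulary retained.
    """
    freq = Counter(tok for doc in docs for tok in doc.split())
    n_keep = max(1, int(len(freq) * keep_fraction))
    ranked = sorted(freq, key=lambda t: (-freq[t], t))  # ties broken by name
    return set(ranked[:n_keep])


def train_nb(docs, labels, vocab):
    """Count documents per class and retained-term occurrences per class."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc.split() if t in vocab)
    return class_docs, term_counts


def predict_nb(model, vocab, doc):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    class_docs, term_counts = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in doc.split():
            if tok in vocab:  # terms removed by thresholding are ignored
                score += math.log((term_counts[label][tok] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, two documents per class.
docs = [
    "mining patterns from data with clustering",
    "data mining association rules data",
    "traffic sensors for vehicle routing traffic",
    "vehicle traffic signal control",
    "video audio streaming compression video",
    "image video retrieval",
]
labels = ["DM", "DM", "ITS", "ITS", "MM", "MM"]

vocab = select_terms(docs, keep_fraction=0.30)
model = train_nb(docs, labels, vocab)
print(sorted(vocab))
print(predict_nb(model, vocab, "traffic control for vehicle"))
```

Shrinking the vocabulary reduces both the number of counts stored per class and the work done per prediction, which is the computation-time trade-off the abstract measures.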


Author(s):  
Ravindra Babu Tallamaraju ◽  
Manas Kirti

With the falling cost of storage devices, increasing amounts of data are being stored and processed to extract intelligence. Classification and clustering have been the two major approaches to generating data abstractions. Over the last few years, text has come to dominate the types of data shared and stored. Sources of such datasets include mobile data, e-commerce, and a wide range of continuously expanding social-networking services. Within each of these sources, the nature of the data differs drastically, from formal language text to Twitter or SMS slang, which calls for different ways of processing the data to produce meaningful summaries. Such summaries can be used effectively for business advantage. Processing such data requires identifying an appropriate set of features for both efficiency and effectiveness. In this chapter, we discuss approaches to text feature selection and present a comparative study.

