Combining Feature Selection and Feature Construction to Improve Concept Learning for High Dimensional Data

Author(s):  
Blaise Hanczar
2021 ◽  
Author(s):  
Binh Tran ◽  
Bing Xue ◽  
Mengjie Zhang

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study to investigate the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases in which overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance. This is a post-peer-review, pre-copyedit version of an article published in 'Memetic Computing'. The final authenticated version is available online at: https://doi.org/10.1007/s12293-015-0173-y. The following terms of use apply: https://www.springer.com/gp/open-access/publication-policies/aam-terms-of-use.
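
The idea that one tree-based GP individual performs both feature construction and implicit feature selection can be illustrated with a minimal sketch. It is not the authors' implementation: the nested-tuple tree encoding, the operator set, and the names `evaluate` and `selected_features` are assumptions made for illustration only. The value of the tree on a sample is the constructed feature, and the original features appearing at its leaves are the implicitly selected ones.

```python
import operator

# Hypothetical encoding (an assumption, not taken from the paper):
# an internal node is a tuple (op, left, right); a leaf is an integer
# index into the original feature vector.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, sample):
    """Return the value of the constructed feature for one sample."""
    if isinstance(tree, int):
        return sample[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, sample), evaluate(right, sample))


def selected_features(tree):
    """Return the set of original feature indices used as leaves."""
    if isinstance(tree, int):
        return {tree}
    _, left, right = tree
    return selected_features(left) | selected_features(right)


# Toy individual: (f3 * f7) + (f0 - f3), applied to an 8-feature sample.
tree = ("+", ("*", 3, 7), ("-", 0, 3))
sample = [2.0, 9.0, 9.0, 0.5, 9.0, 9.0, 9.0, 4.0]

constructed = evaluate(tree, sample)         # one new high-level feature
selected = sorted(selected_features(tree))   # features the tree actually uses
print(constructed, selected)
```

In this toy case only three of the eight original features appear in the tree, which is the sense in which selection is implicit: features that never reach a leaf are dropped without a separate selection step.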


2021 ◽  
Author(s):  
Binh Ngan Tran

<p>More and more high-dimensional data appears in machine learning, especially in classification tasks. With thousands of features, these datasets bring challenges to learning algorithms not only because of the curse of dimensionality but also because of the existence of many irrelevant and redundant features. Therefore, feature selection and feature construction (or feature manipulation for short) are essential techniques in preprocessing these datasets. While feature selection aims to select relevant features, feature construction constructs high-level features from the original ones to better represent the target concept. Both methods can decrease the dimensionality and improve the performance of learning algorithms in terms of classification accuracy and computation time.  Although feature manipulation has been studied for decades, the task on high-dimensional data is still challenging due to the huge search space. Existing methods usually face the problem of stagnation in local optima and/or require high computation time. Evolutionary computation techniques are well known for their global search ability. Particle swarm optimisation (PSO) and genetic programming (GP) have shown promise in feature selection and feature construction, respectively. However, applying these techniques to high-dimensional data usually requires high memory and computation time.  The overall goal of this thesis is to investigate new approaches to using PSO for feature selection and GP for feature construction on high-dimensional classification problems. This thesis focuses on incorporating a variety of strategies into the evolutionary process and developing new PSO and GP representations to improve the effectiveness and efficiency of PSO and GP for feature manipulation on high-dimensional data.  This thesis proposes a new PSO-based feature selection approach to high-dimensional data by incorporating a new local search to balance the global and local search of PSO. 
A hybrid wrapper and filter evaluation method, which can be sped up during the local search, is proposed to help PSO achieve better performance, scalability and robustness on high-dimensional data. The results show that the proposed method significantly outperforms the compared methods in 80% of the cases, with an increase of up to 16% in average accuracy while reducing the number of features by one to two orders of magnitude.  This thesis develops the first PSO-based feature selection via discretisation method that performs both multivariate discretisation and feature selection in a single stage to achieve better solutions than applying these techniques separately in two stages. Two new PSO representations are proposed to evolve cut-points for multiple features simultaneously. The results show that the proposed method selects less than 4.6% of the features in all cases and improves the classification performance by 5% to 23% in most cases.  This thesis proposes the first clustering-based feature construction method to improve the performance of single-tree GP on high-dimensional data. A new feature clustering method is proposed to automatically group similar features into the same group based on a given redundancy level. The results show that, compared with standard GP, the new method can select less than half of the features to construct a new high-level feature that achieves significantly better accuracy in most cases. The combination of the single constructed feature and the selected ones achieves the best performance among different feature sets created from a single tree.  This thesis develops the first class-dependent multiple feature construction method using multi-tree GP for high-dimensional data. A new GP representation and a new filter fitness function that combines two filter measures are proposed to evaluate the whole set of constructed features more effectively and efficiently. 
The results show that in 83% of the cases, with fewer than 10 constructed features, the class-dependent method improves average accuracy by up to 32% over using all of the original thousands of features and by up to 10% over using those constructed by the class-independent method.</p>
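
Grouping similar features according to a given redundancy level, as described in the abstract above, can be sketched as follows. This is a hedged illustration, not the thesis's clustering method: it assumes that similarity is measured by absolute Pearson correlation, that a greedy single pass compares each feature against the first member of every existing cluster, and that `redundancy_level` is a correlation threshold. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def cluster_features(columns, redundancy_level):
    """Greedily group feature columns whose absolute correlation with a
    cluster's first member is at least redundancy_level (assumed rule)."""
    clusters = []  # each cluster is a list of column indices
    for j, column in enumerate(columns):
        for cluster in clusters:
            representative = columns[cluster[0]]
            if abs(pearson(column, representative)) >= redundancy_level:
                cluster.append(j)
                break
        else:
            clusters.append([j])  # no similar cluster found: start a new one
    return clusters


# Toy data: four features (columns) observed on four samples.
columns = [
    [1.0, 2.0, 3.0, 4.0],  # f0
    [2.0, 4.0, 6.0, 8.1],  # f1: almost a multiple of f0
    [4.0, 1.0, 3.0, 2.0],  # f2: weakly related to f0
    [8.0, 6.0, 4.0, 2.0],  # f3: perfectly anti-correlated with f0
]
clusters = cluster_features(columns, redundancy_level=0.9)
print(clusters)
```

Under these assumptions, one feature per cluster could then be offered to GP as a terminal, which shrinks the search space before construction begins. Whether the thesis picks representatives this way is not stated in the abstract.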


2021 ◽  
Vol 26 (1) ◽  
pp. 67-77
Author(s):  
Siva Sankari Subbiah ◽  
Jayakumar Chinnappan

Nowadays, organizations collect huge volumes of data without knowing their usefulness. The fast development of the Internet helps organizations capture data in many different formats through the Internet of Things (IoT), social media and other disparate sources. The dimension of datasets increases day by day at an extraordinary rate, resulting in large-scale datasets with high dimensionality. The present paper reviews the opportunities and challenges of feature selection for processing high-dimensional data with reduced complexity and improved accuracy. In the modern big data world, feature selection plays a significant role in reducing the dimensionality and overfitting of the learning process. Many feature selection methods have been proposed by researchers for obtaining more relevant features, especially from big datasets, which helps to provide accurate learning results without degradation in performance. This paper discusses the importance of feature selection, basic feature selection approaches, centralized and distributed big data processing using Hadoop and Spark, and the challenges of feature selection, and provides a summary of the related research work done by various researchers. The paper concludes that big data analysis with feature selection improves the accuracy of learning.

