Genetic Programming-Based Simultaneous Feature Selection and Imputation for Symbolic Regression with Incomplete Data

<p>Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be carefully handled because inadequate treatment of missing values will cause large classification errors. Existing most researchers working on classification with incomplete data focused on improving the effectiveness, but did not adequately address the issue of the efficiency of applying the classifiers to classify unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data which can be then used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially for the application process of classification. Another approach to classification with incomplete data is to build a classifier that can directly work with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach to classification with incomplete data which also avoids estimating missing values is to build a set of classifiers which then is used to select applicable classifiers for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction and constructing classifiers. The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data. The thesis develops a feature construction method to improve input space for classification algorithms with incomplete data by proposing interval genetic programming-genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate, and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.</p>

Download Full-text

Feature Selection to Improve Generalization of Genetic Programming for High-Dimensional Symbolic Regression

IEEE Transactions on Evolutionary Computation ◽

10.1109/tevc.2017.2683489 ◽

2017 ◽

Vol 21 (5) ◽

pp. 792-806 ◽

Cited By ~ 41

Author(s):

Qi Chen ◽

Mengjie Zhang ◽

Bing Xue

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

Symbolic Regression ◽

High Dimensional

Download Full-text

Multi-tree genetic programming for feature construction-based domain adaptation in symbolic regression with incomplete data

Proceedings of the 2020 Genetic and Evolutionary Computation Conference ◽

10.1145/3377930.3390160 ◽

2020 ◽

Cited By ~ 3

Author(s):

Baligh Al-Helali ◽

Qi Chen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Incomplete Data ◽

Domain Adaptation ◽

Symbolic Regression ◽

Feature Construction

Download Full-text

Evolutionary Machine Learning for Classification with Incomplete Data

10.26686/wgtn.17072123.v1 ◽

2021 ◽

Author(s):

◽

Cao Truong Tran

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Genetic Programming ◽

Incomplete Data ◽

Missing Values ◽

Machine Learning Techniques ◽

Feature Construction ◽

Classification Algorithms ◽

Learning Techniques ◽

Effectiveness And Efficiency

<p>Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be carefully handled because inadequate treatment of missing values will cause large classification errors. Existing most researchers working on classification with incomplete data focused on improving the effectiveness, but did not adequately address the issue of the efficiency of applying the classifiers to classify unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data which can be then used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially for the application process of classification. Another approach to classification with incomplete data is to build a classifier that can directly work with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach to classification with incomplete data which also avoids estimating missing values is to build a set of classifiers which then is used to select applicable classifiers for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction and constructing classifiers. The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data. The thesis develops a feature construction method to improve input space for classification algorithms with incomplete data by proposing interval genetic programming-genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate, and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.</p>

Download Full-text

Feature selection to improve generalization of genetic programming for high-dimensional symbolic regression

10.26686/wgtn.14273315 ◽

2021 ◽

Author(s):

Qi Chen ◽

Mengjie Zhang ◽

Bing Xue

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

Feature Selection Method ◽

Selection Method ◽

Symbolic Regression ◽

Superior Performance ◽

High Dimensional ◽

Selection Methods ◽

Personal Use ◽

New Feature

When learning from high-dimensional data for symbolic regression (SR), genetic programming (GP) typically could not generalize well. Feature selection, as a data preprocessing method, can potentially contribute not only to improving the efficiency of learning algorithms but also to enhancing the generalization ability. However, in GP for high-dimensional SR, feature selection before learning is seldom considered. In this paper, we propose a new feature selection method based on permutation to select features for high-dimensional SR using GP. A set of experiments has been conducted to investigate the performance of the proposed method on the generalization of GP for high-dimensional SR. The regression results confirm the superior performance of the proposed method over the other examined feature selection methods. Further analysis indicates that the models evolved by the proposed method are more likely to contain only the truly relevant features and have better interpretability. © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Download Full-text

Genetic Programming with Noise Sensitivity for Imputation Predictor Selection in Symbolic Regression with Incomplete Data

2020 IEEE Congress on Evolutionary Computation (CEC) ◽

10.1109/cec48606.2020.9185526 ◽

2020 ◽

Author(s):

Baligh Al-Helali ◽

Qi Chen ◽

Bing Xue ◽

Mengjie Zhang

Keyword(s):

Genetic Programming ◽

Incomplete Data ◽

Symbolic Regression ◽

Noise Sensitivity

Download Full-text

Improving generalisation of genetic programming for high-dimensional symbolic regression with feature selection

2016 IEEE Congress on Evolutionary Computation (CEC) ◽

10.1109/cec.2016.7744270 ◽

2016 ◽

Cited By ~ 10

Author(s):

Qi Chen ◽

Bing Xue ◽

Ben Niu ◽

Mengjie Zhang

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

Symbolic Regression ◽

High Dimensional

Download Full-text

Feature selection to improve generalization of genetic programming for high-dimensional symbolic regression

10.26686/wgtn.14273315.v1 ◽

2021 ◽

Author(s):

Qi Chen ◽

Mengjie Zhang ◽

Bing Xue

Keyword(s):

Feature Selection ◽

Genetic Programming ◽

Feature Selection Method ◽

Selection Method ◽

Symbolic Regression ◽

Superior Performance ◽

High Dimensional ◽

Selection Methods ◽

Personal Use ◽

New Feature

When learning from high-dimensional data for symbolic regression (SR), genetic programming (GP) typically could not generalize well. Feature selection, as a data preprocessing method, can potentially contribute not only to improving the efficiency of learning algorithms but also to enhancing the generalization ability. However, in GP for high-dimensional SR, feature selection before learning is seldom considered. In this paper, we propose a new feature selection method based on permutation to select features for high-dimensional SR using GP. A set of experiments has been conducted to investigate the performance of the proposed method on the generalization of GP for high-dimensional SR. The regression results confirm the superior performance of the proposed method over the other examined feature selection methods. Further analysis indicates that the models evolved by the proposed method are more likely to contain only the truly relevant features and have better interpretability. © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Download Full-text