Interval-valued fuzzy-rough feature selection in datasets with missing values

With the advent of new technologies in the medical field, large amounts of cancer data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that analyze datasets to extract patterns and hidden knowledge. The mined information can support decision making in diagnostic processes. These techniques can predict future outcomes of certain diseases effectively and can also discover patterns and relationships in complex datasets. In this research, a model for predicting patients' cervical cancer outcomes has been developed from risk patterns in individual medical records and preliminary screening tests. This work presents a decision tree (DT) classification algorithm and shows the advantage of feature selection in the prediction of cervical cancer, using recursive feature elimination for dimensionality reduction to improve the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced. Therefore, SMOTETomek, a combination of under- and oversampling techniques, was employed. A comparative analysis of the proposed model has been performed to show the effect of feature selection and class-imbalance handling on the classifier's accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek achieves better results, with an accuracy of 98%, sensitivity of 100%, and specificity of 97%. The decision tree classifier is shown to perform very well on classification tasks when the features are reduced and the class imbalance problem is addressed.
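
The abstract reports accuracy, sensitivity, and specificity together. A minimal sketch of why all three matter on an imbalanced dataset is given below; the function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration, not taken from the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # share of actual positives that were detected
    specificity = tn / (tn + fp)  # share of actual negatives that were cleared
    return accuracy, sensitivity, specificity


# Hypothetical imbalanced test set: 5 positive and 95 negative cases.
# A classifier that always predicts "negative" finds no positives at all.
acc, sens, spec = binary_metrics(tp=0, fn=5, tn=95, fp=0)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this illustrative case the accuracy is high even though no positive case is found, which is why sensitivity and specificity are usually reported alongside accuracy when the classes are imbalanced.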

Evolutionary Machine Learning for Classification with Incomplete Data

10.26686/wgtn.17072123 ◽

2021 ◽

Author(s):

Cao Truong Tran

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Genetic Programming ◽

Incomplete Data ◽

Missing Values ◽

Machine Learning Techniques ◽

Feature Construction ◽

Classification Algorithms ◽

Learning Techniques ◽

Effectiveness And Efficiency

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be carefully handled because inadequate treatment of missing values will cause large classification errors. Most existing research on classification with incomplete data has focused on improving effectiveness, but has not adequately addressed the efficiency of applying classifiers to unseen instances, which is much more important than the efficiency of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data that can then be used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially during the application stage of classification. Another approach is to build a classifier that can work directly with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach, which also avoids estimating missing values, is to build a set of classifiers from which applicable classifiers are selected for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction, and constructing classifiers.
The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve the input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of classifiers able to work directly with incomplete data. The thesis develops a feature construction method to improve the input space for classification algorithms with incomplete data by proposing interval genetic programming, that is, genetic programming with a set of interval functions. The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.
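
The abstract describes interval genetic programming only as genetic programming with a set of interval functions. The sketch below shows one way such functions could operate on incomplete data. It rests on assumptions that are not stated in the abstract: a missing value is represented by the full observed range of its feature, a known value by a degenerate interval, and the names `Interval`, `to_interval`, and `ranges` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The product interval is bounded by the extreme endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


def to_interval(value, feature_range):
    """Known values become degenerate intervals; missing ones (None) span the
    feature's range. This encoding is an assumption made for illustration."""
    if value is None:
        return Interval(*feature_range)
    return Interval(value, value)


# Hypothetical feature ranges, as if observed on training data.
ranges = {"x1": (0.0, 10.0), "x2": (-1.0, 1.0)}
instance = {"x1": 4.0, "x2": None}  # x2 is missing

x1 = to_interval(instance["x1"], ranges["x1"])
x2 = to_interval(instance["x2"], ranges["x2"])

# Hand-written stand-in for an evolved expression tree: x1 * x2 + x1
output = x1 * x2 + x1
print(output)
```

Under these assumptions an expression can be evaluated on an incomplete instance without an imputation step, which is consistent with the efficiency argument made in the abstract. How the thesis turns an interval-valued output into a class label is not stated here and is left open.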

Feature selection for interval-valued data based on D-S evidence theory

IEEE Access ◽

10.1109/access.2021.3109013 ◽

2021 ◽

pp. 1-1

Author(s):

Yichun Peng ◽

Qinli Zhang

Keyword(s):

Feature Selection ◽

Evidence Theory ◽

Selection For ◽

Interval Valued

Causal Feature Selection with Missing Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488055 ◽

2022 ◽

Vol 16 (4) ◽

pp. 1-24

Author(s):

Kui Yu ◽

Yajing Yang ◽

Wei Ding

Keyword(s):

Feature Selection ◽

Missing Data ◽

Real World ◽

Missing Values ◽

Prediction Models ◽

Causal Structure ◽

Data Imputation ◽

Accurate Data ◽

Unified Framework ◽

Class Variable

Causal feature selection aims at learning the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable captures the local causal structure between the class variable and its MB, and all other features are probabilistically independent of the class variable conditioned on its MB. This enables causal feature selection to identify potential causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remain an open research problem in causal feature selection due to their technical complexity. In this article, we discuss a novel multiple imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework so that the two key components can interact with each other. MB Learning lets Data Imputation operate in a potentially causal feature space to achieve accurate data imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. We further design an enhanced kNN estimator for imputing missing values and use it to instantiate MimMB. In our comprehensive experimental evaluation on synthetic and real-world datasets, the new approach effectively learns the MB of a given variable in a Bayesian network and outperforms rival algorithms.
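
The abstract mentions an enhanced kNN estimator without describing it. As a point of reference, the sketch below shows plain kNN imputation, not the enhanced estimator of the article: a missing entry is filled with the mean of that column over the k nearest rows. The function name `knn_impute`, the distance definition, and the data are hypothetical.

```python
import math


def knn_impute(rows, target_row, target_col, k=2):
    """Fill rows[target_row][target_col] with the mean of that column over the
    k nearest rows, using only columns observed in both rows for the distance."""
    query = rows[target_row]
    candidates = []
    for i, row in enumerate(rows):
        if i == target_row or row[target_col] is None:
            continue  # a neighbour must have the target column observed
        shared = [j for j in range(len(row))
                  if j != target_col and row[j] is not None and query[j] is not None]
        if not shared:
            continue
        # Root mean squared difference over the shared columns.
        dist = math.sqrt(sum((row[j] - query[j]) ** 2 for j in shared) / len(shared))
        candidates.append((dist, row[target_col]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    if not nearest:
        raise ValueError("no usable neighbours")
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical data; None marks a missing value.
data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 50.0],
    [1.0, 2.0, None],  # value to impute
]
imputed = knn_impute(data, target_row=3, target_col=2, k=2)
print(imputed)
```

In the spirit of the framework described above, the columns used for the distance could be restricted to a candidate MB rather than all features; that restriction is not implemented in this sketch.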

DE-Based RBFNs for Classification With Special Attention to Noise Removal and Irrelevant Features

Advances in Computational Intelligence and Robotics - Handbook of Research on Modeling, Analysis, and Application of Nature-Inspired Metaheuristic Algorithms ◽

10.4018/978-1-5225-2857-9.ch011 ◽

2018 ◽

pp. 218-243

Author(s):

Ch. Sanjeev Kumar Dash ◽

Ajit Kumar Behera ◽

Sarat Chandra Nayak

Keyword(s):

Feature Selection ◽

Radial Basis Function ◽

Basis Function ◽

Missing Values ◽

Information Gain ◽

Noise Removal ◽

Network Efficiency ◽

Novel Approach ◽

Radial Basis ◽

Filter Approach

This chapter presents a novel approach for classifying datasets by suitably tuning the parameters of radial basis function networks, with feature selection as an additional cost. Supplying an optimal and relevant set of features to a radial basis function network may greatly enhance its efficiency (in terms of accuracy) while also reducing its size. In this chapter, the authors use information gain theory (a kind of filter approach) to reduce the features and differential evolution to tune the centers and spreads of the radial basis functions. Different feature selection methods, the handling of missing values, and the removal of inconsistency to improve the classification accuracy of the proposed model are emphasized. The proposed approach is validated on a few highly skewed and balanced benchmark datasets retrieved from the University of California, Irvine (UCI) repository. The experimental results encourage further extensive research on highly skewed data.
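
The filter step mentioned in the abstract ranks features by information gain. A minimal sketch for discrete features follows; the function names and the toy data are hypothetical, and the chapter's own preprocessing (for example, how continuous features are handled) is not described in the abstract.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: one feature determines the class, the other does not.
labels = ["yes", "yes", "no", "no"]
features = {
    "informative": ["a", "a", "b", "b"],
    "irrelevant": ["a", "b", "a", "b"],
}
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], labels),
                 reverse=True)
print(ranking)
```

A filter of this kind scores each feature independently of the classifier, so it can be run once before the radial basis function network is tuned.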

Application of interval-valued aggregation to optimization problem of k-NN classifiers for missing values case

Information Sciences ◽

10.1016/j.ins.2019.02.053 ◽

2019 ◽

Vol 486 ◽

pp. 434-449 ◽

Cited By ~ 4

Author(s):

Urszula Bentkowska ◽

Jan G. Bazan ◽

Wojciech Rząsa ◽

Lech Zarȩba

Keyword(s):

Optimization Problem ◽

Missing Values ◽

Interval Valued

Multi-criteria feature selection on cost-sensitive data with missing values

Pattern Recognition ◽

10.1016/j.patcog.2015.09.016 ◽

2016 ◽

Vol 51 ◽

pp. 268-280 ◽

Cited By ~ 13

Author(s):

Wenhao Shu ◽

Hong Shen

Keyword(s):

Feature Selection ◽

Missing Values ◽

Sensitive Data

Estimation of missing values in fuzzy matrices (FM) and interval-valued fuzzy matrices (IVFM)

Life Cycle Reliability and Safety Engineering ◽

10.1007/s41872-020-00116-1 ◽

2020 ◽

Vol 9 (3) ◽

pp. 241-245

Author(s):

D. S. Hooda ◽

M. S. Barak

Keyword(s):

Missing Values ◽

Interval Valued

Inverse Classification Problem of Quantitative Attributes

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.44-47.3538 ◽

2010 ◽

Vol 44-47 ◽

pp. 3538-3542

Author(s):

Ai Guo Li ◽

Xin Zhou ◽

Jiu Long Zhang

Keyword(s):

Feature Selection ◽

Missing Values ◽

Main Idea ◽

Classification Problem ◽

Experimental Results ◽

Classification Algorithms ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Class Label ◽

Training Samples

Most inverse classification algorithms address discrete attributes and cannot deal with quantitative attributes. To overcome this disadvantage, discretization algorithms are applied to inverse classification algorithms. The main idea is as follows: first, a group of feature attributes is selected using a feature selection algorithm; then, the quantitative attributes are discretized using discretization algorithms, and the inverted statistics are constructed on the training samples; finally, the test samples are analyzed. Experimental results on the IRIS and Ecoli datasets show that this method can find the class label effectively and estimate the missing values accurately, with results no worse than those of ISGNN and kNN.
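
The abstract does not say which discretization algorithm is used. As an illustration of the discretization step only, the sketch below applies equal-width binning, a common baseline that is assumed here rather than taken from the paper; the function name and the values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each quantitative value to a bin index in 0..n_bins-1 using bins of
    equal width between the minimum and maximum of the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # constant attribute: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical quantitative attribute values.
measurements = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]
bins = equal_width_bins(measurements, n_bins=3)
print(bins)
```

Once each quantitative attribute is mapped to bin indices in this way, an algorithm designed for discrete attributes can be applied to the result.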
