instance selection Latest Research Papers

Strategies for selecting the next data instance to label, in service of generating labeled data for machine learning, have been considered separately in the machine learning literature on active learning and in the visual analytics literature on human-centered approaches. We propose a unified design space for instance selection strategies to support detailed and fine-grained analysis covering both of these perspectives. We identify a concise set of 15 properties, namely measureable characteristics of datasets or of machine learning models applied to them, that cover most of the strategies in these literatures. To quantify these properties, we introduce Property Measures (PM) as fine-grained building blocks that can be used to formalize instance selection strategies. In addition, we present a taxonomy of PMs to support the description, evaluation, and generation of PMs across four dimensions: machine learning (ML) Model Output , Instance Relations , Measure Functionality , and Measure Valence . We also create computational infrastructure to support qualitative visual data analysis: a visual analytics explainer for PMs built around an implementation of PMs using cascades of eight atomic functions. It supports eight analysis tasks, covering the analysis of datasets and ML models using visual comparison within and between PMs and groups of PMs, and over time during the interactive labeling process. We iteratively refined the PM taxonomy, the explainer, and the task abstraction in parallel with each other during a two-year formative process, and show evidence of their utility through a summative evaluation with the same infrastructure. This research builds a formal baseline for the better understanding of the commonalities and differences of instance selection strategies, which can serve as the stepping stone for the synthesis of novel strategies in future work.

Download Full-text

Genetic Programming for Symbolic Regression on Incomplete Data

10.26686/wgtn.17150609.v1 ◽

2021 ◽

Author(s):

◽

Baligh Al-Helali

Keyword(s):

Feature Selection ◽

Transfer Learning ◽

Incomplete Data ◽

Missing Values ◽

Selection Method ◽

Symbolic Regression ◽

Instance Selection ◽

Data Sets ◽

Data Imputation ◽

Regression Methods

Symbolic regression is the process of constructing mathematical expressions that best fit given data sets, where a target variable is expressed in terms of input variables. Unlike traditional regression methods, which optimise the parameters of pre-defined models, symbolic regression learns both the model structure and its parameters simultaneously. Genetic programming (GP) is a biologically-inspired evolutionary algorithm, that automatically generates computer programs to solve a given task. The flexible representation of GP along with its ``white box" nature makes it a dominant method for symbolic regression. Moreover, GP has been successfully employed for different learning tasks such as feature selection and transfer learning. Data incompleteness is a pervasive problem in symbolic regression, and machine learning in general, especially when dealing with real-world data sets. One common approach to handling data missingness is data imputation. Data imputation is the process of estimating missing values based on existing data. Another approach to deal with incomplete data is to build learning algorithms that directly work with missing values. Although a number of methods have been proposed to tackle the data missingness issue in machine learning, most studies focus on classification tasks. Little attention has been paid to symbolic regression on incomplete data. The existing symbolic regression methods are only applicable when the given data set is complete. The overall goal of the thesis is to improve the performance of symbolic regression on incomplete data by using GP for data imputation, instance selection, feature selection, and transfer learning. This thesis develops an imputation method to handle missing values for symbolic regression. The method integrates the instance-based similarity of the k-nearest neighbour method with the feature-based predictability of GP to estimate the missing values. The results show that the proposed method outperforms existing popular imputation methods. This thesis develops an instance selection method for improving imputation for symbolic regression on incomplete data. The proposed method has the ability to simultaneously build imputation and symbolic regression models such that the performance is improved. The results show that involving instance selection with imputation advances the performance of using the imputation alone. High-dimensionality is a serious data challenge, which is even more difficult on incomplete data. To address this problem in symbolic regression tasks, this thesis develops a feature selection method that can select a good set of features directly from incomplete data. The method not only improves the regression accuracy, but also enhances the efficiency of symbolic regression on high-dimensional incomplete data. Another challenging problem is data shortage. This issue is even more challenging when the data is incomplete. To handle this situation, this thesis develops transfer learning methods to improve symbolic regression in domains with incomplete and limited data. These methods utilise two powerful abilities of GP: feature construction and feature selection. The results show the ability of these methods to achieve positive transfer learning from domains with complete data to different (but related) domains with incomplete data. In summary, the thesis develops a range of approaches to improving the effectiveness and efficiency of symbolic regression on incomplete data by developing a number of GP-based methods. The methods are evaluated using different types of data sets considering various missingness and learning scenarios.

Download Full-text

Genetic Programming for Symbolic Regression on Incomplete Data

10.26686/wgtn.17150609 ◽

2021 ◽

Author(s):

◽

Baligh Al-Helali

Keyword(s):

Feature Selection ◽

Transfer Learning ◽

Incomplete Data ◽

Missing Values ◽

Selection Method ◽

Symbolic Regression ◽

Instance Selection ◽

Data Sets ◽

Data Imputation ◽

Regression Methods

Symbolic regression is the process of constructing mathematical expressions that best fit given data sets, where a target variable is expressed in terms of input variables. Unlike traditional regression methods, which optimise the parameters of pre-defined models, symbolic regression learns both the model structure and its parameters simultaneously. Genetic programming (GP) is a biologically-inspired evolutionary algorithm, that automatically generates computer programs to solve a given task. The flexible representation of GP along with its ``white box" nature makes it a dominant method for symbolic regression. Moreover, GP has been successfully employed for different learning tasks such as feature selection and transfer learning. Data incompleteness is a pervasive problem in symbolic regression, and machine learning in general, especially when dealing with real-world data sets. One common approach to handling data missingness is data imputation. Data imputation is the process of estimating missing values based on existing data. Another approach to deal with incomplete data is to build learning algorithms that directly work with missing values. Although a number of methods have been proposed to tackle the data missingness issue in machine learning, most studies focus on classification tasks. Little attention has been paid to symbolic regression on incomplete data. The existing symbolic regression methods are only applicable when the given data set is complete. The overall goal of the thesis is to improve the performance of symbolic regression on incomplete data by using GP for data imputation, instance selection, feature selection, and transfer learning. This thesis develops an imputation method to handle missing values for symbolic regression. The method integrates the instance-based similarity of the k-nearest neighbour method with the feature-based predictability of GP to estimate the missing values. The results show that the proposed method outperforms existing popular imputation methods. This thesis develops an instance selection method for improving imputation for symbolic regression on incomplete data. The proposed method has the ability to simultaneously build imputation and symbolic regression models such that the performance is improved. The results show that involving instance selection with imputation advances the performance of using the imputation alone. High-dimensionality is a serious data challenge, which is even more difficult on incomplete data. To address this problem in symbolic regression tasks, this thesis develops a feature selection method that can select a good set of features directly from incomplete data. The method not only improves the regression accuracy, but also enhances the efficiency of symbolic regression on high-dimensional incomplete data. Another challenging problem is data shortage. This issue is even more challenging when the data is incomplete. To handle this situation, this thesis develops transfer learning methods to improve symbolic regression in domains with incomplete and limited data. These methods utilise two powerful abilities of GP: feature construction and feature selection. The results show the ability of these methods to achieve positive transfer learning from domains with complete data to different (but related) domains with incomplete data. In summary, the thesis develops a range of approaches to improving the effectiveness and efficiency of symbolic regression on incomplete data by developing a number of GP-based methods. The methods are evaluated using different types of data sets considering various missingness and learning scenarios.

Download Full-text

Fuzzy Clustering Decomposition of Genetic Algorithm-based Instance Selection for Regression Problems

Information Sciences ◽

10.1016/j.ins.2021.12.016 ◽

2021 ◽

Author(s):

Mirosław Kordos ◽

Marcin Blachnik ◽

Rafał Scherer

Keyword(s):

Genetic Algorithm ◽

Fuzzy Clustering ◽

Instance Selection ◽

Regression Problems ◽

Selection For

Download Full-text

Instance Selection dengan Naïve Bayes pada Klasifikasi Kanker Serviks

Jurnal Komtika ◽

10.31603/komtika.v5i2.6041 ◽

2021 ◽

Vol 5 (2) ◽

pp. 83-91

Author(s):

Fari Katul Fikriah

Keyword(s):

Decision Tree ◽

Missing Values ◽

Naive Bayes ◽

Healing Process ◽

Logistic Function ◽

Naïve Bayes ◽

Instance Selection ◽

Feature Selection Technique ◽

Deadly Disease ◽

Decision Tree Method

There are several deadly disease for woman, one of which is servical cancer. The growth and development of the disease is very slow, so that treatment if know form the beginning will facilitate the healing process, but conversely unknown cancers from the beginning will become dangereous and deadly disease due to relatively difficult healing. Biopsy action is one way to detect the presence of cancer. In the previous study, classification of cervical cancer had the bighest accuracy value of 97,515% using the decision tree method of several feature selection technique. for this reason, this research will use the decision tree or tree C4.5 classification method, logistic function and zeroR which were previously carried out processing with instance selection with Naïve Bayes by reducing the elimination of missing values with the aim of increasing the level of accuracy better than previous studies. C4.5 classification in this study has the most maximum results compared to other classification methods with an accuracy value of 99,69%.

Download Full-text

A Visual Mining Approach to Improved Multiple-Instance Learning

Algorithms ◽

10.3390/a14120344 ◽

2021 ◽

Vol 14 (12) ◽

pp. 344

Author(s):

Sonia Castelo ◽

Moacir Ponti ◽

Rosane Minghim

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Multiple Instance Learning ◽

Classification Model ◽

Instance Selection ◽

Selection Methods ◽

Training Set ◽

Learning Scenarios ◽

Visualization Techniques ◽

Initial Classification

Multiple-instance learning (MIL) is a paradigm of machine learning that aims to classify a set (bag) of objects (instances), assigning labels only to the bags. This problem is often addressed by selecting an instance to represent each bag, transforming an MIL problem into standard supervised learning. Visualization can be a useful tool to assess learning scenarios by incorporating the users’ knowledge into the classification process. Considering that multiple-instance learning is a paradigm that cannot be handled by current visualization techniques, we propose a multiscale tree-based visualization called MILTree to support MIL problems. The first level of the tree represents the bags, and the second level represents the instances belonging to each bag, allowing users to understand the MIL datasets in an intuitive way. In addition, we propose two new instance selection methods for MIL, which help users improve the model even further. Our methods can handle both binary and multiclass scenarios. In our experiments, SVM was used to build the classifiers. With support of the MILTree layout, the initial classification model was updated by changing the training set, which is composed of the prototype instances. Experimental results validate the effectiveness of our approach, showing that visual mining by MILTree can support exploring and improving models in MIL scenarios and that our instance selection methods outperform the currently available alternatives in most cases.

Download Full-text

Instance Selection Based on Linkage Trees

10.1109/cce53527.2021.9633116 ◽

2021 ◽

Author(s):

Samuel Omar Tovias-Alanis ◽

Wilfrido Gomez-Flores ◽

Gregorio Toscano-Pulido

Keyword(s):

Instance Selection

Download Full-text

ENRICH: Exploiting Image Similarity to Maximize Efficient Machine Learning in Medical Imaging

10.21203/rs.3.rs-1000939/v1 ◽

2021 ◽

Author(s):

Erin Chinn ◽

Rohit Arora ◽

Ramy Arnaout ◽

Rima Arnaout

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Medical Imaging ◽

Image Similarity ◽

Instance Selection ◽

Computationally Efficient ◽

Training Set ◽

Present Evidence ◽

Medical Expertise ◽

Efficient Machine

Abstract Deep learning (DL) requires labeled data. Labeling medical images requires medical expertise, which is often a bottleneck. It is therefore useful to prioritize labeling those images that are most likely to improve a model's performance, a practice known as instance selection. Here we introduce ENRICH, a method that selects images for labeling based on how much novelty each image adds to the growing training set. In our implementation, we use cosine similarity between autoencoder embeddings to measure that novelty. We show that ENRICH achieves nearly maximal performance on classification and segmentation tasks using only a fraction of available images, and outperforms the default practice of selecting images at random. We also present evidence that instance selection may perform categorically better on medical vs. non-medical imaging tasks. In conclusion, ENRICH is a simple, computationally efficient method for prioritizing images for expert labeling for DL.

Download Full-text

instance selection
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Three-objective constrained evolutionary instance selection for classification: Wrapper and filter approaches

Ensemble Learning Based Collaborative Filtering with Instance Selection and Enhanced Clustering

A Taxonomy of Property Measures to Unify Active Learning and Human-centered Approaches to Data Labeling

Genetic Programming for Symbolic Regression on Incomplete Data

Genetic Programming for Symbolic Regression on Incomplete Data

Fuzzy Clustering Decomposition of Genetic Algorithm-based Instance Selection for Regression Problems

Instance Selection dengan Naïve Bayes pada Klasifikasi Kanker Serviks

A Visual Mining Approach to Improved Multiple-Instance Learning

Instance Selection Based on Linkage Trees

ENRICH: Exploiting Image Similarity to Maximize Efficient Machine Learning in Medical Imaging

Export Citation Format

instance selectionRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Three-objective constrained evolutionary instance selection for classification: Wrapper and filter approaches

Ensemble Learning Based Collaborative Filtering with Instance Selection and Enhanced Clustering

A Taxonomy of Property Measures to Unify Active Learning and Human-centered Approaches to Data Labeling

Genetic Programming for Symbolic Regression on Incomplete Data

Genetic Programming for Symbolic Regression on Incomplete Data

Fuzzy Clustering Decomposition of Genetic Algorithm-based Instance Selection for Regression Problems

Instance Selection dengan Naïve Bayes pada Klasifikasi Kanker Serviks

A Visual Mining Approach to Improved Multiple-Instance Learning

Instance Selection Based on Linkage Trees

ENRICH: Exploiting Image Similarity to Maximize Efficient Machine Learning in Medical Imaging

instance selection
Recently Published Documents