An Incremental Classification Algorithm for Mining Data with Feature Space Heterogeneity

2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Yu Wang

Feature space heterogeneity exists in many real-world data sets: some features are of different importance for classification over different subsets of the data. Moreover, the pattern of feature space heterogeneity may change dynamically over time as more data are accumulated. In this paper, we develop an incremental classification algorithm, Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH), to address this problem. In our approach, supervised clustering is implemented to obtain a number of clusters such that the samples in each cluster are from the same class. After the removal of outliers, the relevance of the features in each cluster is calculated based on their variations within the cluster. This feature relevance is then incorporated into the distance calculation used for classification. The main advantage of SCCFSH lies in its ability to solve a classification problem with feature space heterogeneity in an incremental way, which is favorable for online classification tasks with continuously arriving data. Experimental results on a series of data sets and an application to a database marketing problem show the efficiency and effectiveness of the proposed approach.
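
To make the idea concrete, here is a minimal sketch (not the paper's SCCFSH implementation) of relevance-weighted nearest-cluster classification: one cluster per class, feature relevance derived from within-cluster variance, and a weighted distance for prediction. The one-cluster-per-class simplification and all names are ours; the paper's supervised clustering, outlier removal, and incremental updates are omitted.

```python
import numpy as np

def fit_weighted_clusters(X, y, eps=1e-8):
    """One cluster per class; features with low within-cluster
    variance get high relevance (a simplification of SCCFSH)."""
    clusters = []
    for c in np.unique(y):
        Xc = X[y == c]
        center = Xc.mean(axis=0)
        relevance = 1.0 / (Xc.var(axis=0) + eps)   # stable -> relevant
        clusters.append((c, center, relevance / relevance.sum()))
    return clusters

def predict(clusters, x):
    """Class of the nearest cluster under the relevance-weighted
    squared Euclidean distance."""
    return min(((c, np.sum(w * (x - mu) ** 2)) for c, mu, w in clusters),
               key=lambda t: t[1])[0]

X = np.array([[1.0, 0.1], [1.2, 0.0], [5.0, 3.0], [5.1, 2.9]])
y = np.array([0, 0, 1, 1])
print(predict(fit_weighted_clusters(X, y), np.array([1.1, 0.05])))  # -> 0
```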

Kybernetes ◽  
2019 ◽  
Vol 48 (9) ◽  
pp. 2006-2029
Author(s):  
Hongshan Xiao ◽  
Yu Wang

Purpose: Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decisions, credit scoring and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance.

Design/methodology/approach: A measurement is first developed for measuring and identifying any significant heterogeneity that exists in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For data sets with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification.

Findings: The proposed approach has two main advantages over previous methods. The first lies in the feature transform using orthogonal factor analysis, which results in new features without redundancy and irrelevance. The second rests on partitioning samples to capture the feature space heterogeneity reflected by differences in factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmark data sets.

Research limitations/implications: The measurement should be used to guide the heterogeneity elimination process, which is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue.

Practical implications: Measuring and eliminating the feature space heterogeneity possibly existing in the data are important for accurate classification. This study provides a systematic approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques to real-world problems.

Originality/value: A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with the feature space heterogeneity and improve the classification accuracy.
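
The following is a hedged sketch of the factor-analysis-plus-clustering pipeline described above, using scikit-learn stand-ins; the numbers of factors and partitions are illustrative choices, not the paper's, and the heterogeneity measurement itself is omitted.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# factor scores remove redundancy/irrelevance among the raw features
fa = FactorAnalysis(n_components=5, random_state=0).fit(Xtr)
Ztr, Zte = fa.transform(Xtr), fa.transform(Xte)

# partition samples by their factor scores; one local model per part
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Ztr)
models = {}
for k in range(3):
    part_X, part_y = Ztr[km.labels_ == k], ytr[km.labels_ == k]
    if np.unique(part_y).size < 2:      # degenerate single-class part
        models[k] = part_y[0]
    else:
        models[k] = LogisticRegression(max_iter=1000).fit(part_X, part_y)

pred = np.array([models[k].predict(z[None, :])[0]
                 if hasattr(models[k], "predict") else models[k]
                 for k, z in zip(km.predict(Zte), Zte)])
print("accuracy:", (pred == yte).mean())
```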


Author(s):  
Hoda Heidari ◽  
Andreas Krause

We study fairness in sequential decision making environments, where at each time step a learning algorithm receives data corresponding to a new individual (e.g. a new job application) and must make an irrevocable decision about him/her (e.g. whether to hire the applicant) based on observations made so far. In order to prevent cases of disparate treatment, our time-dependent notion of fairness requires algorithmic decisions to be consistent: if two individuals are similar in the feature space and arrive during the same time epoch, the algorithm must assign them to similar outcomes. We propose a general framework for post-processing predictions made by a black-box learning model that guarantees the resulting sequence of outcomes is consistent. We show theoretically that imposing consistency will not significantly slow down learning. Our experiments on two real-world data sets illustrate and confirm this finding in practice.
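
As an illustration only, the toy post-processing below enforces the consistency notion within one epoch by averaging the black-box scores of individuals that are close in feature space; it is a batch approximation, not the paper's sequential, irrevocable post-processing rule, and `radius` is a hypothetical similarity threshold.

```python
import numpy as np

def consistent_scores(epoch_X, raw_scores, radius=0.5):
    """Replace each raw model score by the mean score over all
    individuals within `radius` in feature space, so similar
    individuals in the same epoch receive similar outcomes."""
    X = np.asarray(epoch_X, dtype=float)
    s = np.asarray(raw_scores, dtype=float)
    out = np.empty_like(s)
    for i in range(len(s)):
        near = np.linalg.norm(X - X[i], axis=1) <= radius
        out[i] = s[near].mean()
    return out

X = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
print(consistent_scores(X, [0.9, 0.1, 0.8]))
# the two similar applicants move toward a common score
```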


2016 ◽  
Vol 28 (4) ◽  
pp. 716-742 ◽  
Author(s):  
Saurabh Paul ◽  
Petros Drineas

We introduce single-set spectral sparsification as a deterministic sampling-based feature selection technique for regularized least-squares classification, which is the classification analog to ridge regression. The method is unsupervised and gives worst-case guarantees on the generalization power of the classification function after feature selection with respect to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed design setting and show that the risk in the sampled space is comparable to the risk in the full feature space. We perform experiments on synthetic and real-world data sets (a subset of the TechTC-300 data sets) to support our theory. Experimental results indicate that the proposed methods perform better than existing feature selection methods.
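
A minimal sketch of unsupervised leverage-score sampling for features, assuming the scores are computed from the top-k right singular vectors of the data matrix; sampling without replacement here is a simplification of the rescaled with-replacement scheme usually analyzed.

```python
import numpy as np

def leverage_score_feature_sampling(X, k, seed=None):
    """Draw k feature indices with probabilities proportional to
    their leverage scores in the top-k right singular subspace."""
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k, :]                      # rows of Vt index features
    scores = (V_k ** 2).sum(axis=0)      # leverage score per feature
    probs = scores / scores.sum()
    return rng.choice(X.shape[1], size=k, replace=False, p=probs)

X = np.random.default_rng(0).normal(size=(100, 20))
cols = leverage_score_feature_sampling(X, k=5, seed=0)
X_sampled = X[:, cols]                   # reduced feature space
```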


2020 ◽  
Vol 34 (04) ◽  
pp. 3211-3218
Author(s):  
Liang Bai ◽  
Jiye Liang

Due to the complex structure of real-world data, nonlinearly separable clustering is one of the most popular and widely studied clustering problems. Currently, various types of algorithms, such as kernel k-means, spectral clustering and density clustering, have been developed to solve this problem. However, it is difficult for them to balance the efficiency and effectiveness of clustering, which limits their real-world applications. To overcome this deficiency, we propose a three-level optimization model for nonlinearly separable clustering, which divides the clustering problem into three sub-problems: a linearly separable clustering on the object set, a nonlinearly separable clustering on the cluster set and an ensemble clustering on the partition set. An iterative algorithm is proposed to solve the optimization problem. The proposed algorithm can recognize nonlinearly separable clusters effectively at low computational cost. The performance of this algorithm has been studied on synthetic and real data sets. Comparisons with other nonlinearly separable clustering algorithms illustrate the efficiency and effectiveness of the proposed algorithm.
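
The following is a simplified two-stage stand-in for the three-level idea (the ensemble level on the partition set is omitted): a cheap linear clustering into many micro-clusters, followed by a nonlinear clustering applied only to the micro-cluster centers.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

# level 1: linear clustering of the objects into micro-clusters
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

# level 2: nonlinear clustering on the 50 centers only (cheap)
meta = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                          n_neighbors=5, random_state=0)
center_labels = meta.fit_predict(km.cluster_centers_)

labels = center_labels[km.labels_]       # lift back to the objects
```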


2021 ◽  
Vol 2021 (12) ◽  
pp. 124006
Author(s):  
Zhenyu Liao ◽  
Romain Couillet ◽  
Michael W Mahoney

This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples n, their dimension p, and the dimension of feature space N are all large and comparable. In this regime, the random RFF Gram matrix no longer converges to the well-known limiting Gaussian kernel matrix (as it does when N → ∞ alone), but it still has a tractable behavior that is captured by our analysis. This analysis also provides accurate estimates of training and test regression errors for large n, p, N. Based on these estimates, a precise characterization of two qualitatively different phases of learning, including the phase transition between them, is provided; and the corresponding double descent test error curve is derived from this phase transition behavior. These results do not depend on strong assumptions on the data distribution, and they perfectly match empirical results on real-world data sets.
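
For reference, a minimal sketch of the RFF ridge regression being analyzed: Gaussian-kernel random Fourier features followed by ridge regression in the N-dimensional feature space (all hyperparameters are illustrative).

```python
import numpy as np

def rff_ridge(X, y, N=200, sigma=1.0, lam=1e-2, seed=0):
    """Random Fourier features for the Gaussian kernel
    exp(-||x - y||^2 / (2 sigma^2)), then ridge regression."""
    rng = np.random.default_rng(seed)
    D = N // 2                                   # cos/sin pairs
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    def phi(A):
        Z = A @ W
        return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(D)
    Phi = phi(X)
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2 * D), Phi.T @ y)
    return lambda A: phi(A) @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
model = rff_ridge(X, y)
print(float(np.mean((model(X) - y) ** 2)))       # training error
```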


2021 ◽  
Vol 11 (12) ◽  
pp. 1645
Author(s):  
Sumit K. Vohra ◽  
Dimiter Prodanov

Image segmentation remains an active area of research since no universal solution has been identified. Traditional image segmentation algorithms are problem-specific and limited in scope. Machine learning, on the other hand, offers an alternative paradigm in which predefined features are combined into different classifiers, providing pixel-level classification and segmentation. However, machine learning alone cannot address the question of which features are appropriate for a given classification problem. This article presents an automated image segmentation and classification platform, called Active Segmentation, which is based on ImageJ. The platform integrates expert domain knowledge, providing partial ground truth, with geometrical feature extraction based on multi-scale signal processing combined with machine learning. The approach to image segmentation is exemplified on the ISBI 2012 image segmentation challenge data set. As a second application, we demonstrate whole-image classification functionality based on the same principles, exemplified on the HeLa and HEp-2 data sets. The obtained results indicate that feature space enrichment, properly balanced with feature selection, can achieve performance comparable to deep learning approaches. In summary, differential geometry can substantially improve the outcome of machine learning, since it can enrich the underlying feature space with new geometrically invariant objects.
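
Below is a hedged sketch of the general recipe (not the Active Segmentation plugin itself, which is ImageJ-based): per-pixel multi-scale Gaussian-derivative features feeding a random forest trained on sparse labels. The labels here are synthetic stand-ins for the expert's partial ground truth.

```python
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier

def geometric_features(img, scales=(1, 2, 4)):
    """Per-pixel multi-scale features: smoothed intensity,
    gradient magnitude, and Laplacian at each scale."""
    feats = []
    for s in scales:
        g = ndimage.gaussian_filter(img, s)
        gx = ndimage.gaussian_filter(img, s, order=(0, 1))
        gy = ndimage.gaussian_filter(img, s, order=(1, 0))
        feats += [g, np.hypot(gx, gy), ndimage.gaussian_laplace(img, s)]
    return np.stack(feats, axis=-1).reshape(-1, len(feats))

# train on sparse labels (the "partial ground truth"), then
# classify every pixel of the image
img = np.random.default_rng(0).random((64, 64))
F = geometric_features(img)
labeled = np.random.default_rng(1).choice(F.shape[0], 200, replace=False)
y = (F[labeled, 0] > 0.5).astype(int)    # stand-in expert labels
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(F[labeled], y)
mask = rf.predict(F).reshape(img.shape)  # pixel-level segmentation
```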


2007 ◽  
Vol 19 (7) ◽  
pp. 1919-1938 ◽  
Author(s):  
Jooyoung Park ◽  
Daesung Kang ◽  
Jongho Kim ◽  
James T. Kwok ◽  
Ivor W. Tsang

The support vector data description (SVDD) is one of the best-known one-class support vector learning methods, in which balls defined in the feature space are used to distinguish a set of normal data from all other possible abnormal objects. The main concern of this letter is to extend the central idea of SVDD to pattern denoising. By combining geodesic projection onto the spherical decision boundary resulting from the SVDD with a solution of the preimage problem, we propose a new method for pattern denoising. We first solve the SVDD for the training data; then, for each noisy test pattern, we obtain its denoised feature by moving its feature vector along the geodesic on the manifold to the nearest decision boundary of the SVDD ball. Finally, we find the location of the denoised pattern by obtaining the preimage of the denoised feature. The applicability of the proposed method is illustrated on a number of toy and real-world data sets.
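
A rough input-space analog, for intuition only: with an RBF kernel, SVDD is closely related to the one-class SVM, and here the geodesic-plus-preimage step is replaced by moving the noisy point along a numerical gradient of the decision function until it reaches the boundary. This is our simplification, not the letter's feature-space procedure.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2))
svdd = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(clean)

def denoise(x, lr=0.05, steps=200, eps=1e-4):
    """Push x uphill on the decision function (negative outside the
    ball) until it reaches the decision boundary (f == 0)."""
    x = x.copy()
    for _ in range(steps):
        f = svdd.decision_function(x[None, :])[0]
        if f >= 0:                       # inside or on the ball: stop
            break
        grad = np.array([(svdd.decision_function((x + eps * e)[None, :])[0] - f) / eps
                         for e in np.eye(len(x))])
        x += lr * grad / (np.linalg.norm(grad) + 1e-12)
    return x

noisy = np.array([4.0, 4.0])
print(denoise(noisy))                    # pulled toward the boundary
```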


2011 ◽  
Vol 8 (4) ◽  
pp. 1143-1157 ◽  
Author(s):  
Xinyue Liu ◽  
Xing Yong ◽  
Hongfei Lin

The similarity matrix is critical to the performance of spectral clustering. Mercer kernels have become popular largely due to their successes in kernel methods such as kernel PCA. We propose a novel spectral clustering method based on local neighborhoods in kernel space (SC-LNK), which assumes that each data point can be linearly reconstructed from its neighbors. The SC-LNK algorithm projects the data into a feature space via a Mercer kernel and then learns a sparse matrix, obtained by linear reconstruction, as the similarity graph for spectral clustering. Experiments performed on synthetic and real-world data sets show that spectral clustering based on linear reconstruction in kernel space outperforms conventional spectral clustering and two other algorithms, especially on real-world data sets.
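
A hedged sketch of the similarity construction: LLE-style reconstruction weights computed in an RBF kernel space (via the local Gram matrix of feature-space differences), symmetrized into an affinity for spectral clustering; `k`, `gamma` and the regularization are illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

def kernel_reconstruction_affinity(X, k=10, gamma=1.0, reg=1e-3):
    """Reconstruct each phi(x_i) from its k kernel-space neighbors;
    the (normalized) weights form a sparse similarity graph."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma=gamma)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(-K[i])[1:k + 1]       # k most similar points
        # C[j, l] = <phi_i - phi_j, phi_i - phi_l> in kernel space
        C = (K[i, i] - K[i, nbrs][None, :] - K[i, nbrs][:, None]
             + K[np.ix_(nbrs, nbrs)])
        C += reg * np.trace(C) / k * np.eye(k)  # regularize
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                # weights sum to one
    return np.abs(W + W.T) / 2                  # symmetric affinity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
A = kernel_reconstruction_affinity(X)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
```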


Author(s):  
Tingting Ren ◽  
Xiuyi Jia ◽  
Weiwei Li ◽  
Lei Chen ◽  
Zechao Li

Label distribution learning (LDL) is a novel machine learning paradigm to deal with label ambiguity issues by placing more emphasis on how relevant each label is to a particular instance. Many LDL algorithms have been proposed and most of them concentrate on the learning models, while few of them focus on the feature selection problem. All existing LDL models are built on a simple feature space in which all features are shared by all the class labels. However, this kind of traditional data representation strategy tends to select features that are distinguishable for all labels, but ignores label-specific features that are pertinent and discriminative for each class label. In this paper, we propose a novel LDL algorithm by leveraging label-specific features. The common features for all labels and specific features for each label are simultaneously learned to enhance the LDL model. Moreover, we also exploit the label correlations in the proposed LDL model. The experimental results on several real-world data sets validate the effectiveness of our method.
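
To ground the terminology, here is a minimal plain-LDL sketch (without the paper's label-specific feature learning or label correlations): softmax outputs fitted to target label distributions under the KL divergence, with an L1 penalty as a crude stand-in for per-label feature sparsity.

```python
import numpy as np

def fit_ldl(X, D, lam=0.01, lr=0.1, epochs=500):
    """Softmax regression matched to target distributions D (rows
    sum to 1); the KL gradient w.r.t. the logits is simply P - D."""
    n, p = X.shape
    W = np.zeros((p, D.shape[1]))
    for _ in range(epochs):
        Z = X @ W
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * (X.T @ (P - D) / n + lam * np.sign(W))
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
logits = X[:, :3] @ rng.normal(size=(3, 4))     # only 3 relevant features
D = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
W = fit_ldl(X, D)
# weights on the irrelevant features 3..7 shrink toward zero
print(np.abs(W).mean(axis=1).round(3))
```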


2020 ◽  
Vol 34 (10) ◽  
pp. 13959-13960
Author(s):  
Zheng Wang ◽  
Bruno Abrahao ◽  
Ece Kamar

Given a fixed hypothesis space, defined to model class structure in a particular domain of application, unknown unknowns (u.u.s) are data examples that form classes in the feature space whose structure is not represented in a trained model. This leads to incorrect class predictions made with high confidence, which represents one of the major sources of blind spots in machine learning. Our method seeks to reduce the structural mismatch between the training model and that of the target space in a supervised way. We illuminate further structure through cross-validation on a modified training model, set up to mine and trap u.u.s in a marginal training class created from a random sample of test-set examples. Contrary to previous approaches, our method simplifies the solution, as it does not rely on budgeted queries to an oracle whose outcomes inform adjustments to training. In addition, our empirical results exhibit consistent performance improvements over baselines on both synthetic and real-world data sets.
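
A toy sketch of the trap-class idea under our own assumptions about data and sample sizes: a marginal class built from a random sample of the test set is appended to training, and test points later predicted into that class are flagged as candidate unknown unknowns. The cross-validation refinement described above is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 2))
y_train = (X_train[:, 0] > 0).astype(int)       # known classes 0 and 1
X_test = np.vstack([rng.normal(0, 1, (80, 2)),
                    rng.normal(5, 0.5, (20, 2))])  # last 20: unseen class

# marginal "trap" class from a random sample of the test set
TRAP = 2
sample = rng.choice(len(X_test), 30, replace=False)
X_aug = np.vstack([X_train, X_test[sample]])
y_aug = np.concatenate([y_train, np.full(30, TRAP)])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_aug, y_aug)
flagged = np.where(clf.predict(X_test) == TRAP)[0]
print(f"{len(flagged)} candidate unknown unknowns flagged")
```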

