Supervised Discovery of Unknown Unknowns through Test Sample Mining (Student Abstract)

Given a fixed hypothesis space, defined to model class structure in a particular domain of application, unknown unknowns (u.u.s) are data examples that form classes in the feature space whose structure is not represented in a trained model. Accordingly, this leads to incorrect class prediction with high confidence, which represents one of the major sources of blind spots in machine learning. Our method seeks to reduce the structural mismatch between the training model and that of the target space in a supervised way. We illuminate further structure through cross-validation on a modified training model, set up to mine and trap u.u.s in a marginal training class, created from examples of a random sample of the test set. Contrary to previous approaches, our method simplifies the solution, as it does not rely on budgeted queries to an Oracle whose outcomes inform adjustments to training. In addition, our empirically results exhibit consistent performance improvements over baselines, on both synthetic and real-world data sets.

Download Full-text

An Incremental Classification Algorithm for Mining Data with Feature Space Heterogeneity

Mathematical Problems in Engineering ◽

10.1155/2014/327142 ◽

2014 ◽

Vol 2014 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Yu Wang

Keyword(s):

Feature Space ◽

Classification Problem ◽

Classification Algorithm ◽

Data Sets ◽

Real World Data ◽

Supervised Clustering ◽

Online Classification ◽

Efficiency And Effectiveness ◽

Feature Relevance ◽

Incremental Classification

Feature space heterogeneity often exists in many real world data sets so that some features are of different importance for classification over different subsets. Moreover, the pattern of feature space heterogeneity might dynamically change over time as more and more data are accumulated. In this paper, we develop an incremental classification algorithm, Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH), to address this problem. In our approach, supervised clustering is implemented to obtain a number of clusters such that samples in each cluster are from the same class. After the removal of outliers, relevance of features in each cluster is calculated based on their variations in this cluster. The feature relevance is incorporated into distance calculation for classification. The main advantage of SCCFSH lies in the fact that it is capable of solving a classification problem with feature space heterogeneity in an incremental way, which is favorable for online classification tasks with continuously changing data. Experimental results on a series of data sets and application to a database marketing problem show the efficiency and effectiveness of the proposed approach.

Download Full-text

Preventing Disparate Treatment in Sequential Decision Making

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/311 ◽

2018 ◽

Cited By ~ 1

Author(s):

Hoda Heidari ◽

Andreas Krause

Keyword(s):

Decision Making ◽

Learning Algorithm ◽

Feature Space ◽

Sequential Decision Making ◽

Data Sets ◽

Sequential Decision ◽

Real World Data ◽

Time Step ◽

Job Application ◽

Disparate Treatment

We study fairness in sequential decision making environments, where at each time step a learning algorithm receives data corresponding to a new individual (e.g. a new job application) and must make an irrevocable decision about him/her (e.g. whether to hire the applicant) based on observations made so far. In order to prevent cases of disparate treatment, our time-dependent notion of fairness requires algorithmic decisions to be consistent: if two individuals are similar in the feature space and arrive during the same time epoch, the algorithm must assign them to similar outcomes. We propose a general framework for post-processing predictions made by a black-box learning model, that guarantees the resulting sequence of outcomes is consistent. We show theoretically that imposing consistency will not significantly slow down learning. Our experiments on two real-world data sets illustrate and confirm this finding in practice.

Download Full-text

Feature Selection for Ridge Regression with Provable Guarantees

Neural Computation ◽

10.1162/neco_a_00816 ◽

2016 ◽

Vol 28 (4) ◽

pp. 716-742 ◽

Cited By ~ 9

Author(s):

Saurabh Paul ◽

Petros Drineas

Keyword(s):

Feature Selection ◽

Ridge Regression ◽

Feature Selection Method ◽

Feature Space ◽

Data Sets ◽

Real World Data ◽

Feature Selection Technique ◽

Worst Case ◽

Single Set ◽

Classification Function

We introduce single-set spectral sparsification as a deterministic sampling–based feature selection technique for regularized least-squares classification, which is the classification analog to ridge regression. The method is unsupervised and gives worst-case guarantees of the generalization power of the classification function after feature selection with respect to the classification function obtained using all features. We also introduce leverage-score sampling as an unsupervised randomized feature selection method for ridge regression. We provide risk bounds for both single-set spectral sparsification and leverage-score sampling on ridge regression in the fixed design setting and show that the risk in the sampled space is comparable to the risk in the full-feature space. We perform experiments on synthetic and real-world data sets; a subset of TechTC-300 data sets, to support our theory. Experimental results indicate that the proposed methods perform better than the existing feature selection methods.

Download Full-text

A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent*

Journal of Statistical Mechanics Theory and Experiment ◽

10.1088/1742-5468/ac3a77 ◽

2021 ◽

Vol 2021 (12) ◽

pp. 124006

Author(s):

Zhenyu Liao ◽

Romain Couillet ◽

Michael W Mahoney

Keyword(s):

Phase Transition ◽

Feature Space ◽

Matrix Analysis ◽

Gaussian Kernel ◽

Gram Matrix ◽

Data Sets ◽

Phase Transition Behavior ◽

Real World Data ◽

Random Fourier Features

Abstract This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples n, their dimension p, and the dimension of feature space N are all large and comparable. In this regime, the random RFF Gram matrix no longer converges to the well-known limiting Gaussian kernel matrix (as it does when N → ∞ alone), but it still has a tractable behavior that is captured by our analysis. This analysis also provides accurate estimates of training and test regression errors for large n, p, N. Based on these estimates, a precise characterization of two qualitatively different phases of learning, including the phase transition between them, is provided; and the corresponding double descent test error curve is derived from this phase transition behavior. These results do not depend on strong assumptions on the data distribution, and they perfectly match empirical results on real-world data sets.

Download Full-text

SVDD-Based Pattern Denoising

Neural Computation ◽

10.1162/neco.2007.19.7.1919 ◽

2007 ◽

Vol 19 (7) ◽

pp. 1919-1938 ◽

Cited By ~ 36

Author(s):

Jooyoung Park ◽

Daesung Kang ◽

Jongho Kim ◽

James T. Kwok ◽

Ivor W. Tsang

Keyword(s):

Test Pattern ◽

Main Idea ◽

Feature Space ◽

Training Data ◽

Support Vector ◽

Support Vector Data Description ◽

Data Sets ◽

Decision Boundary ◽

Vector Data ◽

Real World Data

The support vector data description (SVDD) is one of the best-known one-class support vector learning methods, in which one tries the strategy of using balls defined on the feature space in order to distinguish a set of normal data from all other possible abnormal objects. The major concern of this letter is to extend the main idea of SVDD to pattern denoising. Combining the geodesic projection to the spherical decision boundary resulting from the SVDD, together with solving the preimage problem, we propose a new method for pattern denoising. We first solve SVDD for the training data and then for each noisy test pattern, obtain its denoised feature by moving its feature vector along the geodesic on the manifold to the nearest decision boundary of the SVDD ball. Finally we find the location of the denoised pattern by obtaining the pre-image of the denoised feature. The applicability of the proposed method is illustrated by a number of toy and real-world data sets.

Download Full-text

An improved spectral clustering algorithm based on local neighbors in kernel space

Computer Science and Information Systems ◽

10.2298/csis110415064l ◽

2011 ◽

Vol 8 (4) ◽

pp. 1143-1157 ◽

Cited By ~ 5

Author(s):

Xinyue Liu ◽

Xing Yong ◽

Hongfei Lin

Keyword(s):

Real World ◽

Spectral Clustering ◽

Clustering Algorithm ◽

Sparse Matrix ◽

Feature Space ◽

Data Sets ◽

Kernel Space ◽

Real World Data ◽

World Data ◽

Linear Reconstruction

Similarity matrix is critical to the performance of spectral clustering. Mercer kernels have become popular largely due to its successes in applying kernel methods such as kernel PCA. A novel spectral clustering method is proposed based on local neighborhood in kernel space (SC-LNK), which assumes that each data point can be linearly reconstructed from its neighbors. The SC-LNK algorithm tries to project the data to a feature space by the Mercer kernel, and then learn a sparse matrix using linear reconstruction as the similarity graph for spectral clustering. Experiments have been performed on synthetic and real world data sets and have shown that spectral clustering based on linear reconstruction in kernel space outperforms the conventional spectral clustering and the other two algorithms, especially in real world data sets.

Download Full-text

Label distribution learning with label-specific features

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/460 ◽

2019 ◽

Cited By ~ 1

Author(s):

Tingting Ren ◽

Xiuyi Jia ◽

Weiwei Li ◽

Lei Chen ◽

Zechao Li

Keyword(s):

Feature Space ◽

Data Representation ◽

Data Sets ◽

Real World Data ◽

Feature Selection Problem ◽

Label Correlations ◽

The Common ◽

Label Distribution Learning ◽

Class Labels ◽

Label Distribution

Label distribution learning (LDL) is a novel machine learning paradigm to deal with label ambiguity issues by placing more emphasis on how relevant each label is to a particular instance. Many LDL algorithms have been proposed and most of them concentrate on the learning models, while few of them focus on the feature selection problem. All existing LDL models are built on a simple feature space in which all features are shared by all the class labels. However, this kind of traditional data representation strategy tends to select features that are distinguishable for all labels, but ignores label-specific features that are pertinent and discriminative for each class label. In this paper, we propose a novel LDL algorithm by leveraging label-specific features. The common features for all labels and specific features for each label are simultaneously learned to enhance the LDL model. Moreover, we also exploit the label correlations in the proposed LDL model. The experimental results on several real-world data sets validate the effectiveness of our method.

Download Full-text

TrustSVD: A Novel Trust-Based Matrix Factorization Model with User Trust and Item Ratings

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i11.422 ◽

2017 ◽

Vol 7 (11) ◽

pp. 7 ◽

Cited By ~ 1

Author(s):

K Sobha Rani

Keyword(s):

Matrix Factorization ◽

Social Trust ◽

State Of The Art ◽

Data Sets ◽

Real World Data ◽

Recommendation Algorithm ◽

Active User ◽

Factorization Model ◽

The Social ◽

Matrix Factorization Technique

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm SVD++ which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach TrustSVD achieves better accuracy than other ten counterparts, and can better handle the concerned issues.

Download Full-text

Hfinger: Malware HTTP Request Fingerprinting

Entropy ◽

10.3390/e23050507 ◽

2021 ◽

Vol 23 (5) ◽

pp. 507

Author(s):

Piotr Białczak ◽

Wojciech Mazurczyk

Keyword(s):

Real World ◽

Network Traffic ◽

Experimental Evaluation ◽

Data Sets ◽

Real World Data ◽

Malicious Software ◽

Default Mode ◽

World Data ◽

Effectiveness Analysis ◽

Http Protocol

Malicious software utilizes HTTP protocol for communication purposes, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To this aim, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools do not analyze all information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from the parts of the request such as URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets and we also compared Hfinger with the most related and popular existing tools such as FATT, Mercury, and p0f. The conducted effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, what is 8–34 times lower than existing tools. Moreover, unlike these tools, in default mode, Hfinger does not introduce collisions between malware and benign applications and achieves it by increasing the number of fingerprints by at most 3 times. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.

Download Full-text

Classification of jujube defects in small data sets based on transfer learning

Neural Computing and Applications ◽

10.1007/s00521-021-05715-2 ◽

2021 ◽

Author(s):

Jianping Ju ◽

Hong Zheng ◽

Xiaohang Xu ◽

Zhongyuan Guo ◽

Zhaohui Zheng ◽

...

Keyword(s):

Transfer Learning ◽

Loss Function ◽

Training Model ◽

Parameter Distribution ◽

Test Accuracy ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Small Data Sets

AbstractAlthough convolutional neural networks have achieved success in the field of image classification, there are still challenges in the field of agricultural product quality sorting such as machine vision-based jujube defects detection. The performance of jujube defect detection mainly depends on the feature extraction and the classifier used. Due to the diversity of the jujube materials and the variability of the testing environment, the traditional method of manually extracting the features often fails to meet the requirements of practical application. In this paper, a jujube sorting model in small data sets based on convolutional neural network and transfer learning is proposed to meet the actual demand of jujube defects detection. Firstly, the original images collected from the actual jujube sorting production line were pre-processed, and the data were augmented to establish a data set of five categories of jujube defects. The original CNN model is then improved by embedding the SE module and using the triplet loss function and the center loss function to replace the softmax loss function. Finally, the depth pre-training model on the ImageNet image data set was used to conduct training on the jujube defects data set, so that the parameters of the pre-training model could fit the parameter distribution of the jujube defects image, and the parameter distribution was transferred to the jujube defects data set to complete the transfer of the model and realize the detection and classification of the jujube defects. The classification results are visualized by heatmap through the analysis of classification accuracy and confusion matrix compared with the comparison models. The experimental results show that the SE-ResNet50-CL model optimizes the fine-grained classification problem of jujube defect recognition, and the test accuracy reaches 94.15%. The model has good stability and high recognition accuracy in complex environments.

Download Full-text