The state‐of‐the‐art on tours for dynamic visualization of high‐dimensional data

Because of the curse of dimensionality in high-dimensional data, a significant amount of recent research has addressed subspace clustering, which aims to discover clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous approaches have mostly generated redundant subspace clusters, which reduces clustering accuracy and increases running time. This article suggests a bottom-up density-based approach in which the cluster structure serves as a similarity measure for generating the optimal subspaces, which raises the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by considering the similarity of their cluster structures, combines them, and clusters the data again in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the proposed approach yields significantly better quality and runtime than the state of the art in clustering high-dimensional data.

Utility metric for unsupervised feature selection

PeerJ Computer Science ◽ 10.7717/peerj-cs.477 ◽ 2021 ◽ Vol 7 ◽ pp. e477

Author(s): Amalia Villa ◽ Abhijith Mundanad Narayanan ◽ Sabine Van Huffel ◽ Alexander Bertrand ◽ Carolina Varon

Keyword(s): Feature Selection ◽ Manifold Learning ◽ State Of The Art ◽ High Dimensional Data ◽ Subset Selection ◽ The State ◽ Computational Time ◽ High Dimensional ◽ Learning Stage ◽ Unsupervised Feature Selection

Feature selection techniques are very useful approaches for dimensionality reduction in data analysis. They provide interpretable results by reducing the dimensions of the data to a subset of the original set of features. When the data lack annotations, unsupervised feature selectors are required for their analysis. Several algorithms for this aim exist in the literature, but despite their wide applicability, they can be inaccessible or cumbersome to use, mainly because they require tuning non-intuitive parameters and have high computational demands. In this work, a publicly available, ready-to-use unsupervised feature selector is proposed, with results comparable to the state of the art at a much lower computational cost. The suggested approach belongs to the methods known as spectral feature selectors. These methods generally consist of two stages: manifold learning and subset selection. In the first stage, the underlying structures in the high-dimensional data are extracted, while in the second stage a subset of the features is selected to replicate these structures. This paper makes two contributions to this field, one for each of the stages involved. In the manifold learning stage, the effect of non-linearities in the data is explored using a radial basis function (RBF) kernel, for which an alternative solution for estimating the kernel parameter is presented for cases with high-dimensional data. Additionally, a backwards greedy approach based on the least-squares utility metric is proposed for the subset selection stage. The combination of these new ingredients results in the utility metric for unsupervised feature selection (U2FS) algorithm. The proposed U2FS algorithm succeeds in selecting the correct features in a simulation environment. In addition, the performance of the method on benchmark datasets is comparable to the state of the art, while requiring less computational time. Moreover, unlike the state of the art, U2FS does not require any tuning of parameters.
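The backwards greedy idea in the subset selection stage can be illustrated with a small sketch. It is not the U2FS implementation: the function names (`solve`, `ls_error`, `backward_greedy`), the tiny ridge term, and the use of a single target vector `y` are assumptions made for illustration. In a spectral feature selector the target would come from the manifold learning stage; here it is simply given.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ls_error(X, y, cols, ridge=1e-9):
    """Squared residual of the least-squares fit of y using only columns `cols` of X.

    The small ridge term is an assumption added to keep the normal equations solvable.
    """
    A = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
          for j in cols] for i in cols]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in cols]
    w = solve(A, b)
    return sum((t - sum(wi * row[i] for wi, i in zip(w, cols))) ** 2
               for row, t in zip(X, y))


def backward_greedy(X, y, n_keep):
    """Start from all features and repeatedly drop the least useful one.

    The utility of a feature is the increase in least-squares error caused by
    removing it, so the feature whose removal leaves the smallest error goes first.
    """
    cols = list(range(len(X[0])))
    while len(cols) > n_keep:
        drop = min(cols, key=lambda c: ls_error(X, y, [k for k in cols if k != c]))
        cols.remove(drop)
    return cols


# Toy data: y depends only on features 0 and 2; feature 1 is irrelevant.
X = [[1, 5, 0], [0, 3, 1], [2, 1, 1], [1, 2, 3], [3, 4, 2], [0, 0, 2]]
y = [2 * r[0] - r[2] for r in X]
selected = backward_greedy(X, y, n_keep=2)
```

On this toy data the irrelevant feature is the one removed, because dropping it leaves the fit essentially unchanged while dropping either of the others does not.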

Dual Query: Practical Private Query Release for High Dimensional Data

Journal of Privacy and Confidentiality ◽ 10.29012/jpc.v7i2.650 ◽ 2017 ◽ Vol 7 (2)

Author(s): Marco Gaboardi ◽ Emilio Jesús Gallego Arias ◽ Justin Hsu ◽ Aaron Roth ◽ Zhiwei Steven Wu

Keyword(s): State Of The Art ◽ High Dimensional Data ◽ Integer Program ◽ The State ◽ High Dimensional ◽ Worst Case ◽ Case Complexity ◽ Worst Case Complexity ◽ Multiple Orders ◽ High Dimensional Datasets

We present a practical, differentially private algorithm for answering a large number of queries on high dimensional datasets. Like all algorithms for this task, ours necessarily has worst-case complexity exponential in the dimension of the data. However, our algorithm packages the computationally hard step into a concisely defined integer program, which can be solved non-privately using standard solvers. We prove accuracy and privacy theorems for our algorithm, and then demonstrate experimentally that our algorithm performs well in practice. For example, our algorithm can efficiently and accurately answer millions of queries on the Netflix dataset, which has over 17,000 attributes; this is an improvement on the state of the art by multiple orders of magnitude.

Learning in high-dimensional multimedia data: the state of the art

Multimedia Systems ◽ 10.1007/s00530-015-0494-1 ◽ 2015 ◽ Vol 23 (3) ◽ pp. 303-313 ◽ Cited By ~ 46

Author(s): Lianli Gao ◽ Jingkuan Song ◽ Xingyi Liu ◽ Junming Shao ◽ Jiajun Liu ◽ ...

Keyword(s): State Of The Art ◽ The State ◽ Multimedia Data ◽ High Dimensional

Navo Minority Over-sampling Technique (NMOTe): A Consistent Performance Booster on Imbalanced Datasets

Journal of Electronics and Informatics - September 2019 ◽ 10.36548/jei.2020.2.004 ◽ 2020 ◽ Vol 2 (2) ◽ pp. 96-136

Author(s): Navoneel Chakrabarty ◽ Sanket Biswas

Keyword(s): State Of The Art ◽ High Dimensional Data ◽ Optimal Solution ◽ Sampling Technique ◽ High Dimensional ◽ Real World Data ◽ Imbalanced Datasets ◽ Comprehensive Overview ◽ Unequal Distribution ◽ Data Imbalance

Imbalanced data refers to a problem in machine learning where instances are unequally distributed across classes. A classifier trained on such data often becomes biased in favour of the majority class, and the bias is amplified for high-dimensional data. To address this problem, there exist many real-world data mining techniques, such as over-sampling and under-sampling, that can reduce the data imbalance. The Synthetic Minority Oversampling Technique (SMOTe) is one such state-of-the-art and popular solution for tackling class imbalance, even on high-dimensional data. In this work, a novel and consistent oversampling algorithm is proposed that can further enhance classification performance, especially on binary imbalanced datasets. It is named NMOTe (Navo Minority Oversampling Technique) and is presented as an upgraded and superior alternative to the existing techniques. A critical analysis and comprehensive overview of the literature has been carried out to gain deeper insight into the problem statements and to motivate the need for an optimal solution. The performance of NMOTe on some standard datasets is established in this work to give a statistical understanding of why it has edged out the existing state of the art to become the most robust technique for solving the two-class data imbalance problem.
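For orientation, the core interpolation step of SMOTe-style oversampling can be sketched as follows. This illustrates the baseline technique named in the abstract, not NMOTe, whose details the abstract does not give. The function name `smote`, its parameters, and the fixed seed are assumptions chosen for the example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # A random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region the minority class already occupies.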

Data synthesis via differentially private markov random fields

Proceedings of the VLDB Endowment ◽ 10.14778/3476249.3476272 ◽ 2021 ◽ Vol 14 (11) ◽ pp. 2190-2202

Author(s): Kuntai Cai ◽ Xiaoyu Lei ◽ Jianxin Wei ◽ Xiaokui Xiao

Keyword(s): Markov Random Fields ◽ Input Data ◽ Differential Privacy ◽ State Of The Art ◽ Synthetic Data ◽ The State ◽ High Dimensional ◽ Data Synthesis ◽ Markov Random ◽ Low Dimensional

This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D, and then using them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limit the choices of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality of the resulting synthetic data. To address this deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choice of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of counting queries and classification tasks conducted on the synthetic data generated.
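To make "noisy low-dimensional marginal" concrete, here is a minimal sketch that releases one marginal with the generic Laplace mechanism. It is not PrivMRF's procedure, and the abstract does not say which noise distribution the paper uses. The function name `noisy_marginal`, the record layout (a list of dicts), and the assumption that neighbouring datasets differ by adding or removing one record (L1 sensitivity 1) are all choices made for illustration.

```python
import random
from itertools import product


def noisy_marginal(records, attrs, domains, epsilon, seed=None):
    """Return a Laplace-noised contingency table over the attributes in `attrs`.

    Every cell of the domain gets noise, including cells with a true count of
    zero. Python's `random` module is not a cryptographically secure source of
    randomness, so this is only suitable as an illustration.
    """
    rng = random.Random(seed)
    counts = {cell: 0 for cell in product(*(domains[a] for a in attrs))}
    for r in records:
        counts[tuple(r[a] for a in attrs)] += 1
    scale = 1.0 / epsilon  # sensitivity 1 under the add/remove-one-record assumption
    rate = 1.0 / scale
    # The difference of two i.i.d. exponentials with this rate is Laplace(0, scale).
    return {cell: c + rng.expovariate(rate) - rng.expovariate(rate)
            for cell, c in counts.items()}


records = [
    {"smoker": 0, "sex": 1},
    {"smoker": 1, "sex": 1},
    {"smoker": 0, "sex": 0},
    {"smoker": 0, "sex": 1},
]
domains = {"smoker": [0, 1], "sex": [0, 1]}
marginal = noisy_marginal(records, ["smoker", "sex"], domains, epsilon=1.0, seed=42)
```

A smaller `epsilon` gives stronger privacy but noisier counts, which is why the choice of which marginals to release matters for the quality of the synthetic data.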

Evaluating Top-k Skyline Queries Efficiently

Advanced Database Query Systems - Advances in Data Mining and Database Management ◽ 10.4018/978-1-60960-475-2.ch004 ◽ 2011 ◽ pp. 102-117

Author(s): Marlene Goncalves ◽ María Esther Vidal

Keyword(s): State Of The Art ◽ Search Space ◽ Large Datasets ◽ The State ◽ Experimental Results ◽ High Dimensional ◽ Skyline Queries ◽ Speed Up

Criteria that induce a Skyline naturally represent a user's preference conditions, which are useful for discarding irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large. To identify the best k points among the Skyline, the Top-k Skyline approach has been proposed. This chapter describes existing solutions and proposes to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space by computing only the subset of the Skyline that is required to produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best meet the multidimensional criteria. The chapter's authors have empirically studied the quality of TKSI, and their experimental results show that TKSI may be able to speed up the computation of the Top-k Skyline by at least 50% relative to the state-of-the-art solutions.
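The notions of Skyline and Top-k Skyline can be illustrated with a brute-force sketch. This is not TKSI, which exists precisely to avoid computing the whole Skyline; it is a naive baseline. The sketch assumes larger values are better on every dimension, and it takes skyline frequency to mean the number of non-empty subsets of dimensions in which a point is not dominated. All function names are hypothetical.

```python
from itertools import combinations


def dominates(p, q, dims):
    """p dominates q on dims if p is at least as good everywhere and strictly better somewhere."""
    return all(p[d] >= q[d] for d in dims) and any(p[d] > q[d] for d in dims)


def skyline(points, dims):
    """Points that no other point dominates on the given dimensions."""
    return [p for p in points if not any(dominates(q, p, dims) for q in points)]


def skyline_frequency(p, points, n_dims):
    """Number of non-empty dimension subsets in which p is a skyline point."""
    return sum(
        1
        for r in range(1, n_dims + 1)
        for dims in combinations(range(n_dims), r)
        if not any(dominates(q, p, dims) for q in points)
    )


def top_k_skyline(points, k):
    """Rank the full-space skyline by skyline frequency and keep the best k."""
    n_dims = len(points[0])
    sky = skyline(points, range(n_dims))
    return sorted(sky, key=lambda p: skyline_frequency(p, points, n_dims),
                  reverse=True)[:k]


points = [(5, 1), (4, 4), (1, 5), (2, 2), (3, 3)]
best = top_k_skyline(points, k=2)
```

The enumeration over dimension subsets is exponential in the number of dimensions, which shows why a method that prunes the search space is attractive for high-dimensional Skylines.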

Projection with Gaussian Kernel for Person Re-Identification

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽ 10.20965/jaciii.2020.p0638 ◽ 2020 ◽ Vol 24 (5) ◽ pp. 638-647

Author(s): Dao Nam Anh ◽ Thuy-Binh Nguyen ◽ Thi-Lan Le ◽ ...

Keyword(s): State Of The Art ◽ Dimensional Space ◽ The State ◽ Gaussian Kernel ◽ High Dimensional ◽ Local Geometry ◽ Single Shot ◽ Camera Network ◽ Art Methods ◽ Different Characteristics

Person re-identification (ReID), the task of associating the detected images of a person as he/she moves through a non-overlapping camera network, faces several challenges, including variations in illumination, viewpoint, and occlusion. To ensure good performance for person ReID, the state-of-the-art methods have leveraged different characteristics for person representation. As a result, a high-dimensional feature vector is extracted and used in the person matching step. However, each feature plays a specific role in distinguishing one person from the others. This paper proposes a method for person ReID wherein the correspondences between descriptors in high-dimensional space can be achieved via explicit feature selection and appropriate projection with a Gaussian kernel. The advantage of the proposed method is that it allows simultaneous matching of the descriptors while preserving the local geometry of the manifolds. Different experiments were conducted on both single-shot and multi-shot person ReID datasets. The experimental results demonstrate that the proposed method outperforms the state-of-the-art methods.

A Hierarchical Gamma Mixture Model-Based Method for Classification of High-Dimensional Data

Entropy ◽ 10.3390/e21090906 ◽ 2019 ◽ Vol 21 (9) ◽ pp. 906

Author(s): Muhammad Azhar ◽ Mark Junjie Li ◽ Joshua Zhexue Huang

Keyword(s): Mixture Model ◽ State Of The Art ◽ High Dimensional Data ◽ Subspace Clustering ◽ High Dimensional ◽ Data Sets ◽ Ensemble Clustering ◽ Class Label ◽ Number Of Clusters ◽ Number Of Classes

Data classification is an important research topic in the field of data mining. With the rapid development of social media sites and IoT devices, data have grown tremendously in volume and complexity, resulting in many large and complex high-dimensional datasets. Classifying such high-dimensional complex data with a large number of classes has been a great challenge for current state-of-the-art methods. This paper presents a novel, hierarchical, gamma mixture model-based unsupervised method for classifying high-dimensional data with a large number of classes. In this method, we first partition the features of the dataset into feature strata by using k-means. Then, a set of subspace data sets is generated from the feature strata by using the stratified subspace sampling method. After that, the GMM Tree algorithm is used to identify the number of clusters and the initial clusters in each subspace dataset, and these initial cluster centers are passed to k-means to generate base subspace clustering results. Then, the subspace clustering results are integrated into an object cluster association (OCA) matrix by using the link-based method. The ensemble clustering result is generated from the OCA matrix by the k-means algorithm with the number of clusters identified by the GMM Tree algorithm. After the ensemble clustering result is produced, the purity of each cluster is computed and the dominant class label is assigned to it. A new object is classified by computing the distance between it and the center of each cluster in the classifier; the object is assigned the class label of the cluster at the shortest distance. A series of experiments were conducted on twelve synthetic and eight real-world data sets, with different numbers of classes, features, and objects. The experimental results show that the new method outperforms other state-of-the-art techniques in classifying data on most of the data sets.
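The last two steps described above, labelling each cluster with its dominant class and classifying a new object by its nearest cluster center, can be sketched as follows. The clustering itself (feature strata, GMM Tree, OCA matrix) is not reproduced. The cluster assignments and centers are assumed to be given, and the function names `label_clusters` and `classify` are hypothetical.

```python
from collections import Counter


def label_clusters(assignments, labels):
    """Map each cluster id to (dominant class label, purity).

    Purity is the fraction of the cluster's members that carry the dominant label.
    """
    by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, []).append(label)
    result = {}
    for cluster_id, members in by_cluster.items():
        label, count = Counter(members).most_common(1)[0]
        result[cluster_id] = (label, count / len(members))
    return result


def classify(x, centers, cluster_labels):
    """Assign x the label of the cluster whose center is closest (squared Euclidean distance)."""
    nearest = min(
        centers,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
    )
    return cluster_labels[nearest][0]


assignments = [0, 0, 0, 1, 1]          # cluster id of each training object
labels = ["a", "a", "b", "b", "b"]     # known class of each training object
centers = {0: [0.0, 0.0], 1: [10.0, 10.0]}

cluster_labels = label_clusters(assignments, labels)
prediction = classify([1.0, 1.0], centers, cluster_labels)
```

Here cluster 0 is labelled "a" with purity 2/3 and cluster 1 is labelled "b" with purity 1, so an object near the origin is classified as "a".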

Vibration-Based Outlier Detection on High Dimensional Data

International Journal of Artificial Intelligence Tools ◽ 10.1142/s0218213016500135 ◽ 2016 ◽ Vol 25 (03) ◽ pp. 1650013

Author(s): Shuyin Xia ◽ Guoyin Wang ◽ Hong Yu ◽ Qun Liu ◽ Jin Wang

Keyword(s): Outlier Detection ◽ Time Complexity ◽ State Of The Art ◽ High Dimensional Data ◽ Difficult Problem ◽ Index Structure ◽ High Dimensional ◽ Basic Model ◽ Traditional Approaches ◽ Better Than

Outlier detection is a difficult problem because its time complexity is quadratic or cubic in most cases, which makes it necessary to develop corresponding acceleration algorithms. Since the main acceleration algorithms rely on an index structure (e.g., an R-tree), those approaches deteriorate as the dimensionality increases. In this paper, an approach named VBOD (vibration-based outlier detection) is proposed, in which the main variants assess the vibration. Since the basic model and the approximation algorithm FASTVBOD do not need to compute the index structure, their performance is less sensitive to increasing dimensions than that of traditional approaches. The basic model of this approach has only quadratic time complexity. Furthermore, accelerated algorithms decrease the time complexity to [Formula: see text]. Another advantage is that this approach does not rely on any parameter selection. FASTVBOD was compared with other state-of-the-art algorithms, and it performed much better than the other methods, especially on high-dimensional data.

Subspace Clustering for High-Dimensional Data Using Cluster Structure Similarity
