The state‐of‐the‐art on tours for dynamic visualization of high‐dimensional data

Author(s):  
Stuart Lee ◽  
Dianne Cook ◽  
Natalia Silva ◽  
Ursula Laa ◽  
Nicholas Spyrison ◽  
...  
2018 ◽  
Vol 14 (3) ◽  
pp. 38-55 ◽  
Author(s):  
Kavan Fatehi ◽  
Mohsen Rezvani ◽  
Mansoor Fateh ◽  
Mohammad-Reza Pajoohan

Because of the curse of dimensionality in high-dimensional data, a significant amount of recent research has addressed subspace clustering, which aims to discover clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous studies have mostly generated redundant subspace clusters, which reduces clustering accuracy and increases the running time of the algorithms. This article proposes a bottom-up density-based approach in which the cluster structure serves as a similarity measure for generating the optimal subspaces, which raises the accuracy of the subspace clustering. Based on this idea, the algorithm discovers similar subspaces by considering similarity in their cluster structure, combines them, and clusters the data again in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the proposed approach is significantly better in quality and runtime than state-of-the-art methods for clustering high-dimensional data.


2021 ◽  
Vol 7 ◽  
pp. e477
Author(s):  
Amalia Villa ◽  
Abhijith Mundanad Narayanan ◽  
Sabine Van Huffel ◽  
Alexander Bertrand ◽  
Carolina Varon

Feature selection techniques are very useful approaches for dimensionality reduction in data analysis. They provide interpretable results by reducing the dimensions of the data to a subset of the original set of features. When the data lack annotations, unsupervised feature selectors are required for their analysis. Several algorithms for this aim exist in the literature, but despite their wide applicability, they can be inaccessible or cumbersome to use, mainly due to the need for tuning non-intuitive parameters and the high computational demands. In this work, a publicly available ready-to-use unsupervised feature selector is proposed, with results comparable to the state-of-the-art at a much lower computational cost. The suggested approach belongs to the methods known as spectral feature selectors. These methods generally consist of two stages: manifold learning and subset selection. In the first stage, the underlying structures in the high-dimensional data are extracted, while in the second stage a subset of the features is selected to replicate these structures. This paper makes two contributions to this field, one for each of the stages involved. In the manifold learning stage, the effect of non-linearities in the data is explored, making use of a radial basis function (RBF) kernel, for which an alternative solution for the estimation of the kernel parameter is presented for cases with high-dimensional data. Additionally, the use of a backwards greedy approach based on the least-squares utility metric is proposed for the subset selection stage. The combination of these new ingredients results in the utility metric for unsupervised feature selection (U2FS) algorithm. The proposed U2FS algorithm succeeds in selecting the correct features in a simulation environment. In addition, the performance of the method on benchmark datasets is comparable to the state-of-the-art, while requiring less computational time. Moreover, unlike the state-of-the-art, U2FS does not require any tuning of parameters.


2017 ◽  
Vol 7 (2) ◽  
Author(s):  
Marco Gaboardi ◽  
Emilio Jesús Gallego Arias ◽  
Justin Hsu ◽  
Aaron Roth ◽  
Zhiwei Steven Wu

We present a practical, differentially private algorithm for answering a large number of queries on high dimensional datasets. Like all algorithms for this task, ours necessarily has worst-case complexity exponential in the dimension of the data. However, our algorithm packages the computationally hard step into a concisely defined integer program, which can be solved non-privately using standard solvers. We prove accuracy and privacy theorems for our algorithm, and then demonstrate experimentally that our algorithm performs well in practice. For example, our algorithm can efficiently and accurately answer millions of queries on the Netflix dataset, which has over 17,000 attributes; this is an improvement on the state of the art by multiple orders of magnitude.


2015 ◽  
Vol 23 (3) ◽  
pp. 303-313 ◽  
Author(s):  
Lianli Gao ◽  
Jingkuan Song ◽  
Xingyi Liu ◽  
Junming Shao ◽  
Jiajun Liu ◽  
...  

2020 ◽  
Vol 2 (2) ◽  
pp. 96-136
Author(s):  
Navoneel Chakrabarty ◽  
Sanket Biswas

Imbalanced data refers to a problem in machine learning where instances are unequally distributed across classes. A classifier trained on such data often becomes biased in favour of the majority class, and the bias is amplified in high-dimensional data. To address this problem, there exist many real-world data mining techniques, such as over-sampling and under-sampling, which can reduce the data imbalance. Synthetic Minority Oversampling Technique (SMOTe) is one such state-of-the-art and popular solution for tackling class imbalance, even on high-dimensional data. In this work, a novel and consistent oversampling algorithm is proposed that can further enhance classification performance, especially on binary imbalanced datasets. It is named NMOTe (Navo Minority Oversampling Technique) and is presented as an upgraded and superior alternative to the existing techniques. A critical analysis and comprehensive overview of the literature is provided to give deeper insight into the problem statement and to motivate the need for a better solution. The performance of NMOTe on some standard datasets is established in this work to give a statistical understanding of why it has edged out the existing state-of-the-art to become the most robust technique for solving the two-class data imbalance problem.
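The oversampling idea mentioned in this abstract can be illustrated with a minimal sketch of SMOTE-style interpolation. This is not NMOTe, whose details the abstract does not give; the function name `smote_like_oversample`, the parameter defaults, and the toy data are assumptions made for illustration only.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy minority class with two features
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=3)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.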


2021 ◽  
Vol 14 (11) ◽  
pp. 2190-2202
Author(s):  
Kuntai Cai ◽  
Xiaoyu Lei ◽  
Jianxin Wei ◽  
Xiaokui Xiao

This paper studies the synthesis of high-dimensional datasets with differential privacy (DP). The state-of-the-art solution addresses this problem by first generating a set M of noisy low-dimensional marginals of the input data D, and then using them to approximate the data distribution in D for synthetic data generation. However, it imposes several constraints on M that considerably limit the choices of marginals. This makes it difficult to capture all important correlations among attributes, which in turn degrades the quality of the resulting synthetic data. To address this deficiency, we propose PrivMRF, a method that (i) also utilizes a set M of low-dimensional marginals for synthesizing high-dimensional data with DP, but (ii) provides a high degree of flexibility in the choices of marginals. The key idea of PrivMRF is to select an appropriate M to construct a Markov random field (MRF) that models the correlations among the attributes in the input data, and then use the MRF for data synthesis. Experimental results on four benchmark datasets show that PrivMRF consistently outperforms the state of the art in terms of the accuracy of counting queries and classification tasks conducted on the synthetic data generated.
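As a rough illustration of what a "noisy low-dimensional marginal" is, the sketch below adds Laplace noise to a contingency table over a few attributes. It is not PrivMRF: the abstract does not specify the noise mechanism, the marginal selection, or the MRF construction, so the Laplace mechanism, the function name `noisy_marginal`, and the toy records are all assumptions.

```python
import random
from collections import Counter


def noisy_marginal(rows, attrs, domains, epsilon, seed=0):
    """Return a Laplace-noised contingency table over the attributes in attrs.

    Assumes neighbouring datasets differ by adding or removing one record,
    so the table of counts has L1 sensitivity 1.
    """
    rng = random.Random(seed)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)

    def laplace():
        # The difference of two Exp(rate=epsilon) draws is Laplace
        # distributed with scale 1/epsilon.
        return rng.expovariate(epsilon) - rng.expovariate(epsilon)

    # Enumerate every cell of the marginal, including empty ones
    cells = [()]
    for a in attrs:
        cells = [c + (v,) for c in cells for v in domains[a]]
    return {cell: counts.get(cell, 0) + laplace() for cell in cells}


# Hypothetical toy records and attribute domains
rows = [
    {"age": "young", "smoker": "no"},
    {"age": "young", "smoker": "yes"},
    {"age": "old", "smoker": "no"},
    {"age": "old", "smoker": "no"},
]
domains = {"age": ["young", "old"], "smoker": ["yes", "no"]}
table = noisy_marginal(rows, ["age", "smoker"], domains, epsilon=1.0)
```

Empty cells receive noise as well, because publishing only the non-empty cells would reveal which attribute combinations occur in the data.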


Author(s):  
Marlene Goncalves ◽  
María Esther Vidal

Criteria that induce a Skyline naturally represent a user's preference conditions and are useful for discarding irrelevant data in large datasets. However, in the presence of high-dimensional Skyline spaces, the size of the Skyline can still be very large. To identify the best k points among the Skyline, the Top-k Skyline approach has been proposed. This chapter describes existing solutions and proposes to use the TKSI algorithm for the Top-k Skyline problem. TKSI reduces the search space by computing only the subset of the Skyline that is required to produce the top-k objects. In addition, the Skyline Frequency Metric is implemented to discriminate among the Skyline objects those that best meet the multidimensional criteria. This chapter's authors have empirically studied the quality of TKSI, and their experimental results show that TKSI may be able to speed up the computation of the Top-k Skyline by at least 50% relative to the state-of-the-art solutions.
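The Top-k Skyline problem described here can be sketched with a brute-force baseline. This is not TKSI, which avoids materialising the full Skyline; the version below computes everything naively and ranks Skyline points by skyline frequency, taken here to mean the number of non-empty subspaces in which a point is a Skyline point. Function names, the "lower is better" convention, and the hotel data are assumptions.

```python
from itertools import combinations


def dominates(a, b, dims):
    """a dominates b on dims: no worse everywhere, strictly better somewhere.
    Lower values are treated as better."""
    return all(a[d] <= b[d] for d in dims) and any(a[d] < b[d] for d in dims)


def skyline(points, dims):
    """Points not dominated by any other point on the given dimensions."""
    return [p for p in points if not any(dominates(q, p, dims) for q in points)]


def skyline_frequency(points, n_dims):
    """Count the non-empty subspaces in which each point is a skyline point."""
    freq = {p: 0 for p in points}
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):
            for p in skyline(points, dims):
                freq[p] += 1
    return freq


def top_k_skyline(points, n_dims, k):
    """Rank the full-space skyline by skyline frequency and keep the best k."""
    full = skyline(points, tuple(range(n_dims)))
    freq = skyline_frequency(points, n_dims)
    return sorted(full, key=lambda p: -freq[p])[:k]


# Hypothetical hotels as (price, distance to beach); points must be distinct
hotels = [(50, 8), (60, 3), (80, 2), (70, 9), (55, 5)]
best = top_k_skyline(hotels, n_dims=2, k=2)
```

The subspace enumeration is exponential in the number of dimensions, which is the kind of cost that motivates algorithms computing only the part of the Skyline needed for the top-k answer.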


Author(s):  
Dao Nam Anh ◽  
Thuy-Binh Nguyen ◽  
Thi-Lan Le ◽  
...  

Person re-identification (ReID), the task of associating the detected images of a person as he/she moves through a non-overlapping camera network, faces several challenges, including variations in illumination and viewpoint, and occlusion. To ensure good performance for person ReID, the state-of-the-art methods have leveraged different characteristics for person representation. As a result, a high-dimensional feature vector is extracted and used in the person matching step. However, each feature plays a specific role in distinguishing one person from the others. This paper proposes a method for person ReID wherein the correspondences between descriptors in high-dimensional space can be achieved via explicit feature selection and appropriate projection with a Gaussian kernel. The advantage of the proposed method is that it allows simultaneous matching of the descriptors while preserving the local geometry of the manifolds. Different experiments were conducted on both single-shot and multi-shot person ReID datasets. The experimental results demonstrate that the proposed method outperforms the state-of-the-art methods.


Entropy ◽  
2019 ◽  
Vol 21 (9) ◽  
pp. 906
Author(s):  
Muhammad Azhar ◽  
Mark Junjie Li ◽  
Joshua Zhexue Huang

Data classification is an important research topic in the field of data mining. With the rapid development of social media sites and IoT devices, data have grown tremendously in volume and complexity, resulting in many large and complex high-dimensional datasets. Classifying such high-dimensional complex data with a large number of classes has been a great challenge for current state-of-the-art methods. This paper presents a novel, hierarchical, gamma mixture model-based unsupervised method for classifying high-dimensional data with a large number of classes. In this method, we first partition the features of the dataset into feature strata by using k-means. Then, a set of subspace datasets is generated from the feature strata by using the stratified subspace sampling method. After that, the GMM Tree algorithm is used to identify the number of clusters and the initial clusters in each subspace dataset, and these initial cluster centers are passed to k-means to generate base subspace clustering results. Then, the subspace clustering results are integrated into an object cluster association (OCA) matrix by using the link-based method. The ensemble clustering result is generated from the OCA matrix by the k-means algorithm with the number of clusters identified by the GMM Tree algorithm. After producing the ensemble clustering result, the dominant class label is assigned to each cluster after computing the purity. A new object is classified by computing the distance between it and the center of each cluster in the classifier, and the class label of the cluster with the shortest distance is assigned to the new object. A series of experiments were conducted on twelve synthetic and eight real-world datasets, with different numbers of classes, features, and objects. The experimental results show that the new method outperforms other state-of-the-art techniques for classifying data in most of the datasets.
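The last two steps of this pipeline (labelling each cluster with its dominant class via purity, then classifying a new object by its nearest cluster center) can be sketched as follows. The earlier stages (feature stratification, GMM Tree, OCA matrix) are not reproduced; the cluster assignments and centers are taken as given, and all names and data here are hypothetical.

```python
from collections import Counter


def label_clusters(assignments, labels):
    """Give each cluster its dominant class label and its purity
    (the fraction of members carrying that label)."""
    by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, []).append(label)
    result = {}
    for cluster_id, members in by_cluster.items():
        label, count = Counter(members).most_common(1)[0]
        result[cluster_id] = (label, count / len(members))
    return result


def classify(obj, centers, cluster_labels):
    """Assign obj the label of the cluster whose center is closest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = min(centers, key=lambda cid: sq_dist(obj, centers[cid]))
    return cluster_labels[nearest][0]


# Hypothetical clustering output: cluster id and known class per object
assignments = [0, 0, 0, 1, 1]
labels = ["a", "a", "b", "b", "b"]
centers = {0: (0.0, 0.0), 1: (5.0, 5.0)}

cluster_labels = label_clusters(assignments, labels)
predicted = classify((4.0, 4.5), centers, cluster_labels)
```

Squared Euclidean distance is used because it gives the same nearest center as Euclidean distance without the square root; the abstract does not state which distance the authors use.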


2016 ◽  
Vol 25 (03) ◽  
pp. 1650013
Author(s):  
Shuyin Xia ◽  
Guoyin Wang ◽  
Hong Yu ◽  
Qun Liu ◽  
Jin Wang

Outlier detection is a difficult problem because its time complexity is quadratic or cubic in most cases, which makes it necessary to develop corresponding acceleration algorithms. Since an index structure (e.g., the R-tree) is used in the main acceleration algorithms, those approaches deteriorate as the dimensionality increases. In this paper, an approach named VBOD (vibration-based outlier detection) is proposed, in which the main variants assess the vibration. Since the basic model and the approximation algorithm FASTVBOD do not need to compute an index structure, their performance is less sensitive to increasing dimensionality than that of traditional approaches. The basic model of this approach has only quadratic time complexity. Furthermore, accelerated algorithms decrease the time complexity to [Formula: see text]. Another advantage is that this approach does not rely on any parameter selection. FASTVBOD was compared with other state-of-the-art algorithms, and it performed much better than the other methods, especially on high-dimensional data.

