Machine learning of high dimensional data on a noisy quantum processor

AbstractQuantum kernel methods show promise for accelerating data analysis by efficiently learning relationships between input data points that have been encoded into an exponentially large Hilbert space. While this technique has been used successfully in small-scale experiments on synthetic datasets, the practical challenges of scaling to large circuits on noisy hardware have not been thoroughly addressed. Here, we present our findings from experimentally implementing a quantum kernel classifier on real high-dimensional data taken from the domain of cosmology using Google’s universal quantum processor, Sycamore. We construct a circuit ansatz that preserves kernel magnitudes that typically otherwise vanish due to an exponentially growing Hilbert space, and implement error mitigation specific to the task of computing quantum kernels on near-term hardware. Our experiment utilizes 17 qubits to classify uncompressed 67 dimensional data resulting in classification accuracy on a test set that is comparable to noiseless simulation.

Download Full-text

Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

Applied Sciences ◽

10.3390/app11020472 ◽

2021 ◽

Vol 11 (2) ◽

pp. 472

Author(s):

Hyeongmin Cho ◽

Sangkyun Lee

Keyword(s):

Machine Learning ◽

Data Quality ◽

Large Scale ◽

High Dimensional Data ◽

Quality Measures ◽

Training Data ◽

Measure Data ◽

High Dimensional ◽

Small Scale ◽

Class Separability

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.

Download Full-text

Mahalanobis distance informed by clustering

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iay011 ◽

2018 ◽

Vol 8 (2) ◽

pp. 377-406

Author(s):

Almog Lahav ◽

Ronen Talmon ◽

Yuval Kluger

Keyword(s):

Mahalanobis Distance ◽

High Dimensional Data ◽

Hidden Variables ◽

Real Data ◽

Risk Groups ◽

High Dimensional ◽

Data Sets ◽

Kaplan Meier ◽

Data Points ◽

Survival Plot

Abstract A fundamental question in data analysis, machine learning and signal processing is how to compare between data points. The choice of the distance metric is specifically challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored, which is the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real data of gene expression for lung adenocarcinomas (lung cancer). By using the proposed metric we found a partition of subjects to risk groups with a good separation between their Kaplan–Meier survival plot.

Download Full-text

A Novel Density-based Technique for Outlier Detection of High Dimensional Data Utilizing Full Feature Space

Information Technology And Control ◽

10.5755/j01.itc.50.1.25588 ◽

2021 ◽

Vol 50 (1) ◽

pp. 138-152

Author(s):

Mujeeb Ur Rehman ◽

Dost Muhammad Khan

Keyword(s):

Data Mining ◽

Outlier Detection ◽

High Dimensional Data ◽

Research Work ◽

Feature Space ◽

High Dimensional ◽

Data Set ◽

Data Points ◽

Low Dimensional ◽

Intrinsic Feature

Recently, anomaly detection has acquired a realistic response from data mining scientists as a graph of its reputation has increased smoothly in various practical domains like product marketing, fraud detection, medical diagnosis, fault detection and so many other fields. High dimensional data subjected to outlier detection poses exceptional challenges for data mining experts and it is because of natural problems of the curse of dimensionality and resemblance of distant and adjoining points. Traditional algorithms and techniques were experimented on full feature space regarding outlier detection. Customary methodologies concentrate largely on low dimensional data and hence show ineffectiveness while discovering anomalies in a data set comprised of a high number of dimensions. It becomes a very difficult and tiresome job to dig out anomalies present in high dimensional data set when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of its intrinsic feature i.e., the distance between observations approaches to zero as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores deviation among all data points and embeds its findings inside well established density-based techniques. This is a state of art technique as it gives a new breadth of research towards resolving inherent problems of high dimensional data where outliers reside within clusters having different densities. A high dimensional dataset from UCI Machine Learning Repository is chosen to test the proposed technique and then its results are compared with that of density-based techniques to evaluate its efficiency.

Download Full-text

A Preview on Subspace Clustering of High Dimensional Data

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v6i3.4466 ◽

2013 ◽

Vol 6 (3) ◽

pp. 441-448 ◽

Cited By ~ 1

Author(s):

Sajid Nagi ◽

Dhruba Kumar Bhattacharyya ◽

Jugal K. Kalita

Keyword(s):

Search Strategy ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Expression Data ◽

Clustering Methods ◽

Top Down ◽

Data Points ◽

Low Dimensional ◽

Entire Dataset

When clustering high dimensional data, traditional clustering methods are found to be lacking since they consider all of the dimensions of the dataset in discovering clusters whereas only some of the dimensions are relevant. This may give rise to subspaces within the dataset where clusters may be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple and maybe overlapping subspaces of high dimensional data, allowing better clustering of the data points, is known as Subspace Clustering. There are two major approaches to subspace clustering based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start from finding low dimensional dense regions, and then use them to form clusters. Based on a survey on subspace clustering, we identify the challenges and issues involved with clustering gene expression data.

Download Full-text

Optimal Clustering and Cluster Identity in Understanding High-Dimensional Data Spaces with Tightly Distributed Points

Machine Learning and Knowledge Extraction ◽

10.3390/make1020042 ◽

2019 ◽

Vol 1 (2) ◽

pp. 715-744 ◽

Cited By ~ 1

Author(s):

Oliver Chikumbo ◽

Vincent Granville

Keyword(s):

High Dimensional Data ◽

Optimal Number ◽

High Dimensional ◽

Distributed Data ◽

Objective Functions ◽

Number Of Clusters ◽

Data Points ◽

Pareto Fronts ◽

Criterion Method ◽

Optimal Number Of Clusters

The sensitivity of the elbow rule in determining an optimal number of clusters in high-dimensional spaces that are characterized by tightly distributed data points is demonstrated. The high-dimensional data samples are not artificially generated, but they are taken from a real world evolutionary many-objective optimization. They comprise of Pareto fronts from the last 10 generations of an evolutionary optimization computation with 14 objective functions. The choice for analyzing Pareto fronts is strategic, as it is squarely intended to benefit the user who only needs one solution to implement from the Pareto set, and therefore a systematic means of reducing the cardinality of solutions is imperative. As such, clustering the data and identifying the cluster from which to pick the desired solution is covered in this manuscript, highlighting the implementation of the elbow rule and the use of hyper-radial distances for cluster identity. The Calinski-Harabasz statistic was favored for determining the criteria used in the elbow rule because of its robustness. The statistic takes into account the variance within clusters and also the variance between the clusters. This exercise also opened an opportunity to revisit the justification of using the highest Calinski-Harabasz criterion for determining the optimal number of clusters for multivariate data. The elbow rule predicted the maximum end of the optimal number of clusters, and the highest Calinski-Harabasz criterion method favored the number of clusters at the lower end. Both results are used in a unique way for understanding high-dimensional data, despite being inconclusive regarding which of the two methods determine the true optimal number of clusters.

Download Full-text

Networked Exponential Families For Big Data Over Networks

10.36227/techrxiv.12674198 ◽

2020 ◽

Author(s):

Alexander Jung

Keyword(s):

Machine Learning ◽

Big Data ◽

Network Structure ◽

Message Passing ◽

Likelihood Function ◽

High Dimensional Data ◽

Exponential Families ◽

High Dimensional ◽

Parameter Estimates ◽

Data Points

We propose networked exponential families for non-parametric machine learning from massive network-structured datasets (“big data over networks”). High-dimensional data points are interpreted as the realizations of a random process distributed according to some exponential family. Networked exponential families allow to jointly leverage the information contained in high-dimensional data points and their network structure. For data points representing individuals, we obtain perfectly personalized models which enable high-precision medicine or more general recommendation systems.We learn the parameters of networked exponential families, using the network Lasso which implicitly pools (or clusters) the data points according to the intrinsic network structure and a local likelihood function. Our main theoretical result characterizes how the accuracy of network Lasso depends on the network structure and the information geometry of the node-wise exponential families. The network Lasso can be implemented as highly scalable message-passing over the data network. Such message passing is appealing for federated machine learning relying on edge computing. The proposed method is also privacy preserving in the sense that no raw data but only parameter (estimates) are shared among different nodes.

Download Full-text

Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i4.pp2261-2271 ◽

2018 ◽

Vol 8 (4) ◽

pp. 2261

Author(s):

Pushpalatha R. ◽

K. Meenakshi Sundaram

Keyword(s):

Spectral Clustering ◽

High Dimensional Data ◽

Data Retrieval ◽

True Positive Rate ◽

Vantage Point ◽

Space Complexity ◽

High Dimensional ◽

Retrieval Time ◽

User Query ◽

Data Points

Data mining is an essential process for identifying the patterns in large datasets through machine learning techniques and database systems. Clustering of high dimensional data is becoming very challenging process due to curse of dimensionality. In addition, space complexity and data retrieval performance was not improved. In order to overcome the limitation, Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes the densely populated high dimensional data points for effective data retrieval based on user query. A Normalized Spectral Clustering Algorithm is used to group similar high dimensional data points. After that, Vantage Point Tree is constructed for indexing the clustered data points with minimum space complexity. At last, indexed data gets retrieved based on user query using Vantage Point Tree based Data Retrieval Algorithm. This in turn helps to improve true positive rate with minimum retrieval time. The performance is measured in terms of space complexity, true positive rate and data retrieval time with El Nino weather data sets from UCI Machine Learning Repository. An experimental result shows that the proposed technique is able to reduce the space complexity by 33% and also reduces the data retrieval time by 24% when compared to state-of-the-art-works.

Download Full-text

A Survey on Anomalous Topic Discovery in High Dimensional Data

International Journal of Scientific Research in Science Engineering and Technology ◽

10.32628/ijsrset196148 ◽

2019 ◽

pp. 188-194

Keyword(s):

Anomaly Detection ◽

High Dimensional Data ◽

High Dimensional ◽

Small Scale ◽

Topic Discovery ◽

Detection Strategy ◽

Discrete Information ◽

Special Cases ◽

The Individual ◽

Investigation Strategy

Generally, finding of an unusual information i.e. anomalies from discrete information leads towards the better comprehension of atypical conduct of patterns and to recognize the base of anomalies. Anomalies can be characterized as the patterns that don't have ordinary conduct. It is likewise called as anomaly detection. Anomaly detection procedures are for the most part utilized for misrepresentation detection in charge cards, bank extortion; organize interruption and so on. It can be eluded as, oddities, deviation, special cases or exception. Such sort of patterns can't be seen to the diagnostic meaning of an exception, as uncommon question till it has been incorporated legitimately. A bunch investigation strategy is utilized to recognize small scale clusters shaped by these anomalies. In this paper, we show different techniques existed for recognizing anomalies from datasets which just distinguishes the individual anomalies. Issue with singular anomaly detection strategy that identifies anomalies utilizing the whole highlights commonly neglect to identify such anomalies. A strategy to recognize bunch of anomalous information join show atypical area of a little subset of highlights. This technique utilizes an invalid model to for commonplace topic and after that different test to identify all clusters of strange patterns.

Download Full-text

Functional Dimension Reduction for Chemometrics

Encyclopedia of Artificial Intelligence ◽

10.4018/978-1-59904-849-9.ch100 ◽

2011 ◽

pp. 661-666

Author(s):

Tuomas Kärnä ◽

Amaury Lendasse

Keyword(s):

Least Squares ◽

High Dimensional Data ◽

Computational Time ◽

High Dimensional ◽

Support Vector ◽

Data Set ◽

Spectrometric Data ◽

Function Fitting ◽

Svm Model ◽

Data Points

High dimensional data are becoming more and more common in data analysis. This is especially true in fields that are related to spectrometric data, such as chemometrics. Due to development of more accurate spectrometers one can obtain spectra of thousands of data points. Such a high dimensional data are problematic in machine learning due to increased computational time and the curse of dimensionality (Haykin, 1999; Verleysen & François, 2005; Bengio, Delalleau, & Le Roux, 2006). It is therefore advisable to reduce the dimensionality of the data. In the case of chemometrics, the spectra are usually rather smooth and low on noise, so function fitting is a convenient tool for dimensionality reduction. The fitting is obtained by fixing a set of basis functions and computing the fitting weights according to the least squares error criterion. This article describes a unsupervised method for finding a good function basis that is specifically built to suit the data set at hand. The basis consists of a set of Gaussian functions that are optimized for an accurate fitting. The obtained weights are further scaled using a Delta Test (DT) to improve the prediction performance. Least Squares Support Vector Machine (LS-SVM) model is used for estimation.

Download Full-text

Scalable Clustering of High-Dimensional Data Technique Using SPCM with Ant Colony Optimization Intelligence

The Scientific World JOURNAL ◽

10.1155/2015/107650 ◽

2015 ◽

Vol 2015 ◽

pp. 1-5 ◽

Cited By ~ 4

Author(s):

Thenmozhi Srinivasan ◽

Balasubramanie Palanisamy

Keyword(s):

Ant Colony Optimization ◽

Ant Colony Algorithm ◽

High Dimensional Data ◽

Poor Quality ◽

Ant Colony ◽

High Dimensional ◽

Cluster Number ◽

Scalable Clustering ◽

Cluster Data ◽

Synthetic Datasets

Clusters of high-dimensional data techniques are emerging, according to data noisy and poor quality challenges. This paper has been developed to cluster data using high-dimensional similarity based PCM (SPCM), with ant colony optimization intelligence which is effective in clustering nonspatial data without getting knowledge about cluster number from the user. The PCM becomes similarity based by using mountain method with it. Though this is efficient clustering, it is checked for optimization using ant colony algorithm with swarm intelligence. Thus the scalable clustering technique is obtained and the evaluation results are checked with synthetic datasets.

Download Full-text