Uniform Variance Product Quantization

2014 ◽  
Vol 651-653 ◽  
pp. 2224-2227
Author(s):  
Qin Zhen Guo ◽  
Zhi Zeng ◽  
Shu Wu Zhang

Product quantization (PQ) is an efficient and effective vector quantization approach for fast approximate nearest neighbor (ANN) search, especially for high-dimensional data. The basic idea of PQ is to decompose the original data space into the Cartesian product of several low-dimensional subspaces and then quantize each subspace separately with the same number of codewords. However, the performance of PQ depends largely on the distribution of the original data: if the distributions of the subspaces differ greatly, PQ performs poorly, as shown in our experiments. In this paper, we propose a uniform variance product quantization (UVPQ) scheme that projects the data with a uniform variance projection before decomposing it, which minimizes the distribution difference among the subspaces of the whole space. UVPQ guarantees good results no matter how the data are rotated. Extensive experiments have verified the superiority of UVPQ over PQ for ANN search.
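The sketch below shows the overall data flow described in the abstract: balance variance across subspaces, then quantize each subspace with its own codebook. It is a minimal illustration only; the function names are invented here, and a simple variance-balancing permutation of dimensions stands in for the paper's uniform variance projection (which is a learned rotation).

```python
# Minimal PQ sketch with a variance-balancing dimension permutation (an
# illustrative stand-in for UVPQ's projection, not the authors' exact method).
import numpy as np
from sklearn.cluster import KMeans

def uniform_variance_permutation(X, n_subspaces):
    """Reorder dimensions so each subspace receives a similar total variance:
    dimensions sorted by descending variance are dealt out in a snake order."""
    order = np.argsort(-X.var(axis=0))
    buckets = [[] for _ in range(n_subspaces)]
    for i, dim in enumerate(order):
        fwd = (i // n_subspaces) % 2 == 0
        b = i % n_subspaces if fwd else n_subspaces - 1 - (i % n_subspaces)
        buckets[b].append(dim)
    return np.concatenate([np.array(b, dtype=int) for b in buckets])

def pq_train(X, n_subspaces=8, n_codewords=16):
    """Train one k-means codebook per subspace."""
    subs = np.array_split(np.arange(X.shape[1]), n_subspaces)
    books = [KMeans(n_clusters=n_codewords, n_init=4).fit(X[:, s]).cluster_centers_
             for s in subs]
    return subs, books

def pq_encode(X, subs, books):
    """Encode every vector as one codeword index per subspace."""
    codes = np.empty((X.shape[0], len(subs)), dtype=np.int32)
    for j, (s, C) in enumerate(zip(subs, books)):
        d2 = ((X[:, s][:, None, :] - C[None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(axis=1)
    return codes

# Usage: balance the subspace variances first, then quantize.
X = np.random.randn(1000, 64).astype(np.float32)
perm = uniform_variance_permutation(X, n_subspaces=8)
subs, books = pq_train(X[:, perm], n_subspaces=8, n_codewords=16)
codes = pq_encode(X[:, perm], subs, books)
print(codes.shape)   # (1000, 8): eight codeword indices per vector
```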

2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in the data processing of machine learning and data mining, making the processing of high-dimensional data more efficient. Dimensionality reduction extracts a low-dimensional feature representation of high-dimensional data, and an effective dimensionality reduction method not only retains most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning methods have achieved good results in dimensionality reduction, their performance depends on the number of labeled training samples. With the growth of information on the Internet, labeling data requires more resources and becomes more difficult. Therefore, using unsupervised learning to learn data features has great research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that the mapping from high-dimensional features to low-dimensional features becomes efficient and the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the results of the variational auto-encoder (VAE), and the proposed method achieves significant improvement over the other comparison methods.
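A minimal sketch of a multilayer VAE used purely for dimensionality reduction is given below: the low-dimensional code is the mean of the latent distribution. Layer sizes, the mean-squared-error reconstruction term, and the absence of a training loop are simplifying assumptions, not the paper's exact architecture.

```python
# Minimal multilayer VAE sketch (PyTorch); assumptions noted in the lead-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, in_dim, hidden=512, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden // 2), nn.ReLU())
        self.mu = nn.Linear(hidden // 2, latent)
        self.logvar = nn.Linear(hidden // 2, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden // 2), nn.ReLU(),
                                 nn.Linear(hidden // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

x = torch.randn(16, 1000)            # toy batch of 1000-dimensional inputs
model = VAE(in_dim=1000)
recon, mu, logvar = model(x)
print(elbo_loss(recon, x, mu, logvar).item())
# After training, model.mu(model.enc(x)) is the low-dimensional representation.
```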


2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Jingpei Dan ◽  
Weiren Shi ◽  
Fangyan Dong ◽  
Kaoru Hirota

A time series representation, piecewise trend approximation (PTA), is proposed to improve the efficiency of time series data mining in high-dimensional, large databases. PTA represents time series in a concise form while retaining the main trends of the original series; the dimensionality of the original data is therefore reduced and the key features are maintained. Unlike representations based on the original data space, PTA transforms the original data space into the feature space of ratios between consecutive data points in the original time series, whose sign and magnitude indicate the direction and degree of the local trend, respectively. Based on this ratio-based feature space, segmentation is performed so that any two adjacent segments have different trends, and each segment is then approximated by the ratio between its first and last points. To validate the proposed PTA, it is compared with the classical time series representations PAA and APCA on two classical datasets using the commonly used K-NN classification algorithm. PTA achieves 3.55% and 2.33% higher classification accuracy on the ControlChart dataset and 8.94% and 7.07% higher on the Mixed-BagShapes dataset, respectively. The results indicate that the proposed PTA is effective for high-dimensional time series data mining.
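The following sketch illustrates the PTA idea under stated assumptions: the "ratio" is taken to be the relative change between consecutive points, and a new segment starts whenever the sign of that ratio flips. The paper's exact ratio definition and segmentation rule may differ.

```python
# Piecewise trend approximation sketch (assumptions as stated above).
import numpy as np

def pta(series, eps=1e-12):
    x = np.asarray(series, dtype=float)
    r = (x[1:] - x[:-1]) / (np.abs(x[:-1]) + eps)    # local trend: sign = direction, |r| = degree
    # Start a new segment wherever the trend direction changes.
    breaks = np.where(np.sign(r[1:]) != np.sign(r[:-1]))[0] + 1
    starts = np.concatenate(([0], breaks))
    ends = np.concatenate((breaks, [len(x) - 1]))
    # Each segment is represented by the ratio between its last and first points.
    return [((x[e] - x[s]) / (np.abs(x[s]) + eps), e - s) for s, e in zip(starts, ends)]

# Rising, falling, rising: three segments with their trend ratios and lengths.
print(pta([1.0, 1.2, 1.5, 1.4, 1.1, 1.3, 1.8]))
```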


Author(s):  
Jing Wang ◽  
Jinglin Zhou ◽  
Xiaolu Chen

Industrial data variables exhibit high dimensionality and strong nonlinear correlations. Traditional multivariate statistical monitoring methods, such as PCA, PLS, CCA, and FDA, are only suitable for processing high-dimensional data with linear correlations. The kernel mapping method is the most common technique for dealing with nonlinearity: it projects the original data from the low-dimensional space into a high-dimensional space through appropriate kernel functions so as to achieve linear separability in the new space. However, projecting from a low-dimensional space to a high-dimensional one contradicts the practical requirement of reducing the dimensionality of the data, so kernel-based methods inevitably increase the complexity of data processing.
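A minimal, self-contained illustration of the kernel mapping idea is given below using kernel PCA on a toy nonlinear dataset; it is an illustrative stand-in, not the monitoring methods discussed in the chapter, and the kernel and parameters are arbitrary choices.

```python
# Kernel trick illustration: two concentric circles become (nearly) linearly
# separable after an implicit RBF mapping, while linear PCA cannot split them.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

lin = PCA(n_components=2).fit_transform(X)
ker = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# A single-component threshold typically splits the classes much better in the
# kernel space than in the original/linear-PCA space.
for name, Z in [("linear PCA", lin), ("kernel PCA", ker)]:
    best = 0.0
    for c in range(Z.shape[1]):
        pred = Z[:, c] > np.median(Z[:, c])
        best = max(best, (pred == y).mean(), (pred != y).mean())
    print(f"{name}: best one-component split accuracy = {best:.2f}")
```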


Author(s):  
MIAO CHENG ◽  
BIN FANG ◽  
YUAN YAN TANG ◽  
HENGXIN CHEN

Many problems in pattern classification and feature extraction involve dimensionality reduction as a necessary processing step. Traditional manifold learning algorithms, such as ISOMAP, LLE, and Laplacian Eigenmap, seek the low-dimensional manifold in an unsupervised way, while local discriminant analysis methods identify the underlying supervised submanifold structures. In addition, it is well known that the intraclass null subspace contains the most discriminative information when the original data lie in a high-dimensional space. In this paper, we seek the local null space in accordance with the null space LDA (NLDA) approach and show that its computational expense mainly depends on the number of connected edges in the graphs, which may still be unacceptable when a large number of samples is involved. To address this limitation, an improved local null space algorithm is proposed that employs the penalty subspace to approximate the local discriminant subspace. Compared with the traditional approach, the proposed method is more efficient and avoids the overload problem, while only a slight amount of discriminant power is lost theoretically. A comparative study on classification shows that the performance of the approximate algorithm is quite close to that of the genuine one.
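For reference, here is a compact sketch of the classical (global) null-space LDA idea the paper builds on: project onto the null space of the within-class scatter, then keep the directions that maximize between-class scatter there. The local and penalty-subspace refinements proposed in the paper are not reproduced.

```python
# Classical null-space LDA sketch; non-trivial only when samples < dimensions.
import numpy as np

def nlda(X, y, n_components):
    classes = np.unique(y)
    mean = X.mean(axis=0)
    Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in classes)
    Sb = sum((y == c).sum() * np.outer(X[y == c].mean(0) - mean,
                                       X[y == c].mean(0) - mean) for c in classes)
    # Null space of the within-class scatter: eigenvectors with ~zero eigenvalue.
    w, V = np.linalg.eigh(Sw)
    null = V[:, w < 1e-10 * w.max()]
    # Maximize between-class scatter inside that null space.
    wb, Vb = np.linalg.eigh(null.T @ Sb @ null)
    return null @ Vb[:, np.argsort(wb)[::-1][:n_components]]   # project with X @ W

# Example: 30 samples in 100 dimensions, 3 classes (n < d, so the null space exists).
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 10)
X = rng.standard_normal((30, 100)) + y[:, None]
W = nlda(X, y, n_components=2)
print((X @ W).shape)
```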


Author(s):  
Iain M. Johnstone ◽  
D. Michael Titterington

Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data, and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.
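A small self-contained example of the sparse linear model setting sketched above follows, using the lasso as one standard sparsity-exploiting estimator; the choice of estimator, dimensions, and regularization strength are assumptions made purely for illustration, and the overview surveys several such approaches.

```python
# Sparse high-dimensional linear model: many more features than samples,
# parameter vector mostly zeros; the lasso exploits that sparsity.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 50, 500, 5                       # 50 samples, 500 features, 5 nonzero coefficients
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 3.0
y = X @ beta + 0.5 * rng.standard_normal(n)

fit = Lasso(alpha=0.2).fit(X, y)
print("nonzero coefficients recovered:", np.flatnonzero(fit.coef_))
```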


Author(s):  
Lifang Zhang ◽  
Qi Shen ◽  
Defang Li ◽  
Guocan Feng ◽  
Xin Tang ◽  
...  

Approximate Nearest Neighbor (ANN) search is a challenging problem given the explosive growth of high-dimensional, large-scale data in recent years. Promising techniques for ANN search include hashing methods, which generate compact binary codes by designing effective hash functions. However, the lack of an optimal regularization is the key limitation of most existing hash functions. To this end, a new method called Adaptive Hashing with Sparse Modification (AHSM) is proposed. In AHSM, codes are vertices of the hypercube and the projection matrix is divided into two separate matrices: the data are first rotated by an orthogonal matrix and then modified by a sparse matrix. The sparse matrix is learned as a regularization term of the hash function, which is used to avoid overfitting and reduce quantization distortion. Overall, AHSM has two advantages: it improves accuracy and does so without any increase in time cost. Furthermore, we extend AHSM to a supervised version, called Supervised Adaptive Hashing with Sparse Modification (SAHSM), by applying Canonical Correlation Analysis (CCA) to the original data. Experiments show that AHSM consistently surpasses several state-of-the-art hashing methods on four datasets. We also compare three unsupervised hashing methods with their corresponding supervised versions (including SAHSM) on three labeled datasets; SAHSM outperforms the other methods on most of the hash bit settings.
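The sketch below only shows the encoding form suggested by the abstract: after PCA, data are rotated by an orthogonal matrix R, adjusted by a sparse matrix S, and binarized by sign. How R and S are actually learned (the alternating optimization and the sparsity regularizer) is not reproduced; here R is random and S is zero-initialized purely to show the data flow.

```python
# AHSM-style encoding sketch: codes = sign(X_pca @ (R + S)), R orthogonal, S sparse.
import numpy as np
from sklearn.decomposition import PCA

def hash_codes(X, R, S):
    return (X @ (R + S) > 0).astype(np.uint8)        # vertices of the binary hypercube

rng = np.random.default_rng(0)
n_bits = 16
X = rng.standard_normal((2000, 128)).astype(np.float32)

Xp = PCA(n_components=n_bits).fit_transform(X)        # one PCA dimension per bit
R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))   # random orthogonal rotation
S = np.zeros((n_bits, n_bits))                        # placeholder for the learned sparse modification

codes = hash_codes(Xp, R, S)
print(codes.shape, codes[:2])
```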


Author(s):  
Xiaofeng Zhu ◽  
Cong Lei ◽  
Hao Yu ◽  
Yonggang Li ◽  
Jiangzhang Gan ◽  
...  

In this paper, we propose conducting Robust Graph Dimensionality Reduction (RGDR) by learning a transformation matrix that maps the original high-dimensional data into their low-dimensional intrinsic space without the influence of outliers. To do this, we propose 1) adaptively learning three variables simultaneously, i.e., a reverse graph embedding of the original data, a transformation matrix, and a graph matrix preserving the local similarity of the original data in their low-dimensional intrinsic space; and 2) employing robust estimators to prevent outliers from affecting the optimization of these three matrices. As a result, the original data are cleaned by two strategies, i.e., a prediction of the original data based on the three resulting variables and the robust estimators, so that the transformation matrix can be learned from an accurately estimated intrinsic space with the help of the reverse graph embedding and the graph matrix. Moreover, we propose a new optimization algorithm for the resulting objective function and theoretically prove its convergence. Experimental results indicate that the proposed method outperforms all the comparison methods on different classification tasks.
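The heavily simplified sketch below only shows how the named ingredients fit together: a transformation matrix W, a graph matrix encoding local similarity, and a robust reweighting step that down-weights likely outliers. The graph construction, reweighting heuristic, and alternating loop are assumptions for illustration; this is not the authors' RGDR objective or optimization algorithm.

```python
# Simplified graph-preserving projection with iterative robust reweighting.
import numpy as np

def rgdr_sketch(X, dim=2, k=10, n_iter=5):
    n = X.shape[0]
    # k-NN similarity graph on the original data (Gaussian weights).
    sq = (X ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    A = np.exp(-d2 / (np.median(d2) + 1e-12))
    far = np.argsort(d2, axis=1)[:, k + 1:]           # keep self + k nearest per row
    np.put_along_axis(A, far, 0.0, axis=1)
    A = np.maximum(A, A.T)

    w = np.ones(n)                                    # robust sample weights
    for _ in range(n_iter):
        L = np.diag(A.sum(1)) - A                     # graph Laplacian
        Xw = X * w[:, None]
        # Smallest eigenvectors of X' L X give a locality-preserving projection.
        M = Xw.T @ L @ Xw + 1e-6 * np.eye(X.shape[1])
        _, vecs = np.linalg.eigh(M)
        W = vecs[:, :dim]
        # Down-weight samples whose embedding disagrees with their graph neighbours.
        resid = np.linalg.norm(L @ (X @ W), axis=1)
        w = 1.0 / (1.0 + resid / (np.median(resid) + 1e-12))
    return W, X @ W

W, Z = rgdr_sketch(np.random.randn(300, 20))
print(W.shape, Z.shape)
```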


2016 ◽  
Vol 2016 ◽  
pp. 1-16 ◽  
Author(s):  
Li Jiang ◽  
Shunsheng Guo

The high-dimensional features of defective bearings usually include redundant and irrelevant information, which degrades diagnosis performance. Thus, it is critical to extract sensitive low-dimensional characteristics to improve diagnosis performance. This paper proposes modified kernel marginal Fisher analysis (MKMFA) for feature extraction with dimensionality reduction. Owing to its outstanding performance in enhancing intraclass compactness and interclass dispersibility, MKMFA is capable of effectively extracting the sensitive low-dimensional manifold characteristics beneficial to subsequent pattern classification, even with few training samples. An MKMFA-based fault diagnosis model is presented and applied to identify different bearing faults. It first utilizes MKMFA to directly extract the low-dimensional manifold characteristics from the raw time-series signal samples in the high-dimensional ambient space. Subsequently, the sensitive low-dimensional characteristics in the feature space are fed into a K-nearest neighbor classifier to distinguish the various fault patterns. Results of four-fault-type and ten-fault-severity bearing fault diagnosis experiments show the feasibility and superiority of the proposed scheme in comparison with five other methods.
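The two-stage structure of the diagnosis model (nonlinear feature extraction, then K-nearest neighbor classification) is sketched below. MKMFA itself is not implemented; kernel PCA is used only as a stand-in reducer, and the toy signal data are fabricated so the pipeline is runnable end to end.

```python
# Feature extraction -> KNN diagnosis pipeline sketch (kernel PCA as a stand-in for MKMFA).
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for raw vibration segments: 4 "fault types", 50 samples each, length 512.
X = np.concatenate([rng.standard_normal((50, 512)) + f for f in range(4)])
y = np.repeat(np.arange(4), 50)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

reducer = KernelPCA(n_components=8, kernel="rbf", gamma=1e-3).fit(Xtr)
clf = KNeighborsClassifier(n_neighbors=5).fit(reducer.transform(Xtr), ytr)
print("test accuracy:", clf.score(reducer.transform(Xte), yte))
```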


Information ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 190
Author(s):  
Xinpan Yuan ◽  
Qunfeng Liu ◽  
Jun Long ◽  
Lei Hu ◽  
Songlin Wang

Image retrieval, or content-based image retrieval (CBIR), can be transformed into the calculation of the distance between image feature vectors: the closer the vectors are, the higher the image similarity. In image retrieval systems for large-scale datasets, approximate nearest-neighbor (ANN) search can quickly obtain the top k images closest to the query image, which is the Top-k problem in the field of information retrieval. With traditional ANN algorithms, such as KD-Tree, R-Tree, and M-Tree, the computing time grows exponentially as the dimensionality of the image feature vector increases, due to the curse of dimensionality. To reduce the calculation time and improve the efficiency of image retrieval, we propose an ANN search algorithm based on the Product Quantization Table (PQTable). After quantizing and compressing the image feature vectors with the product quantization algorithm, we construct the PQTable image index structure, which speeds up image retrieval. We also propose a multi-PQTable query strategy for ANN search. In addition, we generate several nearest-neighbor vectors for each sub-compressed vector of the query vector to reduce the failure rate and improve the recall of image retrieval. Theoretical analysis and experimental verification show that the multi-PQTable query strategy and the generation of several nearest-neighbor vectors are correct and efficient.
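A minimal PQTable-style lookup is sketched below under stated assumptions: database vectors are PQ-encoded, the code tuple serves as a hash-table key, and a query probes the table with combinations of the few nearest codewords in each subspace (the "several nearest-neighbor vectors per sub-vector" idea). The multi-table splitting used in the paper is omitted for brevity, and all names and parameters are illustrative.

```python
# PQ encoding + table lookup with multi-codeword probing per subspace.
import itertools
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def train_codebooks(X, m=4, ks=16):
    subs = np.array_split(np.arange(X.shape[1]), m)
    return subs, [KMeans(n_clusters=ks, n_init=4).fit(X[:, s]).cluster_centers_ for s in subs]

def encode(x, subs, books):
    return tuple(int(((x[s] - C) ** 2).sum(1).argmin()) for s, C in zip(subs, books))

def build_table(X, subs, books):
    table = defaultdict(list)
    for i, x in enumerate(X):
        table[encode(x, subs, books)].append(i)
    return table

def query(q, subs, books, table, probes=2):
    # For each subspace, take the `probes` nearest codewords, then probe every combination.
    cand = [np.argsort(((q[s] - C) ** 2).sum(1))[:probes] for s, C in zip(subs, books)]
    hits = []
    for key in itertools.product(*cand):
        hits.extend(table.get(tuple(int(c) for c in key), []))
    return hits

X = np.random.randn(5000, 32).astype(np.float32)
subs, books = train_codebooks(X)
table = build_table(X, subs, books)
print(query(X[0], subs, books, table)[:10])   # candidate ids; the query's own id is among them
```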

