Dimensionality and Its Reduction

Author(s):  
Andrew J. Connolly ◽  
Jacob T. VanderPlas ◽  
Alexander Gray ◽  
...  

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the complexity as well as the size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques related to concepts discussed earlier in the context of Gaussian distributions, density estimation, and information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques that address some of the weaknesses of PCA.
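As a minimal, editorial illustration of the core technique (not taken from the chapter, and using simulated rather than astronomical data), the following Python sketch projects a correlated, high-dimensional data set onto its leading principal components with scikit-learn:

```python
# Minimal PCA sketch (illustrative only; the data below are simulated,
# not the data sets used in the chapter).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Simulate 500 "objects" measured in 100 correlated dimensions:
# a few latent factors plus noise, so most variance lives in a small subspace.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 100))

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)          # project onto the leading components
print(pca.explained_variance_ratio_[:5])  # most variance sits in the first ~3 PCs
```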

2020 ◽  
Author(s):  
Anna Morozova ◽  
Rania Rebbah ◽  
M. Alexandra Pais

Geomagnetic field (GMF) variations from external sources are classified as regular diurnal variations or variations occurring during periods of disturbance. The most significant regular variations are the quiet solar daily variation (Sq) and the disturbance daily variation (SD). These variations have well recognized daily cycles and need to be accounted for before analysis of the disturbed field. Preliminary analysis of the GMF variations shows that principal component analysis (PCA) is a useful tool for extracting regular variations of the GMF; however, the requirements on data set length, geomagnetic activity level, etc., still need to be established.

Here we present preliminary results of a PCA-based Sq extraction procedure applied to Coimbra Geomagnetic Observatory (COI) measurements of the geomagnetic field components H, X, Y and Z between 2007 and 2015. The PCA-based Sq curves are compared with the standard ones obtained using the 5 international quiet days (IQD) per month. PCA was applied to data sets of different lengths: either a one-month data set from a single year (2007-2015), or data series for the same month from different years (2007-2015) combined. For most of the analyzed years, the first PCA mode (PC1) was identified as the SD variation and the second mode (PC2) as the Sq variation.
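As a hedged sketch of the kind of analysis described above (not the authors' code), the following assumes the hourly values of one field component for a single month have already been arranged into a days-by-hours matrix; the data below are placeholders, and the function name `daily_modes` is hypothetical:

```python
# Sketch of a PCA-based extraction of daily variation modes; station data,
# calibration and IQD selection are not reproduced here.
import numpy as np
from sklearn.decomposition import PCA

def daily_modes(hourly, n_modes=2):
    """hourly: (n_days, 24) matrix of one field component; return leading daily modes."""
    X = hourly - hourly.mean(axis=0)   # remove the mean daily curve
    pca = PCA(n_components=n_modes)
    scores = pca.fit_transform(X)      # day-to-day amplitude of each mode
    modes = pca.components_            # 24-point daily curves (PC1, PC2, ...)
    return modes, scores, pca.explained_variance_ratio_

hourly = np.random.default_rng(0).normal(size=(30, 24))   # placeholder month of data
modes, scores, var_ratio = daily_modes(hourly)
# In the analysis above, PC1 was typically identified with SD and PC2 with Sq.
```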


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high dimensional data sets. The raw input data set may have many dimensions, and analysis may consume time and lead to wrong predictions if unnecessary attributes are considered. Dimensionality reduction techniques can therefore reduce the dimensions of the input data, enabling accurate prediction at lower cost. In this paper, different machine learning approaches used for dimensionality reduction, such as PCA, singular value decomposition (SVD), linear discriminant analysis (LDA), kernel principal component analysis, and artificial neural networks, are studied.
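A brief, illustrative comparison of several of the surveyed techniques is sketched below using scikit-learn on a toy labelled data set; this is not the experimental setup of the paper, and the parameter choices are arbitrary:

```python
# Hedged sketch comparing a few dimensionality reduction techniques.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

reducers = {
    "PCA": PCA(n_components=10),
    "SVD": TruncatedSVD(n_components=10),
    "LDA": LinearDiscriminantAnalysis(n_components=9),   # at most n_classes - 1
    "kPCA": KernelPCA(n_components=10, kernel="rbf", gamma=1e-3),
}
for name, red in reducers.items():
    # LDA is supervised and needs the labels; the others are unsupervised.
    Z = red.fit_transform(X, y) if name == "LDA" else red.fit_transform(X)
    print(name, Z.shape)
```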


2008 ◽  
Vol 57 (10) ◽  
pp. 1659-1666 ◽  
Author(s):  
Kris Villez ◽  
Magda Ruiz ◽  
Gürkan Sin ◽  
Joan Colomer ◽  
Christian Rosén ◽  
...  

A methodology based on principal component analysis (PCA) and clustering is evaluated for process monitoring and process analysis of a pilot-scale SBR removing nitrogen and phosphorus. The first step of this method is to build a multi-way PCA (MPCA) model using the historical process data. In the second step, the principal scores and the Q-statistics resulting from the MPCA model are fed to the LAMDA clustering algorithm. This procedure is iterated twice. The first iteration provides an efficient and effective discrimination between normal and abnormal operational conditions. The second iteration allows a clear-cut discrimination of the operational changes applied over the SBR's history. Importantly, this procedure helped to identify changes in the process behaviour that could not have been detected by relying solely on visual inspection of the online SBR data set (which is traditionally the case in practice). Hence, the PCA-based clustering methodology is a promising tool for efficiently interpreting and analysing SBR process behaviour using large historical online data sets.
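The two-stage idea can be sketched as follows. LAMDA is not available as a standard library routine, so a generic clustering algorithm (k-means) stands in for it here, and the unfolded batch data are random placeholders rather than the SBR measurements:

```python
# Sketch: fit a PCA model on unfolded batch data, then cluster the principal
# scores together with the Q-statistic (squared reconstruction error).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def scores_and_q(X, n_components):
    pca = PCA(n_components=n_components).fit(X)
    T = pca.transform(X)                              # principal scores
    residual = X - pca.inverse_transform(T)
    Q = np.sum(residual**2, axis=1)                   # Q-statistic per batch
    return T, Q

# One row per SBR cycle, columns = (variables x time) unfolded; placeholder data.
X_unfolded = np.random.default_rng(1).normal(size=(200, 60))
T, Q = scores_and_q(X_unfolded, n_components=5)
features = np.column_stack([T, Q])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(features)  # stand-in for LAMDA
```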


Author(s):  
Shofiqul Islam ◽  
Sonia Anand ◽  
Jemila Hamid ◽  
Lehana Thabane ◽  
Joseph Beyene

Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. In this case, the first few kernel principal components perform poorly compared to the linear principal components. Reducing dimensions using linear PCA and a logistic regression model for classification appears adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to improved classification accuracy for the outcome.
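A minimal sketch of the comparison, assuming synthetic placeholder data rather than the lung cancer gene/miRNA sets, might look like this: reduce the features with linear PCA or kernel PCA, then classify with logistic regression and compare cross-validated accuracy:

```python
# Hedged sketch of linear PCA vs kernel PCA as a pre-processing step for
# classification; synthetic data, arbitrary parameter choices.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)

linear = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
kernel = make_pipeline(KernelPCA(n_components=10, kernel="rbf", gamma=1e-3),
                       LogisticRegression(max_iter=1000))

print("linear PCA :", cross_val_score(linear, X, y, cv=5).mean())
print("kernel PCA :", cross_val_score(kernel, X, y, cv=5).mean())
```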


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Yuxian Huang ◽  
Geng Yang ◽  
Yahong Xu ◽  
Hao Zhou

In the big data era, massive, high-dimensional data are produced continuously, increasing the difficulty of analyzing and protecting them. In this paper, in order to achieve both dimensionality reduction and privacy protection, principal component analysis (PCA) and differential privacy (DP) are combined to handle these data, and a support vector machine (SVM) is used to measure the utility of the processed data. Specifically, we introduce differential privacy mechanisms at different stages of the PCA-SVM algorithm, obtaining the algorithms DPPCA-SVM and PCADP-SVM. Both algorithms satisfy (ε, 0)-DP while achieving fast classification. In addition, we evaluate the performance of the two algorithms in terms of noise expectation and classification accuracy, through both theoretical proofs and experimental verification. To verify the performance of DPPCA-SVM, we also compare it with other algorithms. Results show that DPPCA-SVM provides excellent utility on different data sets while guaranteeing stricter privacy.
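The following is an illustrative sketch of one way to combine PCA with an ε-differentially private (Laplace) perturbation of the covariance matrix before training an SVM; it is not the paper's DPPCA-SVM or PCADP-SVM algorithm, and the sensitivity bound assumes each record has L2 norm at most 1:

```python
# Sketch: perturb the sample covariance with Laplace noise, eigendecompose it,
# project onto the noisy principal directions, then train an SVM.
import numpy as np
from sklearn.svm import SVC

def dp_pca_projection(X, k, epsilon, rng):
    n, d = X.shape
    cov = (X.T @ X) / n
    per_entry = 2.0 / n                          # rough per-entry sensitivity for ||x|| <= 1
    scale = per_entry * d * (d + 1) / (2 * epsilon)
    noise = np.triu(rng.laplace(scale=scale, size=(d, d)))
    noisy_cov = cov + noise + noise.T - np.diag(np.diag(noise))   # keep it symmetric
    eigvals, eigvecs = np.linalg.eigh(noisy_cov)
    return eigvecs[:, -k:]                       # top-k noisy principal directions

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x|| <= 1
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # placeholder labels
V = dp_pca_projection(X, k=5, epsilon=1.0, rng=rng)
clf = SVC().fit(X @ V, y)
print("train accuracy:", clf.score(X @ V, y))
```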


Algorithms ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 322
Author(s):  
Yaohang Lu ◽  
Zhongming Teng

Principal component analysis (PCA) is one of the most popular tools in multivariate exploratory data analysis. Its probabilistic version (PPCA), based on the maximum likelihood procedure, provides a probabilistic way to implement dimension reduction. Recently, the bilinear PPCA (BPPCA) model, which assumes that the noise terms follow matrix variate Gaussian distributions, has been introduced to deal directly with two-dimensional (2-D) data, preserving the matrix structure of 2-D data such as images and avoiding the curse of dimensionality. However, Gaussian assumptions do not always hold in real-life applications, where data sets may contain outliers. In order to make BPPCA robust to outliers, in this paper we propose a robust BPPCA model under the assumption of matrix variate t distributions for the noise terms. The alternating expectation conditional maximization (AECM) algorithm is used to estimate the model parameters. Numerical examples on several synthetic and publicly available data sets demonstrate the superiority of our proposed model in feature extraction, classification and outlier detection.
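To illustrate only the "keep the matrix structure" idea behind bilinear models (not the probabilistic t-based model or the AECM estimator of the paper), a plain two-sided PCA on a stack of 2-D samples might be sketched as follows, with placeholder data:

```python
# Sketch: estimate row- and column-covariance factors from a stack of 2-D
# samples (e.g. images) and project each matrix from both sides.
import numpy as np

def two_sided_pca(Xs, p, q):
    """Xs: array of shape (n, r, c); return (n, p, q) reduced matrices plus factors."""
    n = Xs.shape[0]
    Xc = Xs - Xs.mean(axis=0)
    row_cov = sum(M @ M.T for M in Xc) / n        # (r, r)
    col_cov = sum(M.T @ M for M in Xc) / n        # (c, c)
    U = np.linalg.eigh(row_cov)[1][:, -p:]        # top-p row directions
    V = np.linalg.eigh(col_cov)[1][:, -q:]        # top-q column directions
    return np.array([U.T @ M @ V for M in Xc]), U, V

images = np.random.default_rng(2).normal(size=(100, 32, 32))   # placeholder "images"
Z, U, V = two_sided_pca(images, p=5, q=5)
print(Z.shape)   # (100, 5, 5)
```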


2016 ◽  
Vol 35 (2) ◽  
pp. 173-190 ◽  
Author(s):  
S. Shahid Shaukat ◽  
Toqeer Ahmed Rao ◽  
Moazzam A. Khan

In this study, we used bootstrap simulation of a real data set to investigate the impact of sample size (N = 20, 30, 40 and 50) on the eigenvalues and eigenvectors resulting from principal component analysis (PCA). For each sample size, 100 bootstrap samples were drawn from an environmental data matrix of water quality variables (p = 22) from a small data set comprising 55 samples (stations from which water samples were collected). Because data sets in ecology and the environmental sciences are invariably small, owing to the high cost of collecting and analyzing samples, we restricted our study to relatively small sample sizes. We focused on comparing the first 6 eigenvectors and the first 10 eigenvalues. Data sets were compared using agglomerative cluster analysis with Ward's method, which does not require stringent distributional assumptions.
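A minimal sketch of such a bootstrap experiment, using a random placeholder matrix in place of the real 55 × 22 water-quality data, could look like this:

```python
# Sketch: repeatedly resample N stations with replacement and record the PCA
# eigenvalues of each resample.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(55, 22))           # placeholder for the 55 x 22 matrix

def bootstrap_eigenvalues(X, sample_size, n_boot=100):
    eigvals = []
    for _ in range(n_boot):
        idx = rng.integers(0, X.shape[0], size=sample_size)   # resample with replacement
        eigvals.append(PCA().fit(X[idx]).explained_variance_)
    return np.array(eigvals)

for N in (20, 30, 40, 50):
    ev = bootstrap_eigenvalues(data, N)
    print(N, ev[:, :3].mean(axis=0))       # mean of the first three eigenvalues
```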


2009 ◽  
Vol 64 (7-8) ◽  
pp. 467-476 ◽  
Author(s):  
Changwon Suh ◽  
Slobodan Gadzuric ◽  
Marcelle Gaune-Escard ◽  
Krishna Rajan

We systematically analyze the molten salt database of Janz to gain a better understanding of the relationship between molten salts and their properties. Due to the multivariate nature of the database, the intercorrelations among the molten salts and their properties are often hidden, and defining them is challenging. Using principal component analysis (PCA), a data dimensionality reduction technique, we have effectively identified chemistry-property relationships. The various patterns in the PCA maps demonstrate that the information extracted with PCA not only captures chemistry-property relationships of molten salts, but also allows us to understand bonding characteristics and mechanisms of transport and melting, which are otherwise difficult to detect.
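An illustrative sketch of building such PCA "maps" (score and loading coordinates) from a property table is given below; the table here is a synthetic placeholder, not the Janz database:

```python
# Sketch: standardize a salts-by-properties table, then inspect scores
# (where each salt falls on the map) and loadings (how each property contributes).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
properties = rng.normal(size=(40, 6))             # rows: salts, columns: properties

X = StandardScaler().fit_transform(properties)    # scale, since the units differ
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
loadings = pca.components_.T
print(scores[:3])
print(loadings)
```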


2011 ◽  
Vol 2 (1) ◽  
Author(s):  
Deniz Akdemir ◽  
Arjun K. Gupta

Standard statistical methods applied to matrix random variables often fail to describe the underlying structure in multiway data sets. After a review of the essential background material, this paper introduces the notion of an array variate random variable. A normal array variate random variable is defined, and a method for estimating the parameters of the array variate normal distribution is given. We introduce a technique called slicing for estimating the covariance matrix of high-dimensional data. Finally, principal component analysis and classification techniques are developed for array variate observations and high-dimensional data.

