Efficient PCA-Exploration of High-Dimensional Datasets

2020 ◽  
Author(s):  
Oxana Ye. Rodionova ◽  
Sergey Kucheryavskiy ◽  
Alexey L. Pomerantsev

Basic tools for the exploration and interpretation of Principal Component Analysis (PCA) results are well known and thoroughly described in many comprehensive tutorials. However, several new tools have been developed in the past decade. Some of them were originally created for solving authentication and classification tasks. In this paper we demonstrate that they can also be useful for exploratory data analysis.

We discuss several important aspects of the PCA exploration of high-dimensional datasets, such as the estimation of a proper complexity of the PCA model, the dependence on the data structure, the presence of outliers, etc. We introduce new tools for the assessment of PCA model complexity, such as plots of the degrees of freedom for the orthogonal and score distances, as well as the Extreme and Distance plots, which present a new look at the features of the training and test (new) data. These tools are simple and fast to compute, and in some cases they are more efficient than the conventional PCA tools. A simulated example provides an intuitive illustration of their application. Three real-world examples originating from various fields are employed to demonstrate the capabilities of the new tools and the ways they can be used. The first example considers the reproducibility of a handheld spectrometer using a dataset that is presented for the first time. The other two datasets, which describe the authentication of olives in brine and the classification of wines by their geographical origin, are already known and are often used for illustrative purposes.

The paper does not touch upon well-known topics, such as the algorithms for the PCA decomposition or the interpretation of scores and loadings. Instead, we focus primarily on more advanced topics, such as the exploration of data homogeneity and the understanding and evaluation of optimal model complexity. The examples are accompanied by links to free software that implements the tools.
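To make the distance-based diagnostics mentioned above more concrete, here is a minimal sketch, on synthetic data, of how the score distance (SD) and orthogonal distance (OD) of each sample can be computed from a fitted PCA model. It is not the authors' implementation; the data and the number of components are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation): score distance (SD) and
# orthogonal distance (OD) from a fitted PCA model, the two quantities that
# distance-based PCA diagnostics build on.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # illustrative high-dimensional data
X = X - X.mean(axis=0)                  # mean-centre (PCA also centres internally)

n_components = 3                        # assumed model complexity
pca = PCA(n_components=n_components).fit(X)
T = pca.transform(X)                    # scores
X_hat = pca.inverse_transform(T)        # reconstruction from the PCA subspace

# Score distance: leverage-like Mahalanobis distance within the score space.
sd = np.sum(T**2 / pca.explained_variance_, axis=1)
# Orthogonal distance: squared residual distance to the PCA subspace.
od = np.sum((X - X_hat)**2, axis=1)

# Samples with unusually large SD and/or OD are candidate outliers; plotting
# OD against SD gives a simple distance plot for the training data.
print(sd[:5], od[:5])
```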


Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 317 ◽  
Author(s):  
Vincenzo Dentamaro ◽  
Donato Impedovo ◽  
Giuseppe Pirlo

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as the classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known "curse of dimensionality" problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. The method is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
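LICIC itself is not reproduced below; the sketch only illustrates, on assumed synthetic data, the kind of baseline comparison the abstract describes, using the reference oversamplers shipped with the imbalanced-learn package and macro-averaged F1 as the metric.

```python
# Illustrative baseline comparison only (LICIC is not implemented here):
# oversample an imbalanced, multiclass, high-dimensional training set with the
# standard imbalanced-learn methods and compare macro-F1 on a held-out split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, ADASYN

# Synthetic stand-in for gene-expression / MALDI-TOF data.
X, y = make_classification(n_samples=600, n_features=500, n_informative=30,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = [("none", None),
            ("ROS", RandomOverSampler(random_state=0)),
            ("SMOTE", SMOTE(random_state=0)),
            ("Borderline-SMOTE", BorderlineSMOTE(random_state=0)),
            ("ADASYN", ADASYN(random_state=0))]

for name, sampler in samplers:
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = LinearSVC(max_iter=5000).fit(Xr, yr)
    print(name, f1_score(y_te, clf.predict(X_te), average="macro"))
```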


2016 ◽  
Vol 29 (8) ◽  
pp. 3049-3056 ◽  
Author(s):  
Daniel S. Wilks

Principal component analysis (PCA), also known as empirical orthogonal function (EOF) analysis, is widely used for compression of high-dimensional datasets in such applications as climate diagnostics and seasonal forecasting. A critical question when using this method is the number of modes, representing meaningful signal, to retain. The resampling-based “Rule N” method attempts to address the question of PCA truncation in a statistically principled manner. However, it is only valid for the leading (largest) eigenvalue, because it fails to condition the hypothesis tests for subsequent (smaller) eigenvalues on the results of previous tests. This paper draws on several relatively recent statistical results to construct a hypothesis-test-based truncation rule that accounts at each stage for the magnitudes of the larger eigenvalues. The performance of the method is demonstrated in an artificial data setting and illustrated with a real-data example.
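For readers unfamiliar with Rule N, the following minimal sketch (synthetic data, not the improved sequential test proposed in the paper) shows the basic resampling idea: observed eigenvalue fractions are compared against their null distribution under uncorrelated noise of the same size.

```python
# Basic resampling-based "Rule N" sketch: retain modes whose fraction of
# explained variance exceeds the 95th percentile of the corresponding fraction
# obtained from uncorrelated Gaussian data of the same dimensions.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.normal(size=(n, p))
X[:, 0] += 3 * rng.normal(size=n)        # inject one strong mode for illustration

def eig_fractions(data):
    data = data - data.mean(axis=0)
    vals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
    return vals / vals.sum()

obs = eig_fractions(X)

# Null distribution from many resamples of pure noise.
null = np.array([eig_fractions(rng.normal(size=(n, p))) for _ in range(500)])
crit = np.percentile(null, 95, axis=0)    # 95th percentile for each eigenvalue rank

retain = np.where(obs > crit)[0]
print("modes exceeding the Rule N threshold:", retain + 1)
# As the paper notes, this simple rule is only strictly valid for the leading
# eigenvalue; tests for later eigenvalues should be conditioned on earlier ones.
```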


PLoS ONE ◽  
2021 ◽  
Vol 16 (1) ◽  
pp. e0246039
Author(s):  
Shilan S. Hameed ◽  
Rohayanti Hassan ◽  
Wan Haslina Hassan ◽  
Fahmi F. Muhammadsharif ◽  
Liza Abdul Latiff

The selection and classification of genes is essential for identifying the genes related to a specific disease. Developing a user-friendly application that combines statistical rigor with machine learning functionality to help biomedical researchers and end users is therefore of great importance. In this work, a novel stand-alone application with a graphical user interface (GUI) is developed to perform the full workflow of gene selection and classification in high-dimensional datasets. The application, called HDG-select, is validated on eleven high-dimensional datasets in CSV and GEO SOFT formats. The tool uses a combined filter-GBPSO-SVM algorithm and has been made freely available to users. HDG-select was found to outperform other tools reported in the literature and to offer competitive performance, accessibility, and functionality.
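The GBPSO wrapper stage of HDG-select is not reproduced here; the sketch below only illustrates, with scikit-learn and synthetic data, the surrounding filter-then-SVM pattern that the combined filter-GBPSO-SVM algorithm builds on.

```python
# Illustrative filter + SVM pipeline on a gene-expression-like matrix (many
# features, few samples). In HDG-select the filtered gene subset would be
# further refined by GBPSO before the SVM stage.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Stand-in for a CSV / GEO SOFT expression matrix.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)

model = make_pipeline(SelectKBest(f_classif, k=50),   # univariate filter stage
                      SVC(kernel="linear"))           # SVM classifier
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```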


2014 ◽  
pp. 32-42
Author(s):  
Matthieu Voiry ◽  
Kurosh Madani ◽  
Véronique Amarger ◽  
Joël Bernier

A major step in the diagnosis of faults on high-quality optical surfaces is the characterization of scratch and dig defects in products. This challenging operation is important because it is directly linked to the quality of the produced optical component. A classification phase is mandatory to complete the diagnosis of optical devices, since a number of correctable defects are usually present beside the potentially “abiding” ones. Unfortunately, the relevant data extracted from raw images during the defect-detection phase are high dimensional. This can have a harmful effect on the behavior of the artificial neural networks that are well suited to performing such a challenging classification. Reducing the data to a smaller dimension can decrease the problems related to high dimensionality. In this paper we compare different dimensionality-reduction techniques and evaluate their impact on classification performance.
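As a hedged illustration of such a comparison, the sketch below applies several off-the-shelf dimensionality-reduction techniques before a small neural-network classifier on synthetic data; the specific reducers and dimensions are assumptions, not those evaluated in the paper.

```python
# Compare several dimensionality-reduction techniques as a preprocessing step
# for a small neural-network classifier, on synthetic stand-in defect features.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=100, n_informative=15,
                           random_state=0)

reducers = {"PCA": PCA(n_components=10),
            "KernelPCA": KernelPCA(n_components=10, kernel="rbf"),
            "Isomap": Isomap(n_components=10)}

for name, red in reducers.items():
    clf = make_pipeline(red, MLPClassifier(hidden_layer_sizes=(32,),
                                           max_iter=1000, random_state=0))
    print(name, cross_val_score(clf, X, y, cv=3).mean())
```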


2014 ◽  
Vol 11 (97) ◽  
pp. 20140353 ◽  
Author(s):  
N. König ◽  
W. R. Taylor ◽  
G. Armbrecht ◽  
R. Dietzel ◽  
N. B. Singh

Falls remain a challenge for ageing societies. Strong evidence indicates that a previous fall is the strongest single screening indicator for a subsequent fall; the ability to assess fall risk without relying on fall history is therefore imperative. Testing in three functional domains (a total of 92 measures) was completed in 84 older women (60–85 years of age), covering muscular control, standing balance, and the mean and variability of gait. Participants were retrospectively classified as fallers (n = 38) or non-fallers (n = 42), and additionally in a prospective manner to identify first-time fallers (FTFs) (n = 6) within a 12-month follow-up period. Principal component analysis revealed that seven components derived from the 92 functional measures are sufficient to depict the spectrum of functional performance. Inclusion of only three components, related to the mean and temporal variability of walking, allowed classification of fallers and non-fallers with a sensitivity and specificity of 74% and 76%, respectively. Furthermore, the results indicate that FTFs show a tendency towards the performance of fallers, even before their first fall occurs. This study suggests that the temporal variability and mean spatial parameters of gait are the only functional components among the 92 measures tested that differentiate fallers from non-fallers, and they could therefore show efficacy in clinical screening programmes for assessing the risk of first-time falling.
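A hypothetical sketch of the analysis pattern, on random stand-in data rather than the study's measurements: reduce a wide table of functional measures with PCA, classify fallers versus non-fallers from a few components, and report sensitivity and specificity.

```python
# Hypothetical sketch (synthetic data, not the study's measures): PCA reduction
# of many functional measures, then a cross-validated faller/non-faller
# classifier evaluated by sensitivity and specificity.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 92))                  # 80 participants, 92 measures
y = rng.integers(0, 2, size=80)                # 1 = faller, 0 = non-faller (random here)

scores = PCA(n_components=3).fit_transform(X)  # three retained components
y_pred = cross_val_predict(LogisticRegression(), scores, y, cv=5)

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```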


2021 ◽  
Vol 2 ◽  
Author(s):  
Giuseppe D’Alessio ◽  
Alberto Cuoci ◽  
Alessandro Parente

The integration of Artificial Neural Networks (ANNs) and Feature Extraction (FE) in the context of the Sample-Partitioning Adaptive Reduced Chemistry approach was investigated in this work, to increase the on-the-fly classification accuracy for very large thermochemical states. The proposed methodology was first compared with an on-the-fly classifier based on the Principal Component Analysis reconstruction error, as well as with a standard ANN (s-ANN) classifier operating on the full thermochemical space, for the adaptive simulation of a steady laminar flame fed with a nitrogen-diluted stream of n-heptane in air. The numerical simulations were carried out with a kinetic mechanism accounting for 172 species and 6,067 reactions, which includes the chemistry of Polycyclic Aromatic Hydrocarbons (PAHs) up to C₂₀. Among all the aforementioned classifiers, the one exploiting the combination of an FE step with an ANN proved to be the most efficient for the classification of high-dimensional spaces, leading to a higher speed-up factor and a higher accuracy of the adaptive simulation in the description of the PAH and soot-precursor chemistry. Finally, the investigation of the classifiers' performance was extended to flames with boundary conditions different from the training one, obtained by imposing a higher Reynolds number or time-dependent sinusoidal perturbations. Satisfactory results were observed for all the test flames.
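The sketch below is not the SPARC implementation; it only mirrors, on synthetic data, the comparison described in the abstract between an ANN trained on the full feature space (s-ANN) and a feature-extraction step followed by an ANN (here PCA + MLP).

```python
# Illustrative comparison of a classifier on the full feature space versus one
# preceded by a feature-extraction (PCA) step, on a synthetic stand-in for
# high-dimensional thermochemical state vectors (172 "species").
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=172, n_informative=25,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

full_ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
fe_ann = make_pipeline(PCA(n_components=15),
                       MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                                     random_state=0))

print("s-ANN  :", cross_val_score(full_ann, X, y, cv=3).mean())
print("FE+ANN :", cross_val_score(fe_ann, X, y, cv=3).mean())
```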


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Mujtaba Husnain ◽  
Malik Muhammad Saad Missen ◽  
Shahzad Mumtaz ◽  
Dost Muhammad Khan ◽  
Mickaël Coustaty ◽  
...  

In this paper, we make use of the 2-dimensional data obtained through t-distributed Stochastic Neighbor Embedding (t-SNE) applied to high-dimensional data of Urdu handwritten characters and numerals. The instances of the dataset used for the experimental work are classified into multiple classes based on shape similarity. We performed three tasks in order: (i) we generated a state-of-the-art dataset of both Urdu handwritten characters and numerals by inviting a number of native Urdu-speaking participants from different social and academic groups, since no publicly available dataset of this type existed to date; (ii) we applied classical dimensionality-reduction and data-visualization approaches, namely Principal Component Analysis (PCA) and Autoencoders (AE), in comparison with t-SNE; and (iii) we used the reduced dimensions obtained through PCA, AE, and t-SNE for the recognition of Urdu handwritten characters and numerals with a Convolutional Neural Network (CNN). The recognition accuracy achieved for Urdu characters and numerals compares favorably with other approaches to the same task. The novelty lies in the fact that the resulting reduced dimensions are used for the first time for the recognition of Urdu handwritten text at the character level, instead of using the whole multidimensional data. This results in less computation time at the same accuracy when compared with recognition approaches that process the full-dimensional data.
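Since the Urdu dataset itself is not reproduced here, the following sketch uses scikit-learn's digits data as a stand-in and a small MLP in place of the paper's CNN; it only illustrates the pattern of classifying on PCA- and t-SNE-reduced representations.

```python
# Hypothetical sketch: digits stands in for the Urdu character images and an
# MLP stands in for the CNN. The point is the pattern only: reduce the
# dimensionality (PCA / t-SNE), then classify on the reduced representation.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)            # 64-dimensional images
X_pca = PCA(n_components=16, random_state=0).fit_transform(X)
# t-SNE has no out-of-sample transform, so the 2-D embedding is computed on the
# full set before splitting (as using t-SNE features for recognition requires).
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

for name, Xr in [("PCA-16", X_pca), ("t-SNE-2", X_tsne)]:
    Xtr, Xte, ytr, yte = train_test_split(Xr, y, random_state=0, stratify=y)
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                        random_state=0).fit(Xtr, ytr)
    print(name, accuracy_score(yte, clf.predict(Xte)))
```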


2014 ◽  
Vol 24 (1) ◽  
pp. 123-131
Author(s):  
Simon Gangl ◽  
Domen Mongus ◽  
Borut Žalik

Systems based on principal component analysis have developed from exploratory data analysis in the past to current data-processing applications which encode and decode vectors of data using a changing projection space (eigenspace). The linear systems that need to be solved to obtain a constantly updated eigenspace have increased significantly in their dimensions during this evolution. The basic scheme used for updating the eigenspace, however, has remained essentially the same: (re)computing the eigenspace whenever the error exceeds a predefined threshold. In this paper we propose a computationally efficient eigenspace updating scheme, which specifically supports high-dimensional systems from any domain. The key principle is a prior selection of the vectors used to update the eigenspace in combination with an optimized eigenspace computation. The presented theoretical analysis proves the superior reconstruction capability of the introduced scheme, and further provides an estimate of the achievable compression ratios.
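The selective updating scheme proposed in the paper is not reproduced below; the sketch only makes the setting concrete using the standard incremental alternative available in scikit-learn (IncrementalPCA), with the reconstruction error monitored per chunk as in the conventional threshold-based scheme.

```python
# Not the proposed scheme: a standard incremental eigenspace update with
# scikit-learn's IncrementalPCA, monitoring the reconstruction error of each
# incoming chunk of high-dimensional vectors.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
stream = rng.normal(size=(2000, 300))          # simulated stream of data vectors

ipca = IncrementalPCA(n_components=10)
for chunk in np.array_split(stream, 20):       # process the stream chunk by chunk
    ipca.partial_fit(chunk)                    # update the eigenspace
    recon = ipca.inverse_transform(ipca.transform(chunk))
    err = np.mean(np.sum((chunk - recon) ** 2, axis=1))
    # A threshold on 'err' could trigger a full recomputation, as in the
    # conventional scheme the paper sets out to improve upon.
print("final per-chunk reconstruction error:", err)
```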


2014 ◽  
Vol 602-605 ◽  
pp. 3867-3870 ◽  
Author(s):  
Xiao Hong Wu ◽  
Sheng Wei Qiu ◽  
Xiang Li ◽  
Bin Wu ◽  
Min Li ◽  
...  

Pork storage time is related to freshness, which influences pork quality. To achieve rapid and effective discrimination of pork storage time, near-infrared spectroscopy was used to collect near-infrared reflectance (NIR) spectra of pork at different storage times. The high-dimensional NIR spectra were first compressed by principal component analysis (PCA) and then classified by fuzzy learning vector quantization (FLVQ). PCA plus FLVQ is a completely unsupervised learning algorithm which finds hidden patterns in unlabeled data. Experimental results showed that PCA plus FLVQ could classify pork NIR spectra effectively.
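Fuzzy learning vector quantization is not available in common Python libraries, so the sketch below substitutes plain k-means after the PCA compression step; the synthetic "spectra" and the number of components are assumptions for illustration only.

```python
# Hedged stand-in: FLVQ replaced by k-means. The sketch only mirrors the
# pipeline shape -- compress spectra with PCA, then cluster the scores without
# using labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "NIR spectra": three storage-time groups with slightly shifted baselines.
base = np.linspace(0, 1, 700)
X = np.vstack([base * k + 0.05 * rng.normal(size=(40, 700)) for k in (1.0, 1.1, 1.2)])

scores = PCA(n_components=5).fit_transform(X)                   # compress 700 -> 5
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print("cluster sizes:", np.bincount(labels))
```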

