On the Fallibility of Principal Components in Research

2016 ◽  
Vol 77 (1) ◽  
pp. 165-178 ◽  
Author(s):  
Tenko Raykov ◽  
George A. Marcoulides ◽  
Tenglong Li

The measurement error in principal components extracted from a set of fallible measures is discussed and evaluated. It is shown that as long as one or more measures in a given set of observed variables contains error of measurement, so also does any principal component obtained from the set. The error variance in any principal component is shown to be (a) bounded from below by the smallest error variance in a variable from the analyzed set and (b) bounded from above by the largest error variance in a variable from that set. In the case of a unidimensional set of analyzed measures, it is pointed out that the reliability and criterion validity of any principal component are bounded from above by these respective coefficients of the optimal linear combination with maximal reliability and criterion validity (for a criterion unrelated to the error terms in the individual measures). The discussed psychometric features of principal components are illustrated on a numerical data set.
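The bounds follow because a principal component is a unit-norm weighted sum of the variables, so its error variance is a convex combination of the item error variances. A minimal simulation sketch (in Python, with illustrative loadings and error variances, not the article's numerical example) that checks the bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 0.9, 1.1, 0.8, 1.0])      # hypothetical loadings on one true score
theta = np.array([0.2, 0.4, 0.6, 0.8, 1.0])    # hypothetical error variances
n = 100_000
T = rng.normal(size=n)                          # true scores
X = T[:, None] * lam + rng.normal(size=(n, 5)) * np.sqrt(theta)

# first principal component: unit-norm eigenvector of the sample covariance matrix
w = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]

# its error variance is sum_i w_i^2 * theta_i, a convex combination of the theta_i,
# hence bounded by the smallest and largest item error variances
err_var_pc = float(w**2 @ theta)
print(theta.min() <= err_var_pc <= theta.max())  # → True
```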

2021 ◽  
Author(s):  
Amélie Fischer ◽  
Philippe Gasnier ◽  
Philippe Faverdin

ABSTRACT

Background: Improving feed efficiency has become a common target for dairy farmers, who are required to produce more milk with fewer resources. A prerequisite for improving feed efficiency is to ensure that the cows identified as most or least efficient remain so regardless of diet composition. The current research therefore analysed the ability of lactating dairy cows to maintain their feed efficiency when the energy density of the diet was changed by altering its starch and fibre concentrations. A total of 60 lactating Holstein cows, including 33 primiparous cows, were first fed a high-starch diet (diet E+P+) and then switched to a low-starch diet (diet E−P−). Near-infrared (NIR) spectroscopy was performed on each individual feed ingredient, each diet and the individual refusals to check for sorting behaviour. A principal component analysis (PCA) was performed to test whether the variability in the NIR spectra of the refusals was explained by differences in feed efficiency.

Results: The error of reproducibility of feed efficiency across diets was 2.95 MJ/d. This error was significantly larger than the errors of repeatability estimated within diet over two subsequent lactation stages, which were 2.01 MJ/d within diet E−P− and 2.40 MJ/d within diet E+P+. The coefficient of correlation of concordance (CCC) between feed efficiency estimated within diet E+P+ and within diet E−P− was 0.64. This CCC was smaller than the one observed for feed efficiency estimated within diet between two subsequent lactation stages (CCC = 0.72 within diet E+P+ and 0.85 within diet E−P−). The first two principal components of the PCA explained 90% of the total variability of the NIR spectra of the individual refusals. Feed efficiency was poorly correlated with those principal components, which suggests that feed sorting behaviour did not explain differences in feed efficiency.

Conclusions: Feed efficiency was significantly less reproducible across diets than it was repeatable within the same diet over subsequent lactation stages, but cow ranking for feed efficiency was not significantly affected by the diet change. Differences in sorting behaviour between cows were not associated with differences in feed efficiency in this trial, with either the E+P+ or the E−P− diet. These results need to be confirmed with cows fed more extreme diets (for example, roughage only) to ensure that the least and most efficient cows do not change.
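The concordance statistic reported here is Lin's coefficient, which penalizes both imperfect correlation and systematic offsets between the two efficiency estimates. A minimal Python sketch (the data are illustrative, not the trial's):

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two measurement series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, bias=True)[0, 1]
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# perfect agreement gives CCC = 1; a constant offset lowers it even though
# the Pearson correlation stays perfect
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(lins_ccc(a, a))        # → 1.0
print(lins_ccc(a, a + 1.0))  # → 0.8
```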


2018 ◽  
Vol 17 ◽  
pp. 117693511877108 ◽  
Author(s):  
Min Wang ◽  
Steven M Kornblau ◽  
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
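The automated Bayesian procedure itself lives in the PCDimension R package; as a language-neutral illustration of what an objective stopping rule looks like, here is a permutation-based (parallel-analysis-style) component count in Python. The function name, data and threshold are illustrative, not the package's API:

```python
import numpy as np

def n_significant_pcs(X, n_perm=200, alpha=0.05, seed=0):
    """Count PCs whose eigenvalue exceeds a null distribution obtained by
    independently shuffling each column (breaking inter-variable correlation)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    obs = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    null = np.empty((n_perm, X.shape[1]))
    for b in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in Xc.T])
        null[b] = np.linalg.eigvalsh(np.cov(Xp, rowvar=False))[::-1]
    thresh = np.quantile(null, 1 - alpha, axis=0)
    keep = obs > thresh
    return int(np.argmin(keep)) if not keep.all() else len(obs)

# two genuine components buried in noise
rng = np.random.default_rng(1)
scores = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10)) * 2
X = scores + rng.normal(size=(300, 10))
print(n_significant_pcs(X))
```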


Author(s):  
Janja Jerebic ◽  
Špela Kajzer ◽  
Anja Goričan ◽  
Drago Bokal

The management of fishing fleets is an important factor in the sustainable exploitation of marine organisms for human consumption, so regulatory services monitor catches and limit them based on data. In this paper, we analyse Northwest Atlantic Fisheries Organization (NAFO) data on North Atlantic catches to assess the effectiveness of fishing stakeholders. The data give fishing time (month and year), equipment, location, type of catch and, most interesting for us, fishing effort, and their quality is analysed. In the last part, a principal component analysis of the individual activities among which fishing stakeholders can choose is performed on a selected data sample. The complexity of the connections between the observed activities is explained by new, uncorrelated variables, the principal components, that are important for achieving the expected fishing catch. We find that the proportions of variance explained by the individual principal components are low, which indicates the high complexity of the topic discussed.


2017 ◽  
Vol 45 (1) ◽  
pp. 262-269 ◽  
Author(s):  
Karim ENNOURI ◽  
Rayda BEN AYED ◽  
Sezai ERCISLI ◽  
Fathi BEN AMAR ◽  
Mohamed Ali TRIKI

The olive tree (Olea europaea L.) has been cultivated for millennia in the Mediterranean basin, and its oil has been an important part of human nutrition in the region. Morphological and biological characters have been widely used for descriptive purposes and to characterize olive accessions. A comparative study of morphological characters of olive accessions grown in Tunisia was carried out and analysed using Bayesian networks (BN) and principal components analysis (PCA). The results showed that average fruit and kernel weights were 2.27 g and 0.41 g, respectively, and that a relatively moderate level of variation (51.22%) was explained by four principal components. BN revealed that geographical localisation plays a role in the increase of tree habit, lenticel size and leaf shape. A dendrogram was constructed with the aim of classifying the studied olive accessions. We propose a novel three-step analysis scheme in which the data set is first clustered, olive tree features are then evaluated and their relationships studied and highlighted, and finally the collected features are subjected to a global principal component analysis. The studied accessions can be divided into four main groups by cutting the dendrogram at a similarity value of 0.645. The results confirmed that core surface was negatively correlated with geographical location (r = −0.52, p < 0.05) and maturation period (r = −0.539, p < 0.05), the number of lenticels was positively correlated with lenticel size (r = 0.632, p < 0.05), and core shape was negatively correlated with fruit shape (r = −0.759, p < 0.05). On the basis of these findings, this research confirms that morphological markers are a preliminary tool for characterizing olive accessions.
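Cutting a dendrogram at a fixed similarity level, as done here at 0.645, amounts to choosing a height threshold in the clustering tree. A minimal SciPy sketch (random stand-in trait data, not the Tunisian accession measurements, and a four-group cut requested directly):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical morphological matrix: rows = accessions, columns = standardized traits
rng = np.random.default_rng(0)
traits = rng.normal(size=(12, 6))

# UPGMA dendrogram on Euclidean distances; requesting four main groups is the
# analogue of cutting the tree at a chosen similarity level
Z = linkage(traits, method="average", metric="euclidean")
groups = fcluster(Z, t=4, criterion="maxclust")
print(groups)
```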


2018 ◽  
Vol 48 (9) ◽  
Author(s):  
Déborah Galvão Peixôto Guedes ◽  
Maria Norma Ribeiro ◽  
Francisco Fernando Ramos de Carvalho

ABSTRACT: This study aimed to use the multivariate techniques of principal component analysis and canonical discriminant analysis on a data set from Morada Nova sheep carcasses to reduce the dimensions of the original data set, identify the variables with the best discriminatory power among the treatments, and quantify the association between biometric and performance traits. The principal components were efficient in reducing the 19 correlated original variables to five linear combinations, which explained 80% of the total variation present in the original variables. The first two principal components together accounted for 56.12% of the total variation of the evaluated variables. Eight variables were selected using the stepwise method. The first three canonical variables were significant, explaining 92.25% of the total variation. The first canonical variable showed a canonical correlation coefficient of 0.94, indicating a strong association between biometric traits and animal performance. Slaughter weight and hind width were selected because, based on the standard canonical coefficients, these variables presented the highest discriminatory power among the treatments.
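The same two-step workflow (PCA for dimension reduction, then canonical discriminant analysis across treatments) can be sketched with scikit-learn, where linear discriminant analysis plays the role of canonical discriminant analysis. All data and sizes below are illustrative stand-ins, not the Morada Nova measurements:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# hypothetical stand-in: 19 correlated traits measured under 4 treatments
rng = np.random.default_rng(0)
n, p, g = 80, 19, 4
y = np.repeat(np.arange(g), n // g)
X = rng.normal(size=(n, p)) + y[:, None] * rng.normal(size=p) * 0.5

# step 1: how few principal components carry ~80% of the total variation
pca = PCA().fit(X)
n80 = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.80)) + 1

# step 2: canonical discriminant analysis between treatments;
# with 4 groups there are at most 3 canonical variables
cda = LinearDiscriminantAnalysis(n_components=g - 1).fit(X, y)
print(n80, cda.explained_variance_ratio_.round(3))
```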


2017 ◽  
Vol 78 (6) ◽  
pp. 1108-1122
Author(s):  
Yuanshu Fu ◽  
Zhonglin Wen ◽  
Yang Wang

The maximal reliability of a congeneric measure is achieved by weighting item scores to form the optimal linear combination as the total score; it is never lower than the composite reliability of the measure when measurement errors are uncorrelated. The statistical method that renders maximal reliability would also lead to maximal criterion validity. Using a career satisfaction measure as an example, the present article calculated the maximal reliability and maximal criterion validity and compared them with the composite reliability and the scale criterion validity, respectively. The improvement of reliability and validity indicated that the optimal linear combination is preferred when forming a total score of a measure. The Mplus codes for analyzing maximal reliability, maximal criterion validity, and related parameters are provided.
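For a congeneric measure with loadings λᵢ and uncorrelated error variances θᵢ, the composite reliability of the unit-weighted sum is ω = (Σλᵢ)² / ((Σλᵢ)² + Σθᵢ), while the optimal weights wᵢ ∝ λᵢ/θᵢ yield the maximal reliability H = Σ(λᵢ²/θᵢ) / (1 + Σ(λᵢ²/θᵢ)). The article provides Mplus code; a numerical sketch of these two formulas in Python (the loadings are illustrative, not the career satisfaction estimates):

```python
import numpy as np

# hypothetical standardized loadings for a four-item congeneric measure
lam = np.array([0.8, 0.7, 0.6, 0.5])
theta = 1 - lam**2                          # error variances of standardized items

# composite (unit-weighted) reliability, omega
omega = lam.sum() ** 2 / (lam.sum() ** 2 + theta.sum())

# maximal reliability (coefficient H), attained with weights w_i ∝ lam_i / theta_i
h = (lam**2 / theta).sum()
max_rel = h / (1 + h)

print(round(omega, 3), round(max_rel, 3))   # → 0.749 0.784
assert max_rel >= omega                     # optimal weighting never loses reliability
```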


2014 ◽  
Vol 80 (19) ◽  
pp. 6062-6072 ◽  
Author(s):  
Cresten B. Mansfeldt ◽  
Annette R. Rowe ◽  
Gretchen L. W. Heavner ◽  
Stephen H. Zinder ◽  
Ruth E. Richardson

ABSTRACT: A cDNA microarray was designed and used to monitor the transcriptomic profile of Dehalococcoides mccartyi strain 195 (in a mixed community) respiring various chlorinated organics, including chloroethenes and 2,3-dichlorophenol. The cultures were continuously fed in order to establish steady-state respiration rates and substrate levels. The organization of the array data into a clustered heat map revealed two major experimental partitions. This partitioning in the data set was further explored through principal component analysis. The first two principal components separated the experiments into those with slow- (1.6 ± 0.6 μM Cl−/h) and fast- (22.9 ± 9.6 μM Cl−/h) respiring cultures. Additionally, the transcripts with the highest loadings in these principal components were identified, suggesting that those transcripts were responsible for the partitioning of the experiments. By analyzing the transcriptomes (n = 53) across experiments, relationships among transcripts were identified, and hypotheses about the relationships between electron transport chain members were proposed. One hypothesis, that the hydrogenases Hup and Hym and the formate dehydrogenase-like oxidoreductase (DET0186-DET0187) form a complex (as displayed by their tight clustering in the heat map analysis), was explored using a nondenaturing protein separation technique combined with proteomic sequencing. Although these proteins did not migrate as a single complex, DET0112 (an FdhB-like protein encoded in the Hup operon) was found to comigrate with DET0187 rather than with the catalytic Hup subunit DET0110. On closer inspection of the genome annotations of all Dehalococcoides strains, the DET0185-to-DET0187 operon was found to lack a key subunit, an FdhB-like protein. Therefore, on the basis of the transcriptomic, genomic, and proteomic evidence, the place of the missing subunit in the DET0185-to-DET0187 operon is likely filled by recruiting a subunit expressed from the Hup operon (DET0112).
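Pulling out the transcripts with the highest loadings on a principal component is a one-liner once the PCA is fit; a hedged Python sketch on random stand-in data (the study's 53 transcriptomes, but an arbitrary transcript count and no real expression values):

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical stand-in: 53 transcriptomes x 1500 transcripts
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 1500))

pca = PCA(n_components=2).fit(X)
# indices of the 10 transcripts with the largest absolute loading on PC1,
# i.e. the candidates driving the partitioning of the experiments
top = np.argsort(np.abs(pca.components_[0]))[::-1][:10]
print(top)
```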


Author(s):  
Shofiqul Islam ◽  
Sonia Anand ◽  
Jemila Hamid ◽  
Lehana Thabane ◽  
Joseph Beyene

Abstract: Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. The method relies on a linearity assumption, which often fails to capture the patterns and relationships inherent in the data; a nonlinear approach such as kernel PCA might therefore be preferable. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. In this case, the first few kernel principal components perform poorly compared with the linear principal components. Reducing dimensions using linear PCA together with a logistic regression model for classification appears to be adequate for this purpose. Integrating information from multiple data sets using either of the two approaches leads to improved classification accuracy for the outcome.
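The comparison in this abstract can be sketched with scikit-learn: reduce with linear PCA or RBF-kernel PCA, then classify with logistic regression under cross-validation. Everything below (data, component count, kernel choice) is an illustrative stand-in, not the authors' copula-based simulation:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# hypothetical stand-in for an expression matrix: 100 samples x 500 features
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 500)) + y[:, None] * 0.3   # weak class signal

for reducer in (PCA(n_components=5), KernelPCA(n_components=5, kernel="rbf")):
    Z = reducer.fit_transform(X)
    acc = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5).mean()
    print(type(reducer).__name__, round(acc, 2))
```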


2000 ◽  
Vol 88 (4) ◽  
pp. 1431-1437 ◽  
Author(s):  
Giovanna Zimatore ◽  
Alessandro Giuliani ◽  
Calogero Parlapiano ◽  
Giorgio Grisanti ◽  
Alfredo Colosimo

Click-evoked otoacoustic emissions (CEOAEs) were studied by means of recurrence quantification analysis (RQA) and were found to carry a substantial amount of deterministic structure. This structure showed a highly significant correlation with the clinical evaluation of the signal over a data set of 56 signals. Moreover, 1) one of the RQA variables, Trend, was very sensitive to phase transitions in the dynamical regime of CEOAEs, and 2) appropriate use of principal component analysis proved able to isolate the individual character of the studied signals. These results are of general interest for the study of auditory signal transduction and generation mechanisms.


2019 ◽  
Author(s):  
V.M. Efimov ◽  
K.V. Efimov ◽  
V.Y. Kovaleva

In the 1940s, Karhunen and Loève proposed a method for processing a one-dimensional numerical time series by converting it into a multidimensional one through shifts; in effect, the one-dimensional series is decomposed into several orthogonal time series. This method has been independently developed and applied in practice many times under various names (EOF, SSA, Caterpillar, etc.); nowadays the name SSA (Singular Spectrum Analysis) is used most often. The method turned out to be universal: it is applicable to any time series without requiring stationarity assumptions, and it automatically decomposes a time series into a trend, cyclic components and noise. By the beginning of the 1980s, Takens had shown that for a dynamical system such a method makes it possible to reconstruct an attractor from observations of only one of its variables, thereby giving the method a powerful theoretical basis. In the same years the practical benefits of phase portraits became clear; in particular, they were used in the analysis and forecasting of animal abundance dynamics.

In this paper we propose to extend SSA to one-dimensional sequences of elements of any type, including numbers, symbols, figures, etc., and, as a special case, to molecular sequences. Technically, the problem is solved by almost the same algorithm as SSA. The sequence is cut by a sliding window into fragments of a given length, and the matrix of Euclidean distances between all fragments is calculated. This is always possible: for example, the square root of the Hamming distance between fragments is a Euclidean distance. For the resulting matrix, principal components are calculated by the principal coordinate method (PCo). Instead of a distance matrix, one can use a matrix of any similarity/dissimilarity indices and apply methods of multidimensional scaling (MDS); the result will always be principal components in some Euclidean space.

We call this method PCA-Seq. It is certainly an exploratory method, as is its special case SSA. For any sequence, including molecular ones, PCA-Seq allows one, without any additional assumptions, to obtain its principal components in numerical form and to visualize them as phase portraits. Long experience with SSA applied to numerical data gives every reason to believe that PCA-Seq will be no less useful in the analysis of non-numerical data, especially for generating hypotheses. PCA-Seq is implemented in the freely distributed Jacobi 4 package (http://mrherrn.github.io/JACOBI4/).
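The algorithm described above, sliding-window fragments, a Euclidean distance matrix via the square root of the Hamming distance, then principal coordinates, can be sketched compactly. A hypothetical Python implementation (the function name and test sequence are ours, not the Jacobi 4 code):

```python
import numpy as np

def pca_seq(seq, window):
    """Sketch of PCA-Seq: cut a symbolic sequence into overlapping fragments,
    treat Hamming distances as squared Euclidean distances (their square roots
    are Euclidean), and embed with classical MDS (principal coordinates)."""
    frags = [seq[i:i + window] for i in range(len(seq) - window + 1)]
    m = len(frags)
    D2 = np.array([[sum(a != b for a, b in zip(f, g)) for g in frags]
                   for f in frags], dtype=float)
    # double-centre: B = -1/2 * J @ D2 @ J, then eigendecompose for coordinates
    J = np.eye(m) - np.ones((m, m)) / m
    vals, vecs = np.linalg.eigh(-0.5 * J @ D2 @ J)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10                       # positive-eigenvalue axes only
    return vecs[:, keep] * np.sqrt(vals[keep])

coords = pca_seq("ACGTACGTTGCAACGT", window=4)
print(coords.shape)   # one point per fragment; plot pairs of columns as phase portraits
```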

