Improved One-Class Modeling of High-Dimensional Metabolomics Data via Eigenvalue-Shrinkage

Metabolites ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 237
Author(s):  
Alberto Brini ◽  
Vahe Avagyan ◽  
Ric C. H. de Vos ◽  
Jack H. Vossen ◽  
Edwin R. van den Heuvel ◽  
...  

One-class modeling is a useful approach in metabolomics for the untargeted detection of abnormal metabolite profiles, when information from a set of reference observations is available to model “normal” or baseline metabolite profiles. Such outlying profiles are typically identified by comparing the distance between an observation and the reference class to a critical limit. Often, multivariate distance measures such as the Mahalanobis distance (MD) or principal component-based measures are used. These approaches, however, are either not applicable to untargeted metabolomics data, or their results are unreliable. In this paper, five distance measures for one-class modeling of untargeted metabolomics data are proposed. They are based on a combination of the MD and five so-called eigenvalue-shrinkage estimators of the covariance matrix of the reference class. A simple cross-validation procedure is proposed to set the critical limit for outlier detection. Simulation studies are used to identify which distance measure provides the best performance for one-class modeling, in terms of type I error and power to identify abnormal metabolite profiles. Empirical evidence demonstrates that this method has a better type I error (false positive rate) and higher outlier detection power than the standard (principal component-based) one-class models. The method is illustrated by its application to liquid chromatography coupled to mass spectrometry (LC-MS) and nuclear magnetic resonance (NMR) spectroscopy untargeted metabolomics data from two studies on food safety assessment and diagnosis of rare diseases, respectively.
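
The core of the approach can be sketched in a few lines. The sketch below is not the authors' implementation: Ledoit-Wolf shrinkage stands in for the five eigenvalue-shrinkage estimators studied in the paper, the K-fold quantile rule is one simple reading of the proposed cross-validation limit, and the data are simulated.

```python
# Hedged sketch: one-class outlier screening with a shrinkage-based Mahalanobis
# distance. Ledoit-Wolf shrinkage is used as one example of an eigenvalue-shrinkage
# covariance estimator; the K-fold quantile rule is one simple way to set the limit.
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.model_selection import KFold

def cv_critical_limit(X_ref, alpha=0.05, n_splits=5, seed=0):
    """Critical limit = (1 - alpha) quantile of held-out reference distances."""
    held_out = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X_ref):
        est = LedoitWolf().fit(X_ref[train])
        held_out.append(est.mahalanobis(X_ref[test]))    # squared distances
    return np.quantile(np.concatenate(held_out), 1 - alpha)

def flag_outliers(X_ref, X_new, alpha=0.05):
    limit = cv_critical_limit(X_ref, alpha)
    est = LedoitWolf().fit(X_ref)
    return est.mahalanobis(X_new) > limit                # True = abnormal profile

rng = np.random.default_rng(1)
X_ref = rng.normal(size=(60, 200))                       # reference profiles, p >> n
X_new = np.vstack([rng.normal(size=(5, 200)),            # 5 "normal" test profiles
                   rng.normal(3.0, 1.0, size=(5, 200))]) # 5 shifted, abnormal profiles
print(flag_outliers(X_ref, X_new))
```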

2020 ◽  
Author(s):  
Piyi Xing ◽  
Zhenqiao Song ◽  
Xingfeng Li

Abstract: Wheatgrass has emerged as a functional food source in recent years, but the detailed metabolomic basis for its health benefits remains poorly understood. In this study, liquid chromatography-mass spectrometry (LC-MS) analysis was used to study the metabolic profiles of seedlings from wheat, barley, rye and triticale, revealing 1800 features in positive mode and 4303 features in negative mode. Principal component analysis (PCA) showed clear differences between species, and 164 differentially expressed metabolites (DEMs) were detected, including amino acids, organic acids, lipids, fatty acids, nucleic acids, flavonoids, amines, polyamines, vitamins, sugar derivatives and others. Unique metabolites in each species were identified. This study provides a glimpse into the metabolomic profiles of wheat and its wild relatives, which may form an important basis for assessing nutrition, health and other parameters.
Practical Application: This manuscript presents liquid chromatography-mass spectrometry (LC-MS) results for young sprouts of common wheat and its relatives. Our results may help to better understand the natural variation due to genotype before metabolomics data are considered for application to wheatgrass, and can provide a basis for assessing its potential pharmaceutical and nutritional value.
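
A typical way to get from an LC-MS feature table to the PCA view described here is sketched below; the feature matrix, species labels and preprocessing choices (log transform, autoscaling) are illustrative placeholders, not the study's data or pipeline.

```python
# Hedged sketch: standard PCA workflow for an untargeted LC-MS feature table
# (samples x features). All values and labels here are simulated placeholders.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_per_species, n_features = 6, 500
species = np.repeat(["wheat", "barley", "rye", "triticale"], n_per_species)
X = rng.lognormal(mean=5.0, sigma=1.0, size=(len(species), n_features))
X[species == "barley", :50] *= 3.0                 # fake species-specific intensity shift

X_log = np.log2(X + 1.0)                           # variance-stabilising log transform
X_auto = StandardScaler().fit_transform(X_log)     # autoscaling (mean 0, unit variance)

pca = PCA(n_components=2).fit(X_auto)
scores = pca.transform(X_auto)
print("explained variance:", pca.explained_variance_ratio_.round(2))
for s in np.unique(species):
    print(s, scores[species == s, :].mean(axis=0).round(2))   # species centroids in PC space
```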


2019 ◽  
Vol 20 (4) ◽  
pp. 818 ◽  
Author(s):  
Bowen Yan ◽  
Faizan Sadiq ◽  
Yijie Cai ◽  
Daming Fan ◽  
Hao Zhang ◽  
...  

Untargeted metabolomics is a valuable tool to analyze metabolite profiles or aroma fingerprints of different food products. However, less attention has been paid to determining the aroma characteristics of Chinese steamed breads (CSBs) using this approach. The aim of this work was to evaluate the key aroma compounds and their potential generation pathways in Chinese steamed bread produced with type I sourdough by a metabolomics approach. Based on the aroma characteristics analysis, CSBs produced with type I sourdough and with baker’s yeast were clearly distinguishable in the principal component analysis (PCA) score plot. A total of 13 compounds in sourdough-based steamed breads were identified as discriminant markers through the untargeted metabolomics analysis. According to the odor activity values (OAVs) of the discriminant aroma markers, ethyl acetate (fruity), ethyl lactate (caramel-like), hexyl acetate (fruity), (E)-2-nonenal (fatty) and 2-pentylfuran (fruity) were validated as the key volatile compounds in the breads produced with type I sourdough, as compared to the baker’s yeast-leavened steamed bread. The metabolite analysis of the proofed dough indicated that esters are mainly generated by the reaction between acids and alcohols during steaming, and that aldehydes are derived from the oxidation of palmitoleic acid and linoleic acid during proofing and steaming.
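
The OAV screening step mentioned here reduces to a simple ratio. The sketch below uses placeholder concentrations and odor thresholds (not values from this study) to show the calculation and the usual OAV > 1 cut-off.

```python
# Hedged sketch: odor activity value (OAV) = concentration / odor threshold;
# compounds with OAV > 1 are usually taken as aroma-active. All numbers below
# are illustrative placeholders, not measurements from the study.
thresholds_ug_kg = {"ethyl acetate": 5.0, "hexyl acetate": 2.0, "(E)-2-nonenal": 0.08}
concentrations_ug_kg = {"ethyl acetate": 120.0, "hexyl acetate": 9.5, "(E)-2-nonenal": 0.4}

for compound, conc in concentrations_ug_kg.items():
    oav = conc / thresholds_ug_kg[compound]
    print(f"{compound}: OAV = {oav:.1f}{' (aroma-active)' if oav > 1 else ''}")
```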


Mathematics ◽  
2018 ◽  
Vol 6 (11) ◽  
pp. 269 ◽  
Author(s):  
Sergio Camiz ◽  
Valério Pillar

The identification of a reduced dimensional representation of the data is among the main issues of exploratory multidimensional data analysis, and several solutions have been proposed in the literature, depending on the method. Principal Component Analysis (PCA) is the method that has received the largest attention thus far, and several identification methods—the so-called stopping rules—have been proposed, giving very different results in practice, and some comparative studies have been carried out. Some inconsistencies in the previous studies led us to clarify the distinction between signal and noise in PCA—and its limits—and to propose a new testing method. This consists of producing simulated data according to a predefined eigenvalue structure, including zero eigenvalues. From random populations built according to several such structures, reduced-size samples were extracted, and different levels of random normal noise were added to them. This controlled introduction of noise allows a clear distinction between expected signal and noise, the latter relegated to the non-zero eigenvalues in the samples that correspond to zero eigenvalues in the population. With this new method, we tested the performance of ten different stopping rules. For every method, structure, and noise level, both power (the ability to correctly identify the expected dimension) and type-I error (the detection of a dimension composed only of noise) were measured, by counting the relative frequencies with which the smallest non-zero eigenvalue in the population was recognized as signal in the samples and with which the largest zero eigenvalue was recognized as noise, respectively. This way, the behaviour of the examined methods is clear and their comparison and evaluation are possible. The reported results show that both Rencher's generalization of Bartlett's test and Pillar's bootstrap method perform much better than all the others: both show reasonable power, decreasing with noise, and very good type-I error. Thus, more than the others, these methods deserve to be adopted.
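
The data-generation scheme can be mimicked in a few lines. In the sketch below, the broken-stick rule stands in for the ten stopping rules actually compared in the paper, and the eigenvalue structure, noise level, sample size and number of replicates are illustrative choices.

```python
# Hedged sketch of the simulation scheme: population data with a fixed eigenvalue
# structure (including zero eigenvalues), samples with added normal noise, and a
# stopping rule applied to the sample eigenvalue proportions.
import numpy as np

def broken_stick(p):
    """Expected eigenvalue proportions under the broken-stick null model."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

def simulate(pop_eigvals, n=50, noise_sd=0.3, seed=0):
    rng = np.random.default_rng(seed)
    p = len(pop_eigvals)
    Q = np.linalg.qr(rng.normal(size=(p, p)))[0]           # random rotation of the latent axes
    Z = rng.normal(size=(n, p)) * np.sqrt(pop_eigvals)     # latent scores; zero eigenvalues = dead dims
    X = Z @ Q + rng.normal(scale=noise_sd, size=(n, p))    # observed sample with controlled noise
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eig / eig.sum() > broken_stick(p)               # True = dimension retained as signal

pop_eigvals = np.array([5.0, 3.0, 2.0, 0.0, 0.0, 0.0])     # three signal and three zero dimensions
rates = np.mean([simulate(pop_eigvals, seed=s) for s in range(200)], axis=0)
print("retention rate per dimension:", rates.round(2))     # dims 1-3 ~ power, dims 4-6 ~ type-I error
```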


Author(s):  
Vinicius Francisco Rofatto ◽  
Marcelo Tomio Matsuoka ◽  
Ivandro Klein ◽  
Mauricio Roberto Veronez ◽  
Luiz Gonzaga da Silveira Jr.

An iterative outlier elimination procedure based on hypothesis testing, commonly known as Iterative Data Snooping (IDS) among geodesists, is often used for the quality control of modern measurement systems in geodesy and surveying. The test statistic associated with IDS is the extreme normalised least-squares residual. It is well known in the literature that critical values (quantile values) of such a test statistic cannot be derived from well-known test distributions, but must be computed numerically by means of Monte Carlo. This paper provides the first results on Monte Carlo-based critical values under different scenarios of correlation between the outlier statistics. From the Monte Carlo evaluation, we compute the probabilities of correct identification, missed detection, wrong exclusion, overidentification and statistical overlap associated with IDS in the presence of a single outlier. Based on such probability levels, we obtain the Minimal Detectable Bias (MDB) and Minimal Identifiable Bias (MIB) for the case where IDS is in play. MDB and MIB are sensitivity indicators for outlier detection and identification, respectively. The results show that there are circumstances in which the larger the Type I decision error (smaller critical value), the higher the rates of outlier detection, but the lower the rates of outlier identification. In that case, the larger the Type I error, the larger the ratio between MIB and MDB. We also highlight that an outlier becomes identifiable when the contributions of the measurements to the wrong exclusion rate decline simultaneously. In that case, we verify that the effect of the correlation between the outlier statistics on the wrong exclusion rates becomes insignificant beyond a certain outlier magnitude, which increases the probability of identification.
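
IDS itself is compact enough to sketch. The following is not the authors' software: it illustrates the procedure for a toy linear model with uncorrelated, equally weighted observations and a known variance, with the critical value of the maximum normalised residual obtained by Monte Carlo under the outlier-free hypothesis.

```python
# Hedged sketch of Iterative Data Snooping for y = A x + e with unit weights and
# known variance sigma2; design matrix, sigma and alpha are illustrative choices.
import numpy as np

def normalised_residuals(A, y, sigma2):
    """Absolute normalised least-squares residuals |w_i| for unit observation weights."""
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ x_hat
    q = 1.0 - np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)    # redundancy numbers 1 - h_ii
    return np.abs(e) / np.sqrt(sigma2 * q)

def mc_critical_value(A, sigma2, alpha=0.001, runs=5000, seed=0):
    """(1 - alpha) quantile of max |w| under the outlier-free hypothesis."""
    rng = np.random.default_rng(seed)
    stats = [normalised_residuals(A, rng.normal(0.0, np.sqrt(sigma2), len(A)), sigma2).max()
             for _ in range(runs)]
    return np.quantile(stats, 1 - alpha)

def ids(A, y, sigma2, k_crit):
    """Iteratively flag and remove the observation with the largest normalised residual."""
    idx = np.arange(len(y))
    flagged = []
    while len(idx) > A.shape[1]:
        w = normalised_residuals(A[idx], y[idx], sigma2)
        if w.max() <= k_crit:
            break
        j = int(np.argmax(w))
        flagged.append(int(idx[j]))
        idx = np.delete(idx, j)
    return flagged

rng = np.random.default_rng(1)
A = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])   # toy design matrix
y = A @ np.array([10.0, 2.0]) + rng.normal(0.0, 0.01, 20)
y[7] += 0.08                                                    # single injected outlier
k = mc_critical_value(A, sigma2=0.01**2)
print(f"critical value = {k:.2f}, flagged observations = {ids(A, y, 0.01**2, k)}")
```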


2019 ◽  
Vol 27 (4) ◽  
pp. 253-258 ◽  
Author(s):  
A Garrido-Varo ◽  
J Garcia-Olmo ◽  
T Fearn

In identifying spectral outliers in near infrared calibration it is common to use a distance measure that is related to Mahalanobis distance. However, different software packages tend to use different variants, which lead to a translation problem if more than one package is used. Here the relationships between squared Mahalanobis distance D², the GH distance of WinISI, and the T² and leverage (L) statistics of The Unscrambler are established as D² = T² ≈ L × n ≈ GH × k, where n and k are the numbers of samples and variables, respectively, in the set of spectral data used to establish the distance measure. The implications for setting thresholds for outlier detection are discussed. On the way to this result the principal component scores from WinISI and The Unscrambler are compared. Both packages scale the scores for a component to have variances proportional to the contribution of that component to total variance, but the WinISI scores, unlike those from The Unscrambler, do not have mean zero.
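
A small numeric check of these relations is sketched below. The "spectra" are random numbers, the leverage is computed without the 1/n centring term, and GH is taken as D²/k per the relation above, so the last check holds by construction.

```python
# Hedged numeric check of D^2 = T^2 ~ L * n ~ GH * k on simulated data: D^2 is the
# squared Mahalanobis distance in the k-dimensional PCA score space, T^2 is
# Hotelling's statistic on the same scores, L is the leverage of the scores.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 120))                   # n = 40 samples, 120 "wavelengths"
n, k = X.shape[0], 5

T = PCA(n_components=k).fit_transform(X)         # mean-centred, uncorrelated scores
lam = T.var(axis=0, ddof=1)                      # score variances
T2 = (T**2 / lam).sum(axis=1)                    # Hotelling's T^2 per sample
D2 = np.einsum("ij,jk,ik->i", T, np.linalg.inv(np.cov(T, rowvar=False)), T)
L = np.diag(T @ np.linalg.inv(T.T @ T) @ T.T)    # leverage (no 1/n centring term)
GH = D2 / k                                      # rescaling per the abstract's relation

print(np.allclose(D2, T2))                       # exact equality
print(np.allclose(D2, L * (n - 1)))              # exact with n - 1; approximately L * n
print(np.allclose(D2, GH * k))                   # true by the GH = D^2 / k convention
```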


2020 ◽  
Vol 12 (5) ◽  
pp. 860 ◽  
Author(s):  
Vinicius Francisco Rofatto ◽  
Marcelo Tomio Matsuoka ◽  
Ivandro Klein ◽  
Maurício Roberto Veronez ◽  
Luiz Gonzaga da Silveira

An iterative outlier elimination procedure based on hypothesis testing, commonly known as Iterative Data Snooping (IDS) among geodesists, is often used for the quality control of modern measurement systems in geodesy and surveying. The test statistic associated with IDS is the extreme normalised least-squares residual. It is well-known in the literature that critical values (quantile values) of such a test statistic cannot be derived from well-known test distributions but must be computed numerically by means of Monte Carlo. This paper provides the first results on Monte Carlo-based critical values under different scenarios of correlation between outlier statistics. From the Monte Carlo evaluation, we compute the probabilities of correct identification, missed detection, wrong exclusion, over-identifications and statistical overlap associated with IDS in the presence of a single outlier. On the basis of such probability levels, we obtain the Minimal Detectable Bias (MDB) and Minimal Identifiable Bias (MIB) for cases in which IDS is in play. The MDB and MIB are sensitivity indicators for outlier detection and identification, respectively. The results show that there are circumstances in which the larger the Type I decision error (smaller critical value), the higher the rates of outlier detection but the lower the rates of outlier identification. In such a case, the larger the Type I Error, the larger the ratio between the MIB and MDB. We also highlight that an outlier becomes identifiable when the contributions of the measures to the wrong exclusion rate decline simultaneously. In this case, we verify that the effect of the correlation between outlier statistics on the wrong exclusion rate becomes insignificant beyond a certain outlier magnitude, which increases the probability of identification.
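
The sensitivity indicators described here can be approximated numerically. In the hedged sketch below, detection and identification rates for a single injected outlier are estimated by Monte Carlo over a range of magnitudes; the design matrix, standard deviation, critical value and 0.8 target are illustrative, not the paper's settings.

```python
# Hedged sketch: Monte Carlo estimation of detection and identification rates for
# a single outlier, from which MDB- and MIB-style sensitivity levels can be read.
import numpy as np

def max_w(A, y, sigma2):
    """Index and value of the maximum absolute normalised residual (unit weights)."""
    e = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    q = 1.0 - np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)     # redundancy numbers
    w = np.abs(e) / np.sqrt(sigma2 * q)
    return int(np.argmax(w)), float(w.max())

rng = np.random.default_rng(0)
A = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])
sigma, k_crit, runs = 0.01, 3.9, 2000                        # k_crit is an assumed critical value

for magnitude in np.arange(2, 10) * sigma:                   # outlier sizes in sigma units
    detect = identify = 0
    for _ in range(runs):
        y = rng.normal(0.0, sigma, 20)                       # residuals do not depend on true x
        y[7] += magnitude                                    # contaminate observation 7
        j, w = max_w(A, y, sigma**2)
        detect += w > k_crit
        identify += (w > k_crit) and (j == 7)
    print(f"{magnitude / sigma:.0f} sigma: P_detect = {detect / runs:.2f}, "
          f"P_identify = {identify / runs:.2f}")
# The smallest magnitudes at which P_detect and P_identify reach a chosen target
# (e.g. 0.8) play the role of the MDB and the MIB; P_identify <= P_detect, so MIB >= MDB.
```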


2019 ◽  
Author(s):  
Brandon J. Coombes ◽  
Joanna M. Biernacka

Abstract: Polygenic risk scores (PRSs) have become an increasingly popular approach for demonstrating polygenic influences on complex traits and for establishing common polygenic signals between different traits. PRSs are typically constructed using pruning and thresholding (P+T), but the best choice of parameters is uncertain; thus, multiple settings are used and the best is chosen. This optimization can lead to inflated type I error. To correct this, permutation procedures can be used, but they can be computationally intensive. Alternatively, a single parameter setting can be chosen a priori for the PRS, but choosing suboptimal settings results in a loss of power. We propose computing PRSs under a range of parameter settings, performing principal component analysis (PCA) on the resulting set of PRSs, and using the first PRS-PC in association tests. The first PC reweights the variants included in the PRS so as to achieve maximum variation over all PRS settings used. Using simulations, we compare the performance of the proposed PRS-PCA approach with a permutation test and with a priori selection of the p-value threshold. We then apply the approach to the Mayo Clinic Bipolar Disorder Biobank study to test for PRS association with psychosis, using a variety of PRSs constructed from summary statistics from the largest studies of psychiatric disorders and related traits. The PRS-PCA approach is simple to implement, outperforms the other strategies in most scenarios, and provides an unbiased estimate of prediction performance. We therefore recommend its use in PRS association studies where multiple phenotypes and/or PRSs are being investigated.
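
The PRS-PCA construction itself is short. In the sketch below, genotypes, effect estimates, p-values and the phenotype are simulated placeholders, the pruning step of P+T is omitted, and the thresholds are arbitrary; only the score-then-PCA-then-test flow mirrors the description above.

```python
# Hedged sketch of the PRS-PCA idea: compute PRSs at several p-value thresholds,
# standardise the PRS matrix, take the first principal component, and use it in a
# simple association test. All inputs are simulated placeholders.
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, m = 1000, 500                                        # individuals, variants
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)     # simulated genotypes (0/1/2)
beta = rng.normal(0.0, 0.05, m)                         # "GWAS" effect estimates
pval = np.concatenate([rng.uniform(0, 1e-3, 50),        # 50 variants with small p-values
                       rng.uniform(0, 1, m - 50)])
y = G[:, :50] @ beta[:50] + rng.normal(0.0, 1.0, n)     # phenotype driven by the first 50 variants

thresholds = [1e-4, 1e-3, 1e-2, 0.05, 0.5, 1.0]         # arbitrary p-value thresholds
prs = np.column_stack([(G * (pval < t)) @ beta for t in thresholds])

prs_pc1 = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(prs))[:, 0]
r, p = stats.pearsonr(prs_pc1, y)                       # association test with the first PRS-PC
print(f"PRS-PC1 association: r = {r:.3f}, p = {p:.1e}")
```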


2000 ◽  
Vol 14 (1) ◽  
pp. 1-10 ◽  
Author(s):  
Joni Kettunen ◽  
Niklas Ravaja ◽  
Liisa Keltikangas-Järvinen

Abstract We examined the use of smoothing to enhance the detection of response coupling from the activity of different response systems. Three different types of moving average smoothers were applied to both simulated interbeat interval (IBI) and electrodermal activity (EDA) time series and to empirical IBI, EDA, and facial electromyography time series. The results indicated that progressive smoothing increased the efficiency of the detection of response coupling but did not increase the probability of Type I error. The power of the smoothing methods depended on the response characteristics. The benefits and use of the smoothing methods to extract information from psychophysiological time series are discussed.
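
A centred moving average of the kind examined here is a one-liner; the sketch below applies it to two simulated, weakly coupled series and shows how the correlation used to index response coupling changes with the smoothing window (the signals and window lengths are illustrative only).

```python
# Hedged sketch: moving-average smoothing of two noisy, weakly coupled series of
# the IBI/EDA kind; smoothing makes the shared slow component easier to detect.
import numpy as np

def moving_average(x, window=5):
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(0)
t = np.arange(300)
common = np.sin(2 * np.pi * t / 60)                  # shared slow component
ibi = common + rng.normal(0.0, 1.0, t.size)          # noisy series 1
eda = 0.8 * common + rng.normal(0.0, 1.0, t.size)    # noisy series 2

for w in (1, 5, 11, 21):
    r = np.corrcoef(moving_average(ibi, w), moving_average(eda, w))[0, 1]
    print(f"window {w:2d}: correlation = {r:.2f}")
```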


Methodology ◽  
2012 ◽  
Vol 8 (1) ◽  
pp. 23-38 ◽  
Author(s):  
Manuel C. Voelkle ◽  
Patrick E. McKnight

The use of latent curve models (LCMs) has increased almost exponentially during the last decade. Oftentimes, researchers regard LCM as a “new” method to analyze change, with little attention paid to the fact that the technique was originally introduced as an “alternative to standard repeated measures ANOVA and first-order auto-regressive methods” (Meredith & Tisak, 1990, p. 107). In the first part of the paper, this close relationship is reviewed, and it is demonstrated how “traditional” methods, such as repeated measures ANOVA and MANOVA, can be formulated as LCMs. Given that latent curve modeling is essentially a large-sample technique, compared to “traditional” finite-sample approaches, the second part of the paper addresses, by means of a Monte Carlo simulation, the question of the degree to which the more flexible LCMs can actually replace some of the older tests. In addition, a structural equation modeling alternative to Mauchly’s (1940) test of sphericity is explored. Although “traditional” methods may be expressed as special cases of more general LCMs, we found that the equivalence holds only asymptotically. For practical purposes, however, no approach always outperformed the other alternatives in terms of power and type I error, so the best method to use depends on the situation. We provide detailed recommendations on when to use which method.
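
As a concrete, hedged illustration of how latent curve models relate to more traditional longitudinal analyses: a linear LCM with random intercept and slope factors can be fitted equivalently as a mixed-effects model, which is the route sketched below on simulated data using statsmodels. This illustrates the general equivalence only; it is not the simulation design of the paper.

```python
# Hedged sketch: a linear latent growth model fitted as a random-intercept,
# random-slope mixed model on simulated longitudinal data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, n_waves = 100, 4
intercepts = rng.normal(10.0, 2.0, n_subjects)          # latent intercept factor scores
slopes = rng.normal(1.5, 0.5, n_subjects)               # latent slope factor scores

rows = [{"id": i, "time": t,
         "y": intercepts[i] + slopes[i] * t + rng.normal(0.0, 1.0)}
        for i in range(n_subjects) for t in range(n_waves)]
data = pd.DataFrame(rows)

# Fixed effects correspond to the latent factor means; random-effect (co)variances
# correspond to the factor (co)variances of the equivalent linear LCM.
model = smf.mixedlm("y ~ time", data, groups=data["id"], re_formula="~time")
fit = model.fit()
print(fit.summary())
```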


Methodology ◽  
2015 ◽  
Vol 11 (1) ◽  
pp. 3-12 ◽  
Author(s):  
Jochen Ranger ◽  
Jörg-Tobias Kuhn

In this manuscript, a new approach to the analysis of person fit is presented that is based on the information matrix test of White (1982). This test can be interpreted as a test of trait stability during the measurement situation. The test approximately follows a χ²-distribution. In small samples, the approximation can be improved by a higher-order expansion. The performance of the test is explored in a simulation study. This simulation study suggests that the test adheres to the nominal Type-I error rate well, although it tends to be conservative in very short scales. The power of the test is compared to the power of four alternative tests of person fit. This comparison corroborates that the power of the information matrix test is similar to the power of the alternative tests. Advantages and areas of application of the information matrix test are discussed.

