Ordering of Omics Features Using Beta Distributions on Montecarlo p-Values

Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1307
Author(s):  
Angela L. Riffo-Campos ◽  
Guillermo Ayala ◽  
Juan Domingo

The current trend in genetic research is the study of omics data as a whole, either by combining studies or omics techniques. This raises the need for new robust statistical methods that can integrate and order the relevant biological information. A good way to approach the problem is to order the features studied according to the different kinds of data, so a key point is to associate with each feature a value that permits a good sorting of them. These values are usually the p-values of a hypothesis test performed for each feature. The Montecarlo method is certainly one of the most robust methods for hypothesis testing. However, a large number of simulations is needed to obtain a reliable p-value, so the method becomes computationally infeasible in many situations. We propose a new way to order genes according to their differential features by using a score defined from a beta distribution fitted to the generated p-values. Our approach has been tested using simulated data and colorectal cancer datasets from the Infinium MethylationEPIC array, the Affymetrix gene expression array and the Illumina RNA-seq platform. The results show that this approach allows a proper ordering of genes using far fewer simulations than the Montecarlo method. Furthermore, the score can be interpreted as an estimated p-value and compared with Montecarlo and other approaches, such as the p-value of moderated t-tests. We have also identified a new expression pattern of eighteen genes common to all colorectal cancer microarrays, i.e., 21 datasets. Thus, the proposed method is effective for obtaining biological results from different datasets. Our score shows a slightly smaller type I error for small sample sizes than the Montecarlo p-value. The type II error of the Montecarlo p-value is lower than that of the proposed score and of the moderated p-value, but these differences shrink markedly for larger sample sizes and higher false discovery rates. With type I and type II error performance this similar, the score enables a clear ordering of the features being evaluated.
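One plausible reading of the approach, as a minimal sketch: estimate several Monte Carlo p-values for a feature from small batches of null simulations, fit a beta distribution to them, and use a summary of that fit as the ordering score. The batch sizes, the use of the beta mean as the score, and all names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def batch_pvalues(stat_obs, simulate_null, n_batches=20, batch_size=50):
    """One Monte Carlo p-value per small batch of null simulations."""
    pvals = []
    for _ in range(n_batches):
        null = np.array([simulate_null() for _ in range(batch_size)])
        # classical Monte Carlo estimator with the +1 correction
        pvals.append((np.sum(null >= stat_obs) + 1) / (batch_size + 1))
    return np.asarray(pvals)

def beta_score(pvals):
    """Fit Beta(a, b) on [0, 1] to the batch p-values and return its mean,
    which can be read as an estimated p-value for sorting features."""
    a, b, _, _ = stats.beta.fit(pvals, floc=0, fscale=1)
    return a / (a + b)

# toy example: score one 'gene' whose observed statistic is a sample mean
x = rng.normal(0.3, 1.0, size=30)
score = beta_score(batch_pvalues(x.mean(), lambda: rng.normal(0, 1, 30).mean()))
print(score)  # small scores sort to the top of the ordered gene list
```

Far fewer simulations are spent here (20 batches of 50) than a single high-resolution Montecarlo p-value would need, which is the computational saving the abstract describes.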

Author(s):  
Abhaya Indrayan

Background: Small P-values have conventionally been considered evidence to reject a null hypothesis in empirical studies. However, P-values now face widespread criticism, and the threshold we use for statistical significance is being questioned. Methods: This communication takes a contrarian view and explains why the P-value and its threshold are still useful for ruling out sampling fluctuation as a source of the findings. Results: The problem is not with P-values themselves but with their misuse, abuse, and over-use, including the dominant role they have assumed in empirical results. False results arise mostly from errors in design, invalid data, inadequate analysis, inappropriate interpretation, accumulation of Type-I error, and selective reporting, and not from P-values per se. Conclusion: A threshold for P-values such as 0.05 for statistical significance is helpful in making a binary inference for practical application of the result. However, a lower threshold can be suggested to reduce the chance of false results. Also, the emphasis should be on detecting a medically significant effect, not a zero effect.


Methodology ◽  
2016 ◽  
Vol 12 (2) ◽  
pp. 44-51 ◽  
Author(s):  
José Manuel Caperos ◽  
Ricardo Olmos ◽  
Antonio Pardo

Abstract. Correlation analysis is one of the most widely used methods to test hypotheses in the social and health sciences; however, its use is not completely error free. We explored the frequency of inconsistencies between reported p-values and the associated test statistics in 186 papers published in four Spanish journals of psychology (1,950 correlation tests); we also collected information about the use of one- versus two-tailed tests in the presence of directional hypotheses, and about the use of adjustments to control Type I errors due to simultaneous inference. Of the reported correlation tests, 83.8% are incomplete and 92.5% include an inexact p-value. Gross inconsistencies, which are liable to alter the statistical conclusions, appear in 4% of the reviewed tests, and 26.9% of the inconsistencies found were large enough to bias the results of a meta-analysis. The choice of one-tailed tests and the use of adjustments to control the Type I error rate are negligible. We therefore urge authors, reviewers, and editorial boards to pay particular attention to these issues in order to prevent inconsistencies in statistical reports.
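Such consistency checks are straightforward to automate. The sketch below recomputes the two-tailed p-value of a Pearson correlation from the reported r and n and flags a discrepancy with the reported p; the tolerance and the function name are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def check_correlation_report(r, n, reported_p, tol=0.005):
    """Recompute the two-tailed p-value of a Pearson correlation
    from r and n, and flag a discrepancy with the reported value."""
    t = r * np.sqrt((n - 2) / (1 - r**2))      # t statistic with n - 2 df
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return p, abs(p - reported_p) > tol

# e.g. a paper reports r = .25, n = 100, p = .001 (two-tailed)
recomputed, inconsistent = check_correlation_report(0.25, 100, 0.001)
print(f"recomputed p = {recomputed:.4f}, inconsistent = {inconsistent}")
```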


2017 ◽  
Author(s):  
Diptavo Dutta ◽  
Laura Scott ◽  
Michael Boehnke ◽  
Seunggeun Lee

In genetic association analysis, a joint test of multiple distinct phenotypes can increase the power to identify sets of trait-associated variants within genes or regions of interest. Existing multi-phenotype tests for rare variants make specific assumptions about the patterns of association of the underlying causal variants, and violation of these assumptions can reduce the power to detect association. Here we develop a general framework for testing pleiotropic effects of rare variants based on multivariate kernel regression (Multi-SKAT). Multi-SKAT models the effect sizes of variants on the phenotypes through a kernel matrix and performs a variance component test of association. We show that many existing tests are equivalent to specific choices of kernel matrices within the Multi-SKAT framework. To increase the power to detect association across tests with different kernel matrices, we developed a fast and accurate approximation of the significance of the minimum observed p-value across tests. To account for related individuals, our framework uses a random effect based on the kinship matrix. Using simulated data and amino acid and exome-array data from the METSIM study, we show that Multi-SKAT can improve power over the single-phenotype SKAT-O test and existing multiple-phenotype tests, while maintaining the type I error rate.
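For intuition, the sketch below computes the core quadratic-form score statistic that SKAT-type tests build on, for a single phenotype and a weighted linear kernel. It is not the Multi-SKAT implementation, and the intercept-only null model and weighting scheme are assumptions for illustration.

```python
import numpy as np

def skat_q(y, G, weights):
    """SKAT-style variance-component score statistic for one phenotype.

    y: (n,) phenotype; G: (n, m) genotype matrix; weights: (m,) variant weights.
    Computes Q = r' K r with K = G W W' G' and r the null-model residuals."""
    r = y - y.mean()        # residuals from an intercept-only null model
    GW = G * weights        # column-wise weighting, i.e. G @ diag(weights)
    return (r @ GW) @ (GW.T @ r)

# Significance is usually assessed from a mixture-of-chi-square distribution;
# permuting y gives a simple, assumption-light alternative for a sketch.
```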


2020 ◽  
Vol 10 (1) ◽  
pp. 16
Author(s):  
Mohamed Elbassiouny ◽  
Dina Ragab ◽  
Ghada Refaat ◽  
Suhad A. Ali

Background: Colorectal cancer (CRC) is the third most common cancer in men and the second in women, with 1.8 million new cases (1,026,000 men and 823,303 women) and almost 881,000 deaths worldwide in 2018. Rates are substantially higher in males than in females. Aim of the work: In this retrospective study we aimed to evaluate the prognostic impact of baseline NLR and platelet count on clinicopathological factors and outcome in patients with colorectal cancer of all stages treated from 1 January 2014 to the end of December 2016 in the Department of Clinical Oncology and Nuclear Medicine, Ain Shams University hospitals, Cairo, Egypt. Patients and methods: The medical records of 409 patients in the GI oncology unit, Ain Shams Clinical Oncology Department, were reviewed for the period from 1 January 2014 to 30 December 2016. Total neutrophil, lymphocyte, and platelet counts were available for only 169 patients. The study ended on 1 August 2018, with a median follow-up of 27.5 months (observation window: 1/1/2014 to 1/8/2018). All 169 patients had pathologically proven colorectal adenocarcinoma, with ages ranging from 18 to 75 years (median age: 55.5 years). Results: Of the 169 patients enrolled in this study, 124 had resectable disease and underwent curative surgery; 44 patients had right-sided tumours and 80 had left-sided colon tumours; 45 patients were metastatic from the start. A postoperative platelet count ≥ 310 was statistically significant for OS, PFS and DFS (P values < .001, < .001 and .007, respectively). Pre-treatment platelet counts showed more frequent thrombocytosis in the metastatic group than in the locally advanced group, but the difference was not statistically significant (P = .066). A postoperative NLR ≥ 2 was significant for OS, PFS and DFS among the 169 enrolled patients (P values < .001, .002 and < .001, respectively). In the multivariate analysis, an elevated postoperative NLR was an independent prognostic and predictive factor for DFS, PFS and OS (P = .03, .03 and ≤ .001, respectively), and platelet count was an independent prognostic and predictive factor for both PFS and OS (P = .04 and .03, respectively). Conclusion: An abnormal NLR (≥ 2) is a prognostic and predictive marker of decreased DFS, PFS and OS in all patient groups. An abnormal platelet count (≥ 310) is likewise prognostic and predictive of a significant decrease in PFS and OS. Multidisciplinary management is needed to make surgeons aware of the importance of adequate lymph node dissection: our study showed a statistically significant decrease in OS in patients who underwent inadequate lymph node dissection.
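As a minimal sketch of the survival analysis reported here, the study's cutoffs (NLR ≥ 2, platelet count ≥ 310) can be entered as dichotomized covariates in a Cox proportional hazards model; the file and column names below are hypothetical.

```python
import pandas as pd
from lifelines import CoxPHFitter

# hypothetical cohort file with columns: os_months, death, nlr, platelets
df = pd.read_csv("crc_cohort.csv")
df["nlr_high"] = (df["nlr"] >= 2).astype(int)          # cutoff from the study
df["plt_high"] = (df["platelets"] >= 310).astype(int)  # cutoff from the study

cph = CoxPHFitter()
cph.fit(df[["os_months", "death", "nlr_high", "plt_high"]],
        duration_col="os_months", event_col="death")
cph.print_summary()  # hazard ratios, 95% CIs and p-values per factor
```

The same model, refitted with PFS or DFS as the endpoint, gives the corresponding multivariate results.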


10.2196/21345 ◽  
2020 ◽  
Vol 22 (8) ◽  
pp. e21345 ◽  
Author(s):  
Marcus Bendtsen

When should a trial stop? Such a seemingly innocent question evokes concerns of type I and II errors among those who believe that certainty can be the product of uncertainty, and among researchers who have been told that they need to carefully calculate sample sizes, consider multiplicity, and not spend P values on interim analyses. However, the endeavor to dichotomize evidence into significant and nonsignificant has caused the basic driving force of science, namely uncertainty, to take a back seat. In this viewpoint we argue that if testing the null hypothesis is the ultimate goal of science, then we need not worry about writing protocols, considering ethics, applying for funding, or running any experiments at all: all null hypotheses will be rejected at some point; everything has an effect. The job of science should be to unearth the uncertainties of the effects of treatments, not to test their difference from zero. We also show the fickleness of P values: how they may one day point to statistically significant results, and, after a few more participants have been recruited, the once statistically significant effect suddenly disappears. We show plots that we hope will intuitively highlight that all assessments of evidence fluctuate over time. Finally, we discuss the remedy in the form of Bayesian methods, where uncertainty leads, and which allow continuous decisions to stop or continue recruitment as new data from a trial accumulate.
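The fickleness is easy to reproduce: the toy simulation below recomputes a two-sample t-test as participants accrue, and the p-value typically crosses the 0.05 line several times along the way. The effect size and sample sizes are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# two arms with a small true difference in means
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.1, 1.0, 500)

# 'interim analyses' every 60 participants per arm
for n in range(20, 501, 60):
    p = stats.ttest_ind(a[:n], b[:n]).pvalue
    print(f"n per arm = {n:3d}   p = {p:.3f}")
```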


2021 ◽  
Author(s):  
Claudia Solis-Lemus ◽  
Aaron M. Holleman ◽  
Andrei Todor ◽  
Bekh Bradley ◽  
Kerry J. Ressler ◽  
...  

Genomewide association studies increasingly employ multivariate tests of multiple correlated phenotypes to exploit likely pleiotropy and improve power. Typical multivariate methods produce a global p-value of association between a variant (or set of variants) and multiple phenotypes. When the global test is significant, interest then focuses on dissecting the signal and, in particular, delineating the set of phenotypes on which the genetic variant(s) have a direct effect from the remaining phenotypes on which the genetic variant(s) have either an indirect effect or none. While existing techniques like mediation models can be used for this purpose, they generally cannot handle high-dimensional phenotypic and genotypic data. To help fill this important gap, we propose a modification of a kernel distance-covariance framework for gene mapping of multiple variants with multiple phenotypes, to test instead whether the association between the variants and a group of phenotypes is driven by a direct association with just a subset of the phenotypes. We use simulated data to show that our new method controls type I error and is powerful in detecting a variety of models demonstrating different patterns of direct and indirect effects. We further illustrate our method using GWAS data from the Grady Trauma Project and show that an existing signal between genetic variants in the ZHX2 gene and 21 items within the Beck Depression Inventory appears to be due to a direct effect of these variants on only 3 of these items. Our approach scales to genomewide analysis and is applicable to high-dimensional correlated phenotypes.
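The building block the framework rests on can be sketched compactly: the sample distance covariance between genotypes and a candidate subset of phenotypes, with significance from a row permutation. This is not the authors' subset-testing procedure, only the underlying statistic; all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcov_sq(X, Y):
    """Squared sample distance covariance between row-matched samples
    X (n x p) and Y (n x q), via double-centered distance matrices."""
    A = squareform(pdist(X))
    B = squareform(pdist(Y))
    A = A - A.mean(0) - A.mean(1)[:, None] + A.mean()
    B = B - B.mean(0) - B.mean(1)[:, None] + B.mean()
    return (A * B).mean()

def perm_pvalue(G, P_sub, n_perm=999, seed=0):
    """Permutation p-value for association between genotypes G and a
    phenotype subset P_sub, breaking the row pairing under the null."""
    rng = np.random.default_rng(seed)
    obs = dcov_sq(G, P_sub)
    null = [dcov_sq(G, P_sub[rng.permutation(len(P_sub))])
            for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in null)) / (1 + n_perm)
```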


2021 ◽  
Author(s):  
Ellen Fitzsimmons

The posterior predictive p-value (ppp-value) is currently the primary measure of fit for Bayesian SEM. It is a measure of discrepancy between observed data and a posited model, comparing an observed likelihood ratio test (LRT) statistic to the posterior distribution of LRT statistics under a fitted model. However, the LRT statistic requires a likelihood, and multiple likelihoods are available for a given SEM: we can use a marginal likelihood that integrates out the latent variable(s), or we can use a conditional likelihood that conditions on the latent variable(s). A ppp-value based on conditional likelihoods is unexplored in the SEM literature, so the goal of this project is to study its performance alongside the marginal ppp-value. We present comparisons of the marginal and conditional ppp-values using real and simulated data, leading to recommendations on uses of the metrics in practice.
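In generic form, a posterior predictive p-value is the share of posterior draws for which replicated data look at least as discrepant as the observed data. In the SEM setting described here, the discrepancy would be the LRT statistic computed under either the marginal or the conditional likelihood; the function names below are illustrative.

```python
import numpy as np

def ppp_value(discrepancy, posterior_draws, y_obs, simulate):
    """Posterior predictive p-value.

    discrepancy(y, theta): e.g. an LRT statistic (marginal or conditional)
    posterior_draws:       parameter draws from the fitted model
    simulate(theta):       draws one replicated data set given theta"""
    exceed = 0
    for theta in posterior_draws:
        y_rep = simulate(theta)
        exceed += discrepancy(y_rep, theta) >= discrepancy(y_obs, theta)
    return exceed / len(posterior_draws)
```

Values near 0.5 indicate adequate fit; values near 0 or 1 flag misfit.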


2020 ◽  
Vol 21 (15) ◽  
pp. 5395 ◽  
Author(s):  
Dominique Scherer ◽  
Heike Deutelmoser ◽  
Yesilda Balavarca ◽  
Reka Toth ◽  
Nina Habermann ◽  
...  

An individual’s inherited genetic variation may contribute to the ‘angiogenic switch’, which is essential for the blood supply and growth of microscopic and macroscopic tumors. Polymorphisms in angiogenesis-related genes potentially predispose to colorectal cancer (CRC) or affect the survival of CRC patients. We investigated the association of 392 single nucleotide polymorphisms (SNPs) in 33 angiogenesis-related genes with CRC risk and patient survival in 1754 CRC cases and 1781 healthy controls within DACHS (Darmkrebs: Chancen der Verhütung durch Screening), a German population-based case-control study. Odds ratios and 95% confidence intervals (CIs) were estimated by unconditional logistic regression to test for genetic associations with CRC risk. The Cox proportional hazards model was used to estimate hazard ratios (HRs) and 95% CIs for survival. Multiple testing was adjusted for by the false discovery rate. No variant was associated with CRC risk. Variants in EFNB2, MMP2 and JAG1 were significantly associated with overall survival. The association of the EFNB2 tagging SNP rs9520090 (p < 0.0001) was confirmed in two validation datasets (p-values: 0.01 and 0.05). The associations of the tagging SNPs rs6040062 in JAG1 (p-value 0.0003) and rs2241145 in MMP2 (p-value 0.0005) showed the same direction of association with overall survival in the first and second validation sets, respectively, although they did not reach significance (p-values: 0.09 and 0.25, respectively). EFNB2, MMP2 and JAG1 are known for their functional roles in angiogenesis, and the present study provides novel evidence for the impact of angiogenesis-related genetic variants on CRC outcome.
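The risk analysis described can be sketched as per-SNP logistic regressions followed by a false discovery rate adjustment; the file and column names are hypothetical, and Benjamini-Hochberg stands in for whichever FDR procedure the study used.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

# one row per subject; 'case' is 0/1 and SNP columns hold allele dosages
df = pd.read_csv("dachs_snps.csv")
snps = [c for c in df.columns if c.startswith("rs")]

pvals = []
for snp in snps:
    fit = sm.Logit(df["case"], sm.add_constant(df[[snp]])).fit(disp=0)
    odds_ratio = np.exp(fit.params[snp])           # per-allele odds ratio
    ci_low, ci_high = np.exp(fit.conf_int().loc[snp])
    pvals.append(fit.pvalues[snp])

# false discovery rate adjustment across the 392 tests
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```

Survival would be analyzed analogously, with a Cox model in place of the logistic regression.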


2019 ◽  
Author(s):  
Dimitri Marques Abramov

Abstract. Background: Methods for p-value correction are criticized for either increasing Type II error or improperly reducing Type I error. This problem is worse when dealing with hundreds or thousands of paired comparisons between waves or images performed point-to-point. This text considers patterns in the probability vectors resulting from multiple point-to-point comparisons between two ERP waves (mass univariate analysis) in order to correct p-values. These patterns (probability waves) mirror ERP waveshapes and might be indicators of consistency in statistical differences. New method: To compute and analyze these patterns, we convolved the decimal logarithm of the probability vector (p’) with a Gaussian vector of size compatible with the observed ERP periods. To verify the consistency of this method, we also calculated the mean amplitudes of late ERPs at the Pz (P300 wave) and O1 electrodes in two samples, of typical and ADHD subjects respectively. Results: The present method reduces the range of p’-values that show no covariance with their neighbors (that is, those that are likely random differences, i.e. Type I errors), while preserving the amplitude of the probability waves, in accordance with the difference between the respective mean amplitudes. Comparison with existing methods: The positive-FDR approach resulted in a different profile of corrected p-values, which is not consistent with the expected results or with the differences between the mean amplitudes of the analyzed ERPs. Conclusion: The present method seems to be biologically and statistically more suitable for correcting p-values in mass univariate analysis of ERP waves.
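The smoothing step itself is simple: take the decimal log of the point-to-point p-value vector and convolve it with a Gaussian kernel whose width matches the ERP periods of interest. The sampling rate and kernel width below are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def probability_wave(pvals, sigma_samples):
    """Convolve log10(p) with a Gaussian kernel to expose probability
    waves: isolated (likely Type I) dips are flattened, while consistent
    stretches of low p survive."""
    return gaussian_filter1d(np.log10(np.asarray(pvals)), sigma=sigma_samples)

# e.g. 600 point-to-point tests at 500 Hz; a ~50 ms smoothing scale
# corresponds to sigma = 25 samples (both numbers illustrative)
```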


2021 ◽  
Author(s):  
Marcos A. Antezana

Abstract. When a data matrix (DM) has many independent variables (IVs), it is not computationally tractable to assess the association of every distinct IV subset with the dependent variable (DV) of the DM, because the number of subsets explodes combinatorially as the IVs increase. But model selection and correction for multiple tests are complex even with few IVs.

DMs in genomics will soon summarize millions of markers (mutations) and genomes. Searching exhaustively in such DMs for mutations that alone or synergistically with others are associated with a trait is computationally tractable only for 1- and 2-mutation effects. This is also why population geneticists study mainly 2-marker combinations.

I present a computationally tractable, fully parallelizable Participation in Association Score (PAS) that, in a DM with markers, detects one by one every column that is strongly associated in any way with others. PAS does not examine column subsets, and its computational cost grows linearly with the number of columns, remaining reasonable even when DMs have millions of columns. PAS P values are readily obtained by permutation and accurately Šidák-corrected for multiple tests, bypassing model selection. The P values of a column’s PASs and dvPASs for different orders of association are i.i.d. and easily combined into a single P value.

PAS exploits the fact that associations of markers in the rows of a DM cause associations of matches in the pairwise comparisons of the rows. For every such comparison with a match at a tested column, PAS computes the matches at the other columns by modifying the comparison’s total matches (scored once per DM), yielding a distribution of conditional matches that reacts diagnostically to the associations of the tested column. Equally tractable is dvPAS, which flags DV-associated IVs by also probing the matches at the DV.

Simulations show that (i) PAS and dvPAS generate uniform-(0,1)-distributed type I error in null DMs, and (ii) they detect randomly encountered binary and trinary models of significant n-column association and of n-IV association to a binary DV, respectively, with power on the order of magnitude of exhaustive evaluation’s, and with false positives that are uniform-(0,1)-distributed or straightforwardly tuned to be so. Power to detect 2-way associations that extend over 100+ columns is non-parametrically ultimate, but power to detect pure n-column associations and pure n-IV DV associations sinks exponentially with increasing n.

Importantly for geneticists, dvPAS power increases about twofold in trinary vs. binary DMs, and by orders of magnitude when markers are linked like mutations in chromosomes, especially in trinary DMs, where dvPAS furthermore fine-maps with the highest resolution.
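The PAS statistic itself is involved, but the inference machinery the abstract describes is compact: a permutation p-value per column, Šidák-corrected for the number of columns tested. A minimal sketch, with illustrative names:

```python
import numpy as np

def permutation_pvalue(score_obs, null_scores):
    """Permutation p-value with the standard +1 correction."""
    null_scores = np.asarray(null_scores)
    return (1 + np.sum(null_scores >= score_obs)) / (1 + len(null_scores))

def sidak_correct(p, m):
    """Sidak correction of a p-value for m simultaneous tests."""
    return 1.0 - (1.0 - p) ** m

# e.g. one PAS per column in a DM with 10_000 columns:
# p_adj = sidak_correct(permutation_pvalue(score, null_scores), m=10_000)
```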

