A large-sample crisis? Exaggerated false positives by popular differential expression methods

Author(s):  
Yumei Li ◽  
Xinzhou Ge ◽  
Fanglue Peng ◽  
Wei Li ◽  
Jingyi Jessica Li

Abstract We report a surprising phenomenon about identifying differentially expressed genes (DEGs) from population-level RNA-seq data: two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates (FDRs). Via permutation analysis on an immunotherapy RNA-seq dataset, we observed that DESeq2 and edgeR identified even more DEGs after samples’ condition labels were randomly permuted. Motivated by this, we evaluated six DEG identification methods (DESeq2, edgeR, limma-voom, NOISeq, dearseq, and the Wilcoxon rank-sum test) on population-level RNA-seq datasets. We found that the three popular parametric methods (DESeq2, edgeR, and limma-voom) and the new non-parametric method dearseq often failed to control the FDR. In particular, the actual FDRs of DESeq2 and edgeR sometimes exceeded 20% when the target FDR threshold was only 5%. Although NOISeq, a non-parametric method used by GTEx, controlled the FDR better than the other four methods did, its power was much lower than that of the Wilcoxon rank-sum test, a classic non-parametric test that consistently controlled the FDR and achieved good power in our evaluation. Based on these results, for population-level RNA-seq studies, we recommend the Wilcoxon rank-sum test.
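A minimal sketch of the recommended workflow, assuming a normalized genes-by-samples expression matrix and a two-group design; the toy data, variable names, and use of SciPy/statsmodels are illustrative assumptions, not the paper's own code.

```python
# Sketch: call DEGs with the Wilcoxon rank-sum (Mann-Whitney U) test plus
# Benjamini-Hochberg FDR control.  `counts` and `labels` are toy placeholders.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
counts = rng.gamma(shape=2.0, scale=50.0, size=(1000, 40))  # 1000 genes x 40 samples (normalized)
labels = np.array([0] * 20 + [1] * 20)                      # two conditions, 20 samples each

pvals = np.array([
    mannwhitneyu(gene[labels == 0], gene[labels == 1],
                 alternative="two-sided").pvalue
    for gene in counts
])

# Benjamini-Hochberg control of the FDR at the 5% target level
rejected, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{rejected.sum()} genes called differentially expressed at FDR 5%")
```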

2018 ◽  
Author(s):  
James Liley ◽  
Chris Wallace

Abstract A common aim in high-dimensional association studies is the identification of the subset of investigated variables associated with a trait of interest. Using association statistics on the same variables for a second related trait can improve power. An important quantity in such analyses is the conditional false-discovery rate (cFDR), the probability of non-association with the trait of interest given p-value thresholds for both traits. The cFDR can be used for hypothesis testing and as a posterior probability in its own right. In this paper, we propose new estimators for the cFDR based on kernel density estimates and mixture-Gaussian models of effect sizes, the latter also allowing estimation of a ‘local’ form of cFDR (cfdr). We also propose a general non-parametric improvement to existing estimators based on estimating a posterior probability previously estimated at 1. We find that the new estimators have the desirable property of smooth rejection regions but, unexpectedly, do not improve the power of the method, even when the distributional assumptions are true. Furthermore, we find that although the local cfdr represents a theoretically optimal decision boundary, noisiness in its estimation means it is less powerful than corresponding cFDR estimates. We find, however, that the non-parametric adjustment increases power for every estimator. We demonstrate the best method on transcriptome-wide association study datasets for breast and ovarian cancers. The findings from this analysis are of both theoretical and pragmatic interest, giving insight into the nature of cFDR and the behaviour of false-discovery rates in a two-dimensional setting. Our methods allow improved control over the behaviour of the cFDR estimator and improved power in high-dimensional hypothesis testing.
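For orientation, a crude empirical estimator of the cFDR at a point (p, q) can be sketched as below; this is the basic empirical form used in earlier cFDR work, not the kernel-density or mixture-Gaussian estimators the paper proposes, and all variable names and the toy uniform p-values are assumptions.

```python
# Sketch: empirical cFDR(p, q), i.e. an estimate of
# P(no association with the principal trait | P <= p, Q <= q).
import numpy as np

def empirical_cfdr(p, q, p_all, q_all):
    """Estimate cFDR(p, q) from paired p-values for the principal trait
    (p_all) and the conditional trait (q_all)."""
    n_cond = np.sum(q_all <= q)                   # variables passing the conditional threshold
    n_both = np.sum((p_all <= p) & (q_all <= q))  # variables passing both thresholds
    if n_both == 0:
        return 1.0
    # p bounds P(P <= p | null); dividing by the empirical joint tail gives the cFDR
    return min(1.0, p * n_cond / n_both)

rng = np.random.default_rng(1)
p_all = rng.uniform(size=100_000)   # toy p-values for the trait of interest
q_all = rng.uniform(size=100_000)   # toy p-values for the related trait
print(empirical_cfdr(1e-4, 0.01, p_all, q_all))
```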


2013 ◽  
Author(s):  
Alexander Dobin ◽  
Thomas R Gingeras

In the recent paper by Kim et al. (Genome Biology, 2013, 14(4): R36), the accuracy of TopHat2 was compared to that of other RNA-seq aligners. In this comment, we re-examine the most important analyses from that paper and identify several deficiencies that significantly diminished the performance of some of the aligners, including an incorrect choice of mapping parameters, unfair comparison metrics, and unrealistic simulated data. Using STAR (Dobin et al., Bioinformatics, 2013, 29(1): 15-21) as an exemplar, we demonstrate that correcting these deficiencies makes its accuracy equal to or better than that of TopHat2. Furthermore, this exercise highlighted several serious issues with the TopHat2 algorithms, such as poor recall of alignments with a moderate (>3) number of mismatches, low sensitivity and a high false discovery rate for splice junction detection, loss of precision in the realignment algorithm, and a large number of false chimeric alignments.
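As a toy illustration of the junction-level comparison metrics mentioned above (recall, precision, and false discovery rate for splice junction detection), the sketch below compares a reported junction set against a simulated truth set; the junction coordinates are placeholders, not data from either paper.

```python
# Sketch: junction-level recall, precision and FDR from sets of simulated
# ("true") and aligner-reported splice junctions.  All tuples are illustrative.
true_junctions = {("chr1", 1000, 2000), ("chr1", 5000, 7000), ("chr2", 300, 900)}
reported_junctions = {("chr1", 1000, 2000), ("chr2", 300, 900), ("chr2", 400, 950)}

tp = len(true_junctions & reported_junctions)   # correctly reported junctions
fp = len(reported_junctions - true_junctions)   # spurious junctions
fn = len(true_junctions - reported_junctions)   # missed junctions

recall = tp / (tp + fn)
precision = tp / (tp + fp)
fdr = fp / (tp + fp)
print(f"recall={recall:.2f} precision={precision:.2f} FDR={fdr:.2f}")
```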


2015 ◽  
Author(s):  
Swati Parekh ◽  
Christoph Ziegenhain ◽  
Beate Vieth ◽  
Wolfgang Enard ◽  
Ines Hellmann

Background Currently, quantitative RNA-Seq methods are pushed to work with increasingly small starting amounts of RNA that require PCR amplification to generate libraries. However, it is unclear how much noise or bias amplification introduces and how this affects the precision and accuracy of RNA quantification. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR duplicates) need to be identified. Computationally, read duplicates are defined via their mapping position, which does not distinguish PCR duplicates from natural duplicates that are bound to occur for highly transcribed RNAs. Hence, it is unclear how to treat duplicate reads and how important it is to reduce PCR amplification experimentally. Here, we generate and analyse RNA-Seq datasets that were prepared with three different protocols (Smart-Seq, TruSeq and UMI-seq). We find that a large fraction of computationally identified read duplicates can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates does not improve accuracy, power or false discovery rates, but can actually worsen them. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and false discovery rate are only mildly improved. However, we do find that power improves with fewer PCR amplification cycles across datasets, and that early barcoding of samples, and hence PCR amplification in one reaction, can restore this loss of power. Conclusions Computational removal of read duplicates is not recommended for differential expression analysis. However, the pooling of samples made possible by the early barcoding of the UMI protocol leads to an appreciable increase in the power to detect differentially expressed genes.
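A schematic sketch of the two duplicate definitions contrasted in the abstract: position-only duplicates versus UMI-aware duplicates, where reads sharing a mapping position but carrying different UMIs are retained as natural duplicates. The read records and field names below are illustrative assumptions, not the study's pipeline.

```python
# Sketch: marking read duplicates by mapping position alone versus by
# position + UMI.  Each read is (chrom, position, strand, umi); toy data only.
reads = [
    ("chr1", 100, "+", "AACGT"),
    ("chr1", 100, "+", "AACGT"),   # same position, same UMI  -> PCR duplicate
    ("chr1", 100, "+", "GGTTA"),   # same position, new UMI   -> natural duplicate
    ("chr2", 500, "-", "CCATG"),
]

def dedup(reads, use_umi):
    """Keep one read per key; the key is position only, or position + UMI."""
    seen, kept = set(), []
    for chrom, pos, strand, umi in reads:
        key = (chrom, pos, strand, umi) if use_umi else (chrom, pos, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, pos, strand, umi))
    return kept

print(len(dedup(reads, use_umi=False)))  # 2 reads survive position-only dedup
print(len(dedup(reads, use_umi=True)))   # 3 reads survive UMI-aware dedup
```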


2019 ◽  
Vol 21 (6) ◽  
pp. 2052-2065 ◽  
Author(s):  
Arfa Mehmood ◽  
Asta Laiho ◽  
Mikko S Venäläinen ◽  
Aidan J McGlinchey ◽  
Ning Wang ◽  
...  

Abstract Differential splicing (DS) is a post-transcriptional biological process with critical, wide-ranging effects on a plethora of cellular activities and disease processes. To date, a number of computational approaches have been developed to identify and quantify differentially spliced genes from RNA-seq data, but a comprehensive intercomparison and appraisal of these approaches is currently lacking. In this study, we systematically evaluated 10 DS analysis tools for consistency and reproducibility, precision, recall and false discovery rate, agreement upon reported differentially spliced genes, and functional enrichment. The tools were selected to represent three methodological categories: exon-based (DEXSeq, edgeR, JunctionSeq, limma), isoform-based (cuffdiff2, DiffSplice) and event-based methods (dSpliceType, MAJIQ, rMATS, SUPPA). Overall, all the exon-based methods and two event-based methods (MAJIQ and rMATS) scored well on the selected measures, and the exon-based methods performed generally better than the isoform-based and event-based methods. However, the tools performed strikingly differently across data sets and sample sizes.
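One of the evaluation measures above, agreement upon reported differentially spliced genes, can be sketched as pairwise Jaccard overlap between the gene sets each tool reports; the tool names are taken from the abstract, but the gene sets and this specific overlap measure are illustrative assumptions rather than the benchmark's code.

```python
# Sketch: pairwise agreement between DS tools, measured as the Jaccard index
# of their reported differentially spliced gene sets (toy placeholders).
from itertools import combinations

calls = {
    "DEXSeq":     {"GENE1", "GENE2", "GENE3", "GENE5"},
    "rMATS":      {"GENE1", "GENE3", "GENE4"},
    "DiffSplice": {"GENE2", "GENE6"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

for (tool_a, set_a), (tool_b, set_b) in combinations(calls.items(), 2):
    print(f"{tool_a} vs {tool_b}: Jaccard = {jaccard(set_a, set_b):.2f}")
```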


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1169
Author(s):  
Juan Bógalo ◽  
Pilar Poncela ◽  
Eva Senra

Real-time monitoring of the economy is based on activity indicators that show regular patterns such as trends, seasonality and business cycles. However, parametric and non-parametric methods for signal extraction produce revisions at the end of the sample, and the arrival of new data makes it difficult to assess the state of the economy. In this paper, we compare two signal extraction procedures: Circulant Singular Spectral Analysis (CiSSA), a non-parametric technique with which we can extract components associated with desired frequencies, and a parametric method based on ARIMA modelling. Through a set of simulations, we show that the magnitude of the revisions produced by CiSSA converges to zero more quickly and is smaller than that of the alternative procedure.
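To make the notion of end-of-sample revisions concrete, the sketch below extracts a signal on an expanding sample and records how the estimate at a fixed date changes as new observations arrive; a centred moving average stands in for the actual CiSSA or ARIMA-based extraction, and the series, window and dates are illustrative assumptions.

```python
# Sketch: measuring end-of-sample revisions of an extracted signal on an
# expanding sample.  The moving average is a stand-in for the real method.
import numpy as np

def extract_trend(y, window=13):
    """Centred moving average; the tails are padded with edge values."""
    pad = window // 2
    ypad = np.pad(y, pad, mode="edge")
    return np.convolve(ypad, np.ones(window) / window, mode="valid")

rng = np.random.default_rng(2)
t = np.arange(240)
series = 0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.3, size=t.size)

target = 199          # fixed date whose trend estimate we track
revisions, previous = [], None
for end in range(200, 241):          # expanding estimation sample
    estimate = extract_trend(series[:end])[target]
    if previous is not None:
        revisions.append(abs(estimate - previous))
    previous = estimate

print(f"mean absolute revision at t={target}: {np.mean(revisions):.4f}")
```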


Forecasting ◽  
2020 ◽  
Vol 3 (1) ◽  
pp. 1-16
Author(s):  
Hassan Hamie ◽  
Anis Hoayek ◽  
Hans Auer

The question of whether the liberalization of the gas industry has led to less concentrated markets has attracted much interest among the scientific community. Classical mathematical regression tools, statistical tests, and optimization equilibrium problems, more precisely non-linear complementarity problems, have been used to model European gas markets and their effect on prices. In this research, parametric and non-parametric game-theoretic methods are employed to study the effect of market concentration on gas prices. The parametric method relies on the classical Cournot equilibrium test, with assumptions on cost and demand functions, whereas the non-parametric method does not make any prior assumptions, a factor that allows greater freedom in modeling. The results of the parametric method demonstrate that the gas suppliers’ behavior in the Austrian and Dutch gas markets follows the Nash–Cournot equilibrium, where companies act rationally to maximize their payoffs. The non-parametric approach validates the finding that suppliers in both markets follow the same behavior even though one market is more liquid than the other. Interestingly, our findings also suggest that some of the gas suppliers maximize their ‘utility function’ not only by relying on profit, but also on some type of non-profit objective, and possibly collusive behavior.
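For readers unfamiliar with the parametric benchmark, a textbook Nash–Cournot equilibrium can be computed as sketched below, using linear inverse demand and constant marginal costs and iterating best responses; the parameter values are illustrative assumptions, not estimates for the Austrian or Dutch markets.

```python
# Sketch: two-firm Nash-Cournot equilibrium with linear inverse demand
# p(Q) = a - b*Q and constant marginal costs, solved by best-response iteration.
a, b = 100.0, 1.0                 # inverse demand intercept and slope (toy values)
costs = [10.0, 20.0]              # marginal costs of the two suppliers (toy values)
q = [0.0 for _ in costs]          # initial quantities

for _ in range(200):              # best-response dynamics converge for this game
    for i, c in enumerate(costs):
        q_others = sum(q) - q[i]
        # firm i's best response maximises (a - b*(q_i + q_others) - c) * q_i
        q[i] = max(0.0, (a - c - b * q_others) / (2 * b))

price = a - b * sum(q)
print(f"quantities={q}, price={price:.2f}")
# Closed-form check for two firms: q_i = (a + sum(costs) - 3*c_i) / (3*b)
```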


2011 ◽  
Vol 5 (21) ◽  
pp. 8678-8685
Author(s):  
de Souza Lima Vitor ◽  
Carlos C B Soares de Mello João ◽ 
Angulo Meza Lidia
