Weighted mining of massive collections of P-values by convex optimization

2017 ◽  
Vol 7 (2) ◽  
pp. 251-275
Author(s):  
Edgar Dobriban

Researchers in data-rich disciplines (think of computational genomics and observational cosmology) often wish to mine large bodies of $P$-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp, a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the $P$-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous ‘standard’ methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).
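To make the weighting idea concrete, here is a minimal sketch (not the Princessp convex program itself, which is the paper's contribution) of how per-hypothesis weights enter a weighted Benjamini-Hochberg step: p-values are divided by non-negative weights that average to one, and the usual step-up rule is applied to the reweighted values. The weights and data below are purely illustrative.

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: apply the step-up rule to p_i / w_i.

    The weights are assumed non-negative with mean one, so upweighted
    hypotheses (w_i > 1) need less evidence to be rejected and
    downweighted ones need more.
    """
    q = np.asarray(pvals, dtype=float) / np.asarray(weights, dtype=float)
    m = q.size
    order = np.argsort(q)
    below = q[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# illustrative use: prioritize the first half of the hypotheses
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
p[:50] = rng.uniform(size=50) * 1e-3           # a block of true signals
w = np.where(np.arange(1000) < 500, 1.5, 0.5)  # weights average to one
print(weighted_bh(p, w).sum(), "rejections")
```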

2004 ◽  
Vol 3 (1) ◽  
pp. 1-16 ◽  
Author(s):  
Kerby Shedden

A common experimental strategy utilizing microarrays is to develop a signature of genes responding to some treatment in a model system, and then ask whether the same genes respond in an analogous way in a more natural and uncontrolled environment. In statistical terms, the question posed is whether genes score similarly on some statistical test in two independent data sets. Approaches to this problem that ignore gene-gene correlations common to all microarray data sets are known to give overstated statistical confidence levels. Permutation approaches have been proposed to give more accurate confidence levels, but cannot be applied when sample sizes are small. Here we argue that the product moment correlation between test statistics in the two experiments is an ideal measure for summarizing concordance between the experiments, as confidence levels accounting for intergene correlations depend only on a single number: the average squared correlation between gene pairs in the data set. The resulting null standard deviation is shown to vary by less than a factor of two over six distinct experimental data sets, suggesting that a universal constant may be used for this quantity. We show how a hidden assumption of the permutation approach may lead to incorrect p-values, while the analytic approach presented here is shown to be robust to violations of this assumption.
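As a rough illustration of the two quantities this abstract centers on, the sketch below computes the product moment correlation between per-gene test statistics from two experiments and the average squared gene-gene correlation; using an expression matrix to estimate the intergene correlations is an assumption made here for illustration, and the paper's exact null standard deviation formula is not reproduced.

```python
import numpy as np

def concordance_summary(t1, t2, expr):
    """Concordance between two experiments and the single number that,
    per the abstract, drives its null standard deviation.

    t1, t2 : per-gene test statistics from the two independent data sets
    expr   : genes x samples matrix used here (illustratively) to
             estimate gene-gene correlations
    """
    r = np.corrcoef(t1, t2)[0, 1]              # concordance measure
    gene_corr = np.corrcoef(expr)              # gene-gene correlation matrix
    m = gene_corr.shape[0]
    off_diag = gene_corr[np.triu_indices(m, k=1)]
    avg_sq_corr = np.mean(off_diag ** 2)       # average squared correlation
    return r, avg_sq_corr
```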


2018 ◽  
Author(s):  
Justin G. Chitpin ◽  
Aseel Awdeh ◽  
Theodore J. Perkins

Motivation: ChIP-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions and, as a consequence, invalidates false discovery rate estimates. Thus, the true significance or reliability of peak calls remains unknown. Results: Using simulated and real ChIP-seq data sets, we show that three well-known peak callers, MACS, SICER and diffReps, output optimistically biased p-values, and therefore optimistic false discovery rate estimates, in some cases many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate and correct for biases built into peak calling algorithms. P-values recalibrated by RECAP are approximately uniformly distributed when applied to null hypothesis data, in which ChIP-seq and control come from the same genomic distributions. When applied to non-null data, RECAP p-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability: The RECAP software is available on GitHub at https://github.com/theodorejperkins/[email protected]
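The recalibration step can be pictured with the following sketch: p-values reported on resampled data, where ChIP and control come from the same distribution, serve as an empirical null against which the observed p-values are re-ranked. This is only the generic recalibrate-against-a-resampled-null idea; RECAP's actual resampling and correction scheme is specified in the paper.

```python
import numpy as np

def recalibrate_pvalues(observed_p, null_p):
    """Recalibrate peak-caller p-values against p-values from runs on
    resampled (null) data: the corrected value is the fraction of null
    p-values that are as small as or smaller than the observed one.
    """
    null_sorted = np.sort(np.asarray(null_p, dtype=float))
    obs = np.asarray(observed_p, dtype=float)
    n_at_least_as_extreme = np.searchsorted(null_sorted, obs, side="right")
    return (n_at_least_as_extreme + 1) / (null_sorted.size + 1)
```

By construction, such recalibrated values are approximately uniform when the observed data really are null, which is the behavior the abstract reports for RECAP.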


2016 ◽  
Vol 8 (1) ◽  
Author(s):  
Sesha K. Dassanayaka ◽  
Joshua French

We present a simple, fast, and easily interpretable procedure that results in faster detection of outbreaks in multiple spatial regions. Disease counts from neighboring regions are aggregated to compute a Poisson CUSUM statistic for each region. Instead of controlling the average run length error criterion in the testing process, we utilize the false discovery rate. Additionally, p-values are used to make decisions instead of traditional critical values. The use of the false discovery rate and p-values in testing allows us to utilize more powerful multiple testing methodologies. The procedure is successfully applied to detect the 2011 Salmonella Newport outbreak in Germany.
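For readers unfamiliar with the building blocks, the sketch below shows a one-sided Poisson CUSUM statistic and a Monte Carlo p-value for its maximum under the in-control rate; the aggregation over neighboring regions and the specific multiple testing step applied to the resulting p-values follow the authors' specification and are not reproduced here.

```python
import numpy as np

def poisson_cusum(counts, lam0, lam1):
    """One-sided Poisson CUSUM for detecting a rate increase lam0 -> lam1."""
    k = (lam1 - lam0) / np.log(lam1 / lam0)   # standard reference value
    s, path = 0.0, []
    for x in counts:
        s = max(0.0, s + x - k)
        path.append(s)
    return np.array(path)

def cusum_mc_pvalue(observed_max, lam0, lam1, n_periods, n_sim=2000, seed=0):
    """Monte Carlo p-value: probability of an equal-or-larger CUSUM maximum
    under the in-control rate lam0."""
    rng = np.random.default_rng(seed)
    sims = rng.poisson(lam0, size=(n_sim, n_periods))
    maxima = np.array([poisson_cusum(row, lam0, lam1).max() for row in sims])
    return (np.sum(maxima >= observed_max) + 1) / (n_sim + 1)
```

One such p-value per region would then feed an FDR-controlling step such as Benjamini-Hochberg.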


2015 ◽  
Author(s):  
Halit Ongen ◽  
Alfonso Buil ◽  
Andrew Brown ◽  
Emmanouil Dermitzakis ◽  
Olivier Delaneau

Motivation: In order to discover quantitative trait loci (QTLs), multi-dimensional genomic data sets combining DNA-seq and ChIP-/RNA-seq require methods that rapidly correlate tens of thousands of molecular phenotypes with millions of genetic variants while appropriately controlling for multiple testing. Results: We have developed FastQTL, a method that implements a popular cis-QTL mapping strategy in a user- and cluster-friendly tool. FastQTL also provides an efficient permutation procedure to control for multiple testing. The outcome of the permutations is modeled using beta distributions trained from a few permutations, from which adjusted p-values can be estimated at any significance level with little computational cost. The Geuvadis and GTEx pilot data sets can now be analyzed an order of magnitude faster than with previous approaches. Availability: Source code, binaries and comprehensive documentation of FastQTL are freely available to download at http://fastqtl.sourceforge.net/
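The beta-approximation idea described in the abstract can be sketched as follows: fit a Beta(a, b) distribution to the minimum nominal p-values obtained from a modest number of permutations, then read the adjusted p-value off its CDF. This is a rough illustration of the idea, not FastQTL's implementation; the toy permutation minima below are made up.

```python
import numpy as np
from scipy import stats

def beta_adjusted_pvalue(obs_min_p, perm_min_pvals):
    """Permutation-adjusted p-value via a fitted beta distribution.

    obs_min_p      : smallest nominal p-value observed for the phenotype
    perm_min_pvals : smallest nominal p-value from each permutation
    """
    a, b, _, _ = stats.beta.fit(perm_min_pvals, floc=0, fscale=1)
    return stats.beta.cdf(obs_min_p, a, b)

# toy check with stand-in permutation minima
rng = np.random.default_rng(1)
perm = rng.beta(1.0, 50.0, size=100)
print(beta_adjusted_pvalue(1e-4, perm))
```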


2021 ◽  
Vol 11 (2) ◽  
Author(s):  
Cynthia Dwork ◽  
Weijie Su ◽  
Li Zhang

Differential privacy provides a rigorous framework for privacy-preserving data analysis. This paper proposes the first differentially private procedure for controlling the false discovery rate (FDR) in multiple hypothesis testing. Inspired by the Benjamini-Hochberg procedure (BHq), our approach first repeatedly adds noise to the logarithms of the p-values to ensure differential privacy and selects an approximately smallest p-value as a promising candidate at each iteration; the selected p-values are then supplied to the BHq, and our private procedure releases only the rejected ones. Moreover, we develop a new technique, based on a backward submartingale, for proving FDR control of a broad class of multiple testing procedures, including our private procedure and both the BHq step-up and step-down procedures. As a novel aspect, the proof works for arbitrary dependence between the true null and false null test statistics, while FDR control is maintained up to a small multiplicative factor.
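The shape of the pipeline, though not its privacy calibration, can be sketched as below: Laplace noise is added to the log p-values and an approximately smallest one is peeled off at each iteration; the selected p-values would then go to a BHq step, with only the rejected indices released. The noise scale here is an arbitrary placeholder and carries no differential-privacy guarantee; the paper specifies the required calibration.

```python
import numpy as np

def noisy_min_selection(pvals, n_select, noise_scale, seed=0):
    """Illustrative report-noisy-min loop over log p-values.

    noise_scale is an arbitrary placeholder, not the calibrated scale
    required for a differential-privacy guarantee.
    """
    rng = np.random.default_rng(seed)
    logp = np.log(np.asarray(pvals, dtype=float))
    remaining = list(range(len(pvals)))
    selected = []
    for _ in range(n_select):
        noisy = logp[remaining] + rng.laplace(scale=noise_scale, size=len(remaining))
        idx = remaining[int(np.argmin(noisy))]
        selected.append(idx)
        remaining.remove(idx)
    return selected
```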


2018 ◽  
Vol 617 ◽  
pp. A136 ◽  
Author(s):  
A. Neld ◽  
C. Horellou ◽  
D. D. Mulcahy ◽  
R. Beck ◽  
S. Bourke ◽  
...  

Context. The new generation of broad-band radio continuum surveys will provide large data sets with polarization information. New algorithms need to be developed to extract reliable catalogs of linearly polarized sources that can be used to characterize those sources and to produce a dense rotation measure (RM) grid for probing magneto-ionized structures along the line of sight via Faraday rotation. Aims. The aim of this paper is to develop a computationally efficient and rigorously defined source-finding algorithm for linearly polarized sources. Methods. We used a calibrated data set from the LOw Frequency ARray (LOFAR) at 150 MHz centered on the nearby galaxy M 51 to search for polarized background sources. With new imaging software, we re-imaged the field at a resolution of 18″ × 15″ and cataloged a total of about 3000 continuum sources within 2.5° of the center of M 51. We made small Stokes Q and U images centered on each source brighter than 100 mJy in total intensity (201 sources) and used RM synthesis to create corresponding Faraday cubes that were analyzed individually. For each source, the noise distribution function was determined from a subset of the measurements at high Faraday depths where no polarization is expected; the peaks in polarized intensity in the Faraday spectrum were identified and the p-value of each source was calculated. Finally, the false discovery rate method was applied to the list of p-values to produce a list of polarized sources and to quantify the reliability of the detections. We also analyzed sources fainter than 100 mJy that were reported as polarized in the literature at at least one other radio frequency. Results. Of the 201 sources that were searched for polarization, six polarized sources were detected confidently (with a false discovery rate of 5%). This corresponds to a number density of one polarized source per 3.3 square degrees, or 0.3 sources per square degree. Increasing the false discovery rate to 50% yields 19 sources. A majority of the sources have a morphology indicative of double-lobed radio galaxies, and those with literature redshift measurements have 0.5 < z < 1.0. Conclusions. We find that this method is effective in identifying polarized sources and is well suited for LOFAR observations. In the future, we intend to develop it further and apply it to larger data sets, such as the LOFAR Two-metre Sky Survey (LoTSS) of the whole northern sky and the ongoing deep LOFAR observations of the GOODS-North field.
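A minimal sketch of the per-source detection step described in the Methods: the high-Faraday-depth channels of a source's Faraday spectrum serve as an empirical noise reference, and the peak polarized intensity is converted to a p-value by its rank in that reference. The resulting p-values for the 201 sources would then go through a false discovery rate step (e.g. Benjamini-Hochberg at 5%); the exact noise model and RM synthesis details follow the paper and are not reproduced here.

```python
import numpy as np

def empirical_peak_pvalue(faraday_amplitude, noise_channels):
    """Empirical p-value of the peak polarized intensity in a Faraday
    spectrum, using high-|Faraday depth| channels as the noise reference.

    faraday_amplitude : polarized intensity per Faraday-depth channel
    noise_channels    : boolean mask of channels where no polarization
                        is expected
    """
    amp = np.asarray(faraday_amplitude, dtype=float)
    noise = amp[noise_channels]
    peak = amp[~noise_channels].max()
    return (np.sum(noise >= peak) + 1) / (noise.size + 1)
```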


2018 ◽  
Vol 154 (2) ◽  
pp. 149-155
Author(s):  
Michael Archer

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years before the 2-year cycle with damped waveform was shown varied between 17 and 26, or was not found in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
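For reference, the autocorrelation function used in point 2 can be computed directly from a yearly count series; the sketch below uses the usual sample ACF and the conventional ±1.96/√N significance band (the wasp count data themselves are not reproduced here).

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation function of a yearly count series."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[: x.size - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])

# a significant negative lag-1 value followed by a smaller positive lag-2
# value (judged against +/- 1.96 / sqrt(N)) is the damped two-year-cycle
# signature described above
```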


2018 ◽  
Vol 21 (2) ◽  
pp. 117-124 ◽  
Author(s):  
Bakhtyar Sepehri ◽  
Nematollah Omidikia ◽  
Mohsen Kompany-Zareh ◽  
Raouf Ghavami

Aims & Scope: In this research, eight variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets, comprising 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors, were modelled by CoMFA. For each data set, a CoMFA model was first created with all CoMFA descriptors; a new CoMFA model was then developed with each variable selection method, so that nine CoMFA models were built per data set. The results show that noisy and uninformative variables affect CoMFA results. Based on the created models, applying five variable selection approaches, namely FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS and SPA-jackknife, increases the predictive power and stability of CoMFA models significantly. Result & Conclusion: Among them, SPA-jackknife removes most of the variables, while FFD retains most of them. FFD and IVE-PLS are time-consuming processes, while SRD-FFD and SRD-UVE-PLS run in a few seconds. Applying FFD, SRD-FFD, IVE-PLS and SRD-UVE-PLS also preserves CoMFA contour map information for both fields.
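As a yardstick for the predictive power compared above, a leave-one-out cross-validated q² of a PLS model on the field descriptors is the usual measure; the sketch below (using scikit-learn rather than the original CoMFA software, with an arbitrary number of PLS components) shows how q² might be computed before and after a variable selection step.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2_loo(X, y, n_components=3):
    """Leave-one-out cross-validated q2 of a PLS model.

    X : compounds x field-descriptor matrix (all or selected variables)
    y : activity values
    """
    pls = PLSRegression(n_components=n_components)
    y_pred = cross_val_predict(pls, X, y, cv=LeaveOneOut()).ravel()
    press = np.sum((np.asarray(y) - y_pred) ** 2)
    ss = np.sum((np.asarray(y) - np.mean(y)) ** 2)
    return 1.0 - press / ss

# comparing q2_loo(X_all, y) with q2_loo(X_selected, y) quantifies whether
# dropping noisy or uninformative field variables improves predictivity
```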

