onlineFDR: an R package to control the false discovery rate for growing data repositories

2019, Vol 35 (20), pp. 4196-4199
Author(s): David S Robertson, Jan Wildenhain, Adel Javanmard, Natasha A Karp

Abstract

Summary: In many areas of biological research, hypotheses are tested sequentially, without access to future P-values or even the number of hypotheses to be tested. A key setting for such online hypothesis testing is publicly available data repositories, where the family of hypotheses to be tested grows continually as new data accumulate over time. Recently, Javanmard and Montanari proposed the first procedures that control the false discovery rate (FDR) for online hypothesis testing. We present an R package, onlineFDR, which implements these procedures and provides wrapper functions to apply them to a historic dataset or a growing data repository.

Availability and implementation: The R package is freely available through Bioconductor (http://www.bioconductor.org/packages/onlineFDR).

Supplementary information: Supplementary data are available at Bioinformatics online.
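To make the online setting concrete, here is a minimal Python sketch of the LOND rule ("levels based on number of discoveries"), one of the procedures proposed by Javanmard and Montanari. It is not the onlineFDR package's API: the function name `lond` and the spending sequence (beta_i proportional to 1/i^2) are assumptions chosen for this illustration.

```python
import math


def lond(p_values, alpha=0.05):
    """Test a stream of p-values one at a time with the LOND rule.

    Hypothesis i is tested at level beta_i * (D + 1), where D is the number
    of discoveries made before i and the beta_i sum to alpha.
    Assumption: beta_i = alpha * 6 / (pi^2 * i^2), picked for illustration.
    """
    decisions = []
    discoveries = 0
    for i, p in enumerate(p_values, start=1):
        beta_i = alpha * 6.0 / (math.pi ** 2 * i ** 2)
        level = beta_i * (discoveries + 1)  # earlier discoveries raise the level
        reject = p <= level
        decisions.append(reject)
        discoveries += int(reject)
    return decisions


if __name__ == "__main__":
    # Each decision uses only the p-values seen so far.
    print(lond([0.001, 0.5, 0.0001, 0.04]))
```

Each decision depends only on past p-values, so the rule can be applied as a repository grows without revisiting earlier decisions.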

2019, Vol 35 (17), pp. 3184-3186
Author(s): Xiao-Fei Zhang, Le Ou-Yang, Shuo Yang, Xiaohua Hu, Hong Yan

Abstract

Summary: To identify biological network rewiring under different conditions, we develop a user-friendly R package, named DiffNetFDR, that implements two methods for testing differences between Gaussian graphical models. Compared to existing tools, our methods have the following features: (i) they are based on Gaussian graphical models, which can capture changes in conditional dependencies; (ii) they determine the tuning parameters in a data-driven manner; (iii) they use a multiple testing procedure to control the overall false discovery rate; and (iv) they define the differential network based on partial correlation coefficients, so that spurious differential edges caused by variation in conditional variances can be excluded. We also develop a Shiny application to provide easier analysis and visualization. Simulation studies are conducted to evaluate the performance of our methods. We also apply our methods to two real gene expression datasets. The effectiveness of our methods is validated by the biological significance of the identified differential networks.

Availability and implementation: The R package and Shiny app are available at https://github.com/Zhangxf-ccnu/DiffNetFDR.

Supplementary information: Supplementary data are available at Bioinformatics online.


2021
Author(s): Taavi Päll, Hannes Luidalepp, Tanel Tenson, Ülo Maiväli

Abstract

Here we assess reproducibility and inferential quality in the field of differential HT-seq, based on analysis of datasets submitted 2008-2019 to the NCBI GEO data repository. Analysis of GEO submission file structures places an overall 59% upper limit on reproducibility. We further show that only 23% of experiments resulted in theoretically expected p value histogram shapes, although both reproducibility and p value distributions show marked improvement over time. Uniform p value histogram shapes, indicative of <100 true effects, were extremely rare. Our calculations of π0, the fraction of true nulls, showed that 36% of experiments have π0 <0.5, meaning that in over a third of experiments most RNAs were estimated to change their expression level upon experimental treatment. Both the fraction of different p value histogram types and π0 values are strongly associated with the software used for calculating these p values by the original authors, indicating widespread bias.
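The abstract does not say how π0 was computed. One widely used estimator is Storey's, which assumes that p-values above a threshold λ come almost entirely from true nulls and are uniformly distributed. The sketch below shows that estimator only; the function name `storey_pi0` and the default λ = 0.5 are choices made for this example, not details taken from the study.

```python
def storey_pi0(p_values, lam=0.5):
    """Estimate the fraction of true nulls (pi0) with Storey's estimator.

    pi0_hat = #{p > lam} / (m * (1 - lam)), capped at 1.
    Null p-values are uniform, so a fraction (1 - lam) of them is expected
    to fall above lam; p-values from true effects rarely do.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    tail_count = sum(1 for p in p_values if p > lam)
    return min(1.0, tail_count / (m * (1.0 - lam)))


if __name__ == "__main__":
    # Six very small p-values plus four spread over (0, 1).
    pvals = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.3, 0.6, 0.8, 0.9]
    print(storey_pi0(pvals))
```

A value below 0.5 from such an estimator corresponds to the situation described above, where most tested features are estimated to be non-null.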


Author(s): Zongli Xu, Changchun Xie, Jack A Taylor, Liang Niu

Abstract

Summary: ipDMR is an R software tool for identifying differentially methylated regions (DMRs) using auto-correlated P-values for individual CpGs from epigenome-wide association analysis of array or bisulfite sequencing data. It summarizes P-values for adjacent CpGs, identifies association peaks and then extends peaks to find boundaries of DMRs. ipDMR uses BED format files as input and is easy to use. Simulations guided by real data found that ipDMR outperformed currently available methods, providing slightly higher true positive rates and much lower false discovery rates.

Availability and implementation: ipDMR is available at https://bioconductor.org/packages/release/bioc/html/ENmix.html.

Supplementary information: Supplementary data are available at Bioinformatics online.


2019, Vol 35 (19), pp. 3592-3598
Author(s): Justin G Chitpin, Aseel Awdeh, Theodore J Perkins

Abstract

Motivation: Chromatin immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown.

Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls.

Availability and implementation: The RECAP software is available through www.perkinslab.ca or on GitHub at https://github.com/theodorejperkins/RECAP.

Supplementary information: Supplementary data are available at Bioinformatics online.
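The idea of recalibrating P-values against a resampled null can be sketched with an empirical distribution function. The code below is a generic stand-in, not the RECAP implementation: it assumes a list of P-values that the same peak caller reported on resampled data with no true enrichment, and maps each observed P-value to the (add-one corrected) fraction of null P-values at or below it. The function name `recalibrate` is hypothetical.

```python
from bisect import bisect_right


def recalibrate(observed_p, null_p):
    """Map reported p-values to empirical p-values under a resampled null.

    observed_p: p-values the peak caller reported on the real data.
    null_p: p-values the same caller reported on resampled null data
            (assumed to be supplied by the user).
    The mapping is monotone: a smaller reported p-value never receives a
    larger recalibrated value. The add-one correction avoids returning 0.
    """
    null_sorted = sorted(null_p)
    n = len(null_sorted)
    return [(1 + bisect_right(null_sorted, p)) / (1 + n) for p in observed_p]


if __name__ == "__main__":
    null = [0.001, 0.01, 0.1, 0.5]  # over-optimistic values seen under the null
    print(recalibrate([0.0001, 0.01, 0.2], null))
```

If the peak caller routinely reports tiny P-values on null data, the recalibrated values are pulled back toward uniformity, which is the behaviour the abstract describes.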


Author(s): Xiaoyu Liang, Ying Hu, Chunhua Yan, Ke Xu

Abstract

Motivation: High-quality imaging analyses have been proposed to drive innovation in biomedical and biological research. However, images remain underexploited because of the limited capacity of human vision and the difficulty of extracting quantitative information from them. Computationally extracting quantitative information from images is critical to overcoming this limitation. Here, we present a novel R package, i2d, to simulate data from an image based on digital convolution.

Results: The R package i2d allows users to transform an image into a simulated dataset that can be used to extract and analyze complex information in biomedical and biological research. The package also includes three novel and efficient methods for graph clustering based on simulated data, which can be used to dissect complex gene networks into sub-clusters that have similar biological functions.

Availability and implementation: The code, the documentation, a tutorial and example data are available as open source at github.com/XiaoyuLiang/i2d.

Supplementary information: Supplementary data are available at Bioinformatics online.


2021
Author(s): Isaac Fink, Richard J. Abdill, Ran Blekhman, Laura Grieneisen

Abstract

Summary: A key aspect of microbiome research is analysis of longitudinal dynamics using time series data. A method to visualize both the proportional and absolute change in the abundance of multiple taxa across multiple subjects over time is needed. We developed BiomeHorizon, an open-source R package that visualizes longitudinal compositional microbiome data using horizon plots.

Availability and implementation: BiomeHorizon is available at https://github.com/blekhmanlab/biomehorizon/ and released under the MIT license. A guide with step-by-step instructions for using the package is provided at https://blekhmanlab.github.io/biomehorizon/. The guide also provides code to reproduce all plots in this paper.

Supplementary information: None.


2019, Vol 36 (1), pp. 177-185
Author(s): John Ferguson, Joseph Chang

Abstract

Motivation: In bioinformatics, genome-wide experiments look for important biological differences between two groups at a large number of locations in the genome. Often, the final analysis focuses on a P-value-based ranking of locations, which might then be investigated further in follow-up experiments. However, this strategy may result in small effects with low P-values being ranked more favorably than larger, more scientifically important effects. Bayesian ranking techniques may offer a solution to this problem, provided a good prior distribution for the collective distribution of effect sizes is available.

Results: We develop an Empirical Bayes ranking algorithm, using the marginal distribution of the data over all locations to estimate an appropriate prior. In simulations and analysis using real datasets, we demonstrate favorable performance compared to ordering P-values and a number of other competing ranking methods. The algorithm is computationally efficient and can be used to rank the entirety of genomic locations or to rank a subset of locations pre-selected via traditional FWER/FDR methods in a two-stage analysis.

Availability and implementation: An R package, EBrank, implementing the ranking algorithm is available on CRAN.

Supplementary information: Supplementary data are available at Bioinformatics online.
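The contrast between P-value ranking and effect-size ranking can be illustrated with a deliberately simplified empirical Bayes model. The sketch below is not the EBrank algorithm: it assumes a zero-centred normal prior whose variance is estimated by the method of moments from all locations, and it ranks by the absolute posterior mean. The function name `eb_rank` and both modelling choices are assumptions made for this example.

```python
def eb_rank(estimates, std_errors):
    """Rank locations by shrunken effect size under a normal-normal model.

    Model (assumed for illustration): x_i ~ N(theta_i, s_i^2) and
    theta_i ~ N(0, tau^2). Since E[x_i^2] = tau^2 + s_i^2, tau^2 is
    estimated as mean(x^2) - mean(s^2), floored at zero.
    Standard errors must be strictly positive.
    Returns (order, posterior_means); order lists indices from the largest
    to the smallest absolute posterior mean.
    """
    n = len(estimates)
    tau2 = max(
        0.0,
        sum(x * x for x in estimates) / n - sum(s * s for s in std_errors) / n,
    )
    # Posterior mean shrinks each estimate toward zero; noisier estimates
    # (larger s) are shrunk more.
    post_means = [tau2 / (tau2 + s * s) * x for x, s in zip(estimates, std_errors)]
    order = sorted(range(n), key=lambda i: abs(post_means[i]), reverse=True)
    return order, post_means


if __name__ == "__main__":
    # Location 0: tiny effect measured precisely (z = 10).
    # Location 1: large effect measured noisily (z = 2.5).
    order, post = eb_rank([0.2, 2.0], [0.02, 0.8])
    print(order, post)
```

In this toy input a P-value ranking would put location 0 first because its z-score is larger, whereas the shrunken effect sizes put location 1 first, which is the kind of reordering the abstract motivates.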


2021, pp. 096228022098338
Author(s): Jinjin Tian, Aaditya Ramdas

Biological research often involves testing a growing number of null hypotheses as new data are accumulated over time. We study the problem of online control of the familywise error rate, that is, testing an a priori unbounded sequence of hypotheses (p-values) one by one over time without knowing the future, such that with high probability there are no false discoveries in the entire sequence. This paper unifies algorithmic concepts developed for offline (single batch) familywise error rate control and online false discovery rate control to develop novel online familywise error rate control methods. Though many offline familywise error rate methods (e.g., Bonferroni, fallback procedures and Sidak's method) can trivially be extended to the online setting, our main contribution is the design of new, powerful, adaptive online algorithms that control the familywise error rate when the p-values are independent or locally dependent in time. Our numerical experiments demonstrate substantial gains in power, which are also formally proved in an idealized Gaussian sequence model. A promising application to the International Mouse Phenotyping Consortium is described.
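The "trivial" online extension of Bonferroni mentioned above can be sketched as alpha-spending: hypothesis i is tested at level alpha * gamma_i, where the gamma_i are fixed in advance and sum to at most 1. This is the baseline the paper improves on, not one of its new adaptive algorithms. The function name `online_bonferroni` and the sequence gamma_i = 6 / (pi^2 i^2) are assumptions made for this illustration.

```python
import math


def online_bonferroni(p_values, alpha=0.05):
    """Online Bonferroni (alpha-spending) over a stream of p-values.

    Hypothesis i is rejected when p_i <= alpha * gamma_i. Because the
    gamma_i sum to 1 over all i, a union bound keeps the probability of
    any false discovery at or below alpha, however long the stream runs.
    Assumption: gamma_i = 6 / (pi^2 * i^2), chosen for illustration.
    """
    decisions = []
    for i, p in enumerate(p_values, start=1):
        gamma_i = 6.0 / (math.pi ** 2 * i ** 2)
        decisions.append(p <= alpha * gamma_i)
    return decisions


if __name__ == "__main__":
    print(online_bonferroni([0.001, 0.02, 0.0001]))
```

The test levels shrink with i regardless of what has been observed, which is why adaptive rules that recover unused error budget can gain power.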


Author(s): Peter Hettegger, Klemens Vierlinger, Andreas Weinhaeusel

Abstract

Motivation: Data generated from high-throughput technologies such as sequencing, microarray and bead-chip technologies are unavoidably affected by batch effects (BEs). Considerable effort has been put into developing methods for correcting these effects. Often, BE correction and hypothesis testing cannot be done with a single model, but are done successively with separate models in data analysis pipelines. This potentially leads to biased P-values or false discovery rates due to the influence of BE correction on the data.

Results: We present a novel approach for estimating null distributions of test statistics in data analysis pipelines where BE correction is followed by linear model analysis. The approach is based on generating simulated datasets by random rotation and thereby adequately retains the dependence structure of genes. This allows estimation of null distributions of dependent test statistics, and thus the calculation of resampling-based P-values and false discovery rates following BE correction while maintaining the alpha level.

Availability: The described methods are implemented as the randRotation package on Bioconductor: https://bioconductor.org/packages/randRotation/

Supplementary information: Supplementary data are available at Bioinformatics online.
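Why rotation retains the dependence structure of genes can be seen in a small sketch. Multiplying a genes-by-samples matrix Y on the sample side by an orthogonal matrix Q leaves the gene-by-gene cross-product Y Yᵀ unchanged, because Q Qᵀ = I. The code below demonstrates only that property, with a random orthogonal matrix built by Gram-Schmidt; it ignores the batch and covariate design that an actual analysis has to respect, and it is not the randRotation API. All function names are hypothetical.

```python
import random


def random_orthogonal(n, seed=0):
    """Build a random n x n orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:  # remove components along rows already accepted
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:  # skip a (practically impossible) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate_samples(data, rotation):
    """Return data @ rotation for a genes x samples matrix (lists of lists)."""
    n = len(rotation)
    return [
        [sum(row[k] * rotation[k][j] for k in range(n)) for j in range(n)]
        for row in data
    ]


def gene_cross_product(data):
    """Return data @ data^T, the gene-by-gene cross-product matrix."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in data] for r1 in data]


if __name__ == "__main__":
    y = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]  # 2 genes, 4 samples
    y_rot = rotate_samples(y, random_orthogonal(4, seed=1))
    print(gene_cross_product(y))
    print(gene_cross_product(y_rot))  # equal up to floating-point error
```

Repeating the rotation with fresh random matrices yields simulated datasets with the same between-gene dependence, from which a null distribution of test statistics can be built.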


1993, Vol 69 (01), pp. 021-024
Author(s): Shawn Tinlin, Sandra Webster, Alan R Giles

Summary

The development of inhibitors to factor VIII in patients with haemophilia A remains a serious complication of replacement therapy. An apparently analogous condition has been described in a canine model of haemophilia A (Giles et al., Blood 1984; 63:451). These animals and their relatives have now been followed for 10 years. The observation that the propensity for inhibitor development was not related to the ancestral factor VIII gene has been confirmed by the demonstration of vertical transmission through three generations of the segment of the family related to a normal (non-carrier) female that was introduced for breeding purposes. Haemophilic animals unrelated to this animal have not developed functionally significant factor VIII inhibitors despite intensive factor VIII replacement. Two animals have shown occasional laboratory evidence of factor VIII inhibition, but this has not translated into clinically significant inhibition in vivo as assessed by clinical response and F.VIII recovery and survival characteristics. Substantial heterogeneity of inhibitor expression, both in vitro and in vivo, has been observed between animals and in individual animals over time. Spontaneous loss of inhibitors has been observed without any therapies designed to induce tolerance being instituted. There is also phenotypic evidence of polyclonality of the immune response, with variable expression over time in a given animal. These observations may have relevance to the human condition, both in determining the pathogenetic factors involved and in highlighting the heterogeneity of its expression, which suggests the need for caution in interpreting the outcome of interventions designed to modulate inhibitor activity.

