DiffNetFDR: differential network analysis with false discovery rate control

2019 ◽  
Vol 35 (17) ◽  
pp. 3184-3186
Author(s):  
Xiao-Fei Zhang ◽  
Le Ou-Yang ◽  
Shuo Yang ◽  
Xiaohua Hu ◽  
Hong Yan

Abstract Summary To identify biological network rewiring under different conditions, we develop a user-friendly R package, named DiffNetFDR, to implement two methods for testing the difference between two Gaussian graphical models. Compared to existing tools, our methods have the following features: (i) they are based on Gaussian graphical models, which can capture changes in conditional dependencies; (ii) they determine the tuning parameters in a data-driven manner; (iii) they use a multiple testing procedure to control the overall false discovery rate; and (iv) they define the differential network based on partial correlation coefficients, so that spurious differential edges caused by changes in conditional variances can be excluded. We also develop a Shiny application to provide easier analysis and visualization. Simulation studies are conducted to evaluate the performance of our methods. We also apply our methods to two real gene expression datasets. The effectiveness of our methods is validated by the biological significance of the identified differential networks. Availability and implementation The R package and Shiny app are available at https://github.com/Zhangxf-ccnu/DiffNetFDR. Supplementary information Supplementary data are available at Bioinformatics online.
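To make the partial-correlation idea concrete, here is a minimal Python sketch (not the DiffNetFDR implementation, which is an R package with data-driven regularization and formal FDR control): partial correlations are read off the inverse covariance matrix of each condition, and candidate differential edges correspond to large changes between conditions. The data are synthetic.

```python
import numpy as np

def partial_correlations(X):
    """Estimate partial correlations from the inverse sample covariance.
    Illustration only; DiffNetFDR uses regularized, data-driven estimators."""
    prec = np.linalg.pinv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(prec))
    # rho_ij = -prec_ij / sqrt(prec_ii * prec_jj); adding 2*I sets the diagonal to 1.
    return -prec / np.outer(d, d) + 2 * np.eye(len(d))

rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 10))   # condition 1: n samples x p genes
X2 = rng.normal(size=(200, 10))   # condition 2
delta = partial_correlations(X1) - partial_correlations(X2)
# Candidate differential edge: largest absolute change in partial correlation.
i, j = np.unravel_index(np.argmax(np.abs(np.triu(delta, 1))), delta.shape)
print(f"largest change at edge ({i}, {j}): {delta[i, j]:.3f}")
```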

Author(s):  
Xin Bai ◽  
Jie Ren ◽  
Yingying Fan ◽  
Fengzhu Sun

Abstract Motivation The rapid development of sequencing technologies has enabled us to generate large numbers of metagenomic reads from genetic material in microbial communities, making it possible to gain deep insights into the differences between the genetic material of different groups of microorganisms, such as bacteria, viruses and plasmids. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all k-mers as features for prediction without selecting the k-mers relevant to the different groups of sequences, i.e. unique nucleotide patterns of biological significance. Results To select k-mers that distinguish different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework for sequence motif discovery based on model-X knockoffs, a state-of-the-art statistical method for FDR control, at an arbitrary target FDR level, so that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini–Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and apply KIMI to a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on the relevant k-mers selected by KIMI. Availability and implementation Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. Supplementary information Supplementary data are available at Bioinformatics online.
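For readers unfamiliar with k-mer features, the sketch below shows the basic frequency computation that methods such as KIMI build on. It is an illustrative Python snippet, not part of KIMI; the example sequence is made up.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=3):
    """Return the frequency of every k-mer over the DNA alphabet in `seq`."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)
    return {m: counts[m] / total for m in kmers}

freqs = kmer_frequencies("ACGTACGTGGCCAAT", k=3)
print(freqs["ACG"], freqs["CGT"])
```

Each read or contig is thus turned into a fixed-length numeric vector; feature-selection methods with FDR control then decide which of these coordinates genuinely separate the groups.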


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i745-i753
Author(s):  
Yisu Peng ◽  
Shantanu Jain ◽  
Yong Fuga Li ◽  
Michal Greguš ◽  
Alexander R. Ivanov ◽  
...  

Abstract Motivation Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. Results We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms. Availability and implementation https://github.com/shawn-peng/FDR-estimation. Supplementary information Supplementary data are available at Bioinformatics online.
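A heavily simplified decoy-free sketch, assuming a plain two-component Gaussian mixture in place of the skew-normal, multi-component model described above: fit the mixture to best-PSM scores and estimate the FDR at a score threshold from the posterior probability of the incorrect-match component. The scores and threshold below are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Simulated best-PSM scores: incorrect matches (low) mixed with correct matches (high).
scores = np.concatenate([rng.normal(10, 2, 800), rng.normal(20, 3, 200)])

gm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
incorrect = np.argmin(gm.means_.ravel())          # component with the lower mean
post_incorrect = gm.predict_proba(scores.reshape(-1, 1))[:, incorrect]

threshold = 16.0
accepted = scores >= threshold
# Estimated FDR at the threshold: expected fraction of accepted PSMs that are incorrect.
fdr = post_incorrect[accepted].mean() if accepted.any() else 0.0
print(f"estimated FDR at score >= {threshold}: {fdr:.3f}")
```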


2019 ◽  
Vol 36 (8) ◽  
pp. 2587-2588 ◽  
Author(s):  
Christopher M Ward ◽  
Thu-Hien To ◽  
Stephen M Pederson

Abstract Motivation High-throughput next-generation sequencing (NGS) has become exceedingly cheap, enabling studies with large sample numbers to be undertaken. Quality control (QC) is an essential stage in analytic pipelines, and the outputs of popular bioinformatics tools such as FastQC and Picard can provide information on individual samples. Although these tools provide considerable power when carrying out QC, large sample numbers can make inspection of all samples and identification of systemic bias a challenge. Results We present ngsReports, an R package designed for the management and visualization of NGS reports from within an R environment. The available methods allow direct import of FastQC reports into R, along with outputs from other tools. Visualization can be carried out across many samples using default, highly customizable plots, with options to perform hierarchical clustering to quickly identify outlier libraries. Moreover, these can be displayed in an interactive Shiny app or HTML report for ease of analysis. Availability and implementation The ngsReports package is available on Bioconductor and the GUI Shiny app is available at https://github.com/UofABioinformaticsHub/shinyNgsreports. Supplementary information Supplementary data are available at Bioinformatics online.
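The outlier-detection idea can be illustrated outside R. The sketch below is not ngsReports (a Bioconductor package); it simply clusters per-sample QC metrics hierarchically and flags the library that falls outside the majority cluster, with invented metrics and values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Rows: samples; columns: QC metrics (e.g. mean quality, GC%, duplication rate).
qc = rng.normal([35, 48, 12], [0.5, 1.0, 1.5], size=(20, 3))
qc[19] = [25, 60, 40]                      # one deliberately aberrant library

Z = linkage(pdist(qc), method="ward")      # hierarchical clustering on Euclidean distances
labels = fcluster(Z, t=2, criterion="maxclust")
print("possible outlier libraries:", np.where(labels != np.bincount(labels).argmax())[0])
```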


2019 ◽  
Vol 35 (21) ◽  
pp. 4507-4508 ◽  
Author(s):  
Geremy Clair ◽  
Sarah Reehl ◽  
Kelly G Stratton ◽  
Matthew E Monroe ◽  
Malak M Tfaily ◽  
...  

Abstract Summary Here we introduce Lipid Mini-On, an open-source tool that performs lipid enrichment analyses and visualizations of lipidomics data. Lipid Mini-On uses a text-mining process to bin individual lipid names into multiple lipid ontology groups based on their classification (e.g. LipidMaps) and other characteristics, such as chain length. Lipid Mini-On provides users with the capability to conduct enrichment analysis of the lipid ontology terms using a Shiny app with a choice of five statistical approaches. Lipid classes can be added to customize the user’s database, which can be kept up to date as new lipid classes are discovered. Visualization of results is available for all classification options (e.g. lipid subclass and individual fatty acid chains). Results are also visualized through an editable network of relationships between the individual lipids and their associated lipid ontology terms. The utility of the tool is demonstrated using biological (e.g. human lung endothelial cells) and environmental (e.g. peat soil) samples. Availability and implementation Rodin (R package: https://github.com/PNNL-Comp-Mass-Spec/Rodin), Lipid Mini-On Shiny app (https://github.com/PNNL-Comp-Mass-Spec/LipidMiniOn) and Lipid Mini-On online tool (https://omicstools.pnnl.gov/shiny/lipid-mini-on/). Supplementary information Supplementary data are available at Bioinformatics online.
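Fisher's exact test is one common choice for this kind of ontology-term enrichment. The sketch below is illustrative Python with made-up counts, not Lipid Mini-On; it tests whether a lipid ontology term is over-represented in a query list relative to a background set.

```python
from scipy.stats import fisher_exact

# 2x2 contingency table for one ontology term (e.g. a lipid subclass):
# rows: in query list / in background only; columns: annotated with term / not annotated.
query_with_term, query_without_term = 18, 82
background_with_term, background_without_term = 40, 860

odds_ratio, p_value = fisher_exact(
    [[query_with_term, query_without_term],
     [background_with_term, background_without_term]],
    alternative="greater",   # test for over-representation
)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")
```

The same test is repeated for every ontology term, after which the resulting p-values are adjusted for multiple testing.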


Genes ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 167 ◽  
Author(s):  
Qingyang Zhang

The nonparanormal graphical model has emerged as an important tool for modeling dependency structure between variables because it flexibly accommodates non-Gaussian data while maintaining the good interpretability and computational convenience of Gaussian graphical models. In this paper, we consider the problem of detecting differential substructure between two nonparanormal graphical models with false discovery rate control. We construct a new statistic based on a truncated estimator of the unknown transformation functions, together with a bias-corrected sample covariance. Furthermore, we show that the new test statistic converges to the same distribution as its oracle counterpart. Both synthetic data and real cancer genomic data are used to illustrate the promise of the new method. Our proposed testing framework is simple and scalable, facilitating its application to large-scale data. The computational pipeline has been implemented in the R package DNetFinder, which is freely available through the Comprehensive R Archive Network.
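A minimal sketch of the truncated (Winsorized) empirical-CDF transform that nonparanormal estimation builds on, assuming the truncation level commonly used in the nonparanormal literature; this is not the DNetFinder estimator, and the input data are synthetic.

```python
import numpy as np
from scipy.stats import norm, rankdata

def nonparanormal_transform(X):
    """Map each column to approximate normal scores via a truncated empirical CDF,
    a standard step in nonparanormal estimation (a sketch, not DNetFinder itself)."""
    n = X.shape[0]
    delta = 1.0 / (4.0 * n ** 0.25 * np.sqrt(np.pi * np.log(n)))  # assumed truncation level
    u = rankdata(X, axis=0) / (n + 1)          # empirical CDF values in (0, 1)
    u = np.clip(u, delta, 1.0 - delta)         # truncate tails to stabilize the quantile map
    return norm.ppf(u)

rng = np.random.default_rng(3)
Z = nonparanormal_transform(np.exp(rng.normal(size=(100, 5))))   # heavy-tailed input
print(np.round(np.corrcoef(Z, rowvar=False), 2))
```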


2019 ◽  
Vol 35 (19) ◽  
pp. 3592-3598 ◽  
Author(s):  
Justin G Chitpin ◽  
Aseel Awdeh ◽  
Theodore J Perkins

Abstract Motivation Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown. Results Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability and implementation The RECAP software is available through www.perkinslab.ca or on github at https://github.com/theodorejperkins/RECAP. Supplementary information Supplementary data are available at Bioinformatics online.
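The recalibration idea can be sketched abstractly: given p-values computed on resampled data in which no true enrichment exists, map each observed p-value to its empirical quantile among those null p-values, which is a monotone transform. This is a conceptual Python sketch with synthetic numbers, not the RECAP algorithm itself, which resamples reads and re-runs the peak caller.

```python
import numpy as np

def recalibrate(observed_p, null_p):
    """Map each observed p-value to its empirical quantile among p-values obtained
    from resampled (null) data, a monotone recalibration in the spirit of RECAP."""
    null_p = np.sort(np.asarray(null_p))
    # Fraction of null p-values at or below each observed p-value (with add-one smoothing).
    return (np.searchsorted(null_p, observed_p, side="right") + 1) / (len(null_p) + 1)

rng = np.random.default_rng(4)
# Suppose the peak caller's p-values are anti-conservative even under the null.
null_p = rng.uniform(size=5000) ** 3
observed_p = np.array([1e-4, 1e-3, 0.01, 0.05])
print(np.round(recalibrate(observed_p, null_p), 4))
```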


2004 ◽  
Vol 17 (22) ◽  
pp. 4343-4356 ◽  
Author(s):  
Valérie Ventura ◽  
Christopher J. Paciorek ◽  
James S. Risbey

Abstract The analysis of climatological data often involves statistical significance testing at many locations. While the field significance approach determines if a field as a whole is significant, a multiple testing procedure determines which particular tests are significant. Many such procedures are available, most of which control, for every test, the probability of detecting significance that does not really exist. The aim of this paper is to introduce the novel “false discovery rate” approach, which controls the false rejections in a more meaningful way. Specifically, it controls a priori the expected proportion of falsely rejected tests out of all rejected tests; additionally, the test results are more easily interpretable. The paper also investigates the best way to apply a false discovery rate (FDR) approach to spatially correlated data, which are common in climatology. The most straightforward method for controlling the FDR makes an assumption of independence between tests, while other FDR-controlling methods make less stringent assumptions. In a simulation study involving data with correlation structure similar to that of a real climatological dataset, the simple FDR method does control the proportion of falsely rejected hypotheses despite the violation of assumptions, while a more complicated method involves more computation with little gain in detecting alternative hypotheses. A very general method that makes no assumptions controls the proportion of falsely rejected hypotheses but at the cost of detecting few alternative hypotheses. Despite its unrealistic independence assumption, the authors suggest, on the basis of the simulation results, the use of the straightforward FDR-controlling method, and they provide a simple modification that increases the power to detect alternative hypotheses.
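The straightforward FDR-controlling method referred to here corresponds to the classic Benjamini-Hochberg procedure. A short Python sketch on synthetic grid-point p-values:

```python
import numpy as np

def benjamini_hochberg(p, q=0.05):
    """Return a boolean mask of rejected hypotheses under the BH procedure at level q."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * (np.arange(1, m + 1) / m)
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])      # largest index meeting the BH threshold
        rejected[order[: k + 1]] = True
    return rejected

rng = np.random.default_rng(5)
# p-values at many grid points: mostly null, a few genuine signals.
p_values = np.concatenate([rng.uniform(size=950), rng.uniform(0, 1e-4, size=50)])
print("rejections:", benjamini_hochberg(p_values, q=0.05).sum())
```

The procedure assumes independence (or positive dependence) between tests, which is exactly the assumption the paper examines for spatially correlated climate fields.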


2012 ◽  
Vol 2012 ◽  
pp. 1-14 ◽  
Author(s):  
Aiping Liu ◽  
Junning Li ◽  
Z. Jane Wang ◽  
Martin J. McKeown

Graphical models appear well suited for inferring brain connectivity from fMRI data, as they can distinguish between direct and indirect brain connectivity. Nevertheless, biological interpretation requires not only that the multivariate time series are adequately modeled, but also that there is accurate error control of the inferred edges. The PCfdr algorithm, developed by Li and Wang, provides a computationally efficient means to control the false discovery rate (FDR) of computed edges asymptotically. The original PCfdr algorithm was unable to accommodate a priori information about connectivity and was designed to infer connectivity from a single subject rather than a group of subjects. Here we extend the original PCfdr algorithm and propose a multisubject, error-rate-controlled brain connectivity modeling approach that allows incorporation of prior knowledge of connectivity. In simulations, we show that the two proposed extensions can still control the FDR around or below a specified threshold. When the proposed approach is applied to fMRI data from a Parkinson’s disease study, we find robust group evidence of disease-related changes, compensatory changes, and the normalizing effect of L-dopa medication. The proposed method provides a robust, accurate, and practical approach for the assessment of brain connectivity patterns from functional neuroimaging data.
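One simple way to assess edges at the group level with FDR control, sketched here under assumed Gaussian edge statistics and not to be read as the PCfdr algorithm: Fisher z-transform each subject's partial correlations, run a one-sample t-test per edge, and adjust with Benjamini-Hochberg. The sketch assumes statsmodels is installed; the data are synthetic.

```python
import numpy as np
from scipy.stats import ttest_1samp
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(6)
# Per-subject edge weights: partial correlations for 15 subjects x 45 edges
# (the upper triangle of a 10-node connectivity graph).
edges = rng.normal(0, 0.1, size=(15, 45))
edges[:, :5] += 0.4                          # five edges with a genuine group-level effect

z = np.arctanh(edges)                        # Fisher z-transform of partial correlations
t_stat, p = ttest_1samp(z, popmean=0.0, axis=0)
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method="fdr_bh")
print("edges declared present at 5% FDR:", np.where(reject)[0])
```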

