Dimension constraints improve hypothesis testing for large-scale, graph-associated, brain-image data

Biostatistics ◽  
2021 ◽  
Author(s):  
Tien Vo ◽  
Akshay Mishra ◽  
Vamsi Ithapu ◽  
Vikas Singh ◽  
Michael A Newton

Summary: For large-scale testing with graph-associated data, we present an empirical Bayes mixture technique to score local false-discovery rates (FDRs). Compared to procedures that ignore the graph, the proposed Graph-based Mixture Model (GraphMM) method gains power in settings where non-null cases form connected subgraphs, and it does so by regularizing parameter contrasts between testing units. Simulations show that GraphMM controls the FDR in a variety of settings, though it may lose control with excessive regularization. On magnetic resonance imaging data from a study of brain changes associated with the onset of Alzheimer’s disease, GraphMM produces greater yield than conventional large-scale testing procedures.
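
The abstract describes a two-group empirical Bayes construction: each testing unit receives a local FDR score, its posterior probability of being null given its statistic. A minimal R sketch of that generic machinery follows, using a theoretical N(0,1) null and a kernel estimate of the marginal density; this is the standard ungraphed construction that GraphMM extends, not the graph-regularized estimator itself.

```r
# Two-group local FDR from z-scores: lfdr(z) = pi0 * f0(z) / f(z).
# Illustrative only; GraphMM additionally regularizes contrasts over a graph.
local_fdr <- function(z, pi0 = 1) {
  f0 <- dnorm(z)                   # theoretical null density N(0, 1)
  f  <- approxfun(density(z))(z)   # kernel estimate of the marginal density
  pmin(pi0 * f0 / f, 1)            # local FDR, capped at 1
}

set.seed(1)
z    <- c(rnorm(900), rnorm(100, mean = 3))  # 10% non-null testing units
lfdr <- local_fdr(z)
sum(lfdr < 0.2)                              # units flagged at lfdr < 0.2
```

Units with small local FDR are flagged; GraphMM's contribution is to borrow strength along graph edges so that connected non-null subgraphs score low together.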

2016 ◽  
Author(s):  
Matthew Stephens

Abstract: We introduce a new empirical Bayes approach for large-scale hypothesis testing, including estimating false discovery rates (FDRs) and effect sizes. This approach has two key differences from existing approaches to FDR analysis. First, it assumes that the distribution of the actual (unobserved) effects is unimodal, with a mode at 0. This “unimodal assumption” (UA), although natural in many contexts, is not usually incorporated into standard FDR analysis, and we demonstrate how incorporating it brings many benefits. Specifically, the UA facilitates efficient and robust computation (estimating the unimodal distribution involves solving a simple convex optimization problem) and enables more accurate inferences provided that it holds. Second, the method takes as its input two numbers for each test (an effect size estimate and its corresponding standard error), rather than the one number usually used (a p-value or z-score). When available, using two numbers instead of one helps account for variation in measurement precision across tests. It also facilitates estimation of effects, and unlike standard FDR methods, our approach provides interval estimates (credible regions) for each effect in addition to measures of significance. To provide a bridge between interval estimates and significance measures, we introduce the term “local false sign rate” for the probability of getting the sign of an effect wrong, and argue that it is a better measure of significance than the local FDR because it is more generally applicable and can be more robustly estimated. Our methods are implemented in an R package, ashr, available from http://github.com/stephens999/ashr.
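
Because the abstract specifies the interface precisely (two numbers per test), a short usage sketch of ashr follows. The function ash() and the accessors get_lfsr() and get_pm() are taken from the package's documented interface; the simulated inputs are illustrative only.

```r
# install.packages("ashr")  # or devtools::install_github("stephens999/ashr")
library(ashr)

set.seed(1)
beta    <- c(rep(0, 800), rnorm(200, sd = 2))  # true effects, unimodal about 0
se      <- sqrt(rchisq(1000, df = 5) / 5)      # varying measurement precision
betahat <- rnorm(1000, mean = beta, sd = se)   # observed effect estimates

# ash() takes two numbers per test: an effect estimate and its standard error.
fit <- ash(betahat, se)

head(get_lfsr(fit))  # local false sign rate for each effect
head(get_pm(fit))    # shrunken posterior mean effect estimates
```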


2017 ◽  
Author(s):  
Kerstin Scheubert ◽  
Franziska Hufsky ◽  
Daniel Petras ◽  
Mingxun Wang ◽  
Louis-Félix Nothias ◽  
...  

Abstract: The annotation of small molecules in untargeted mass spectrometry relies on the matching of fragment spectra to reference library spectra. While various spectrum-spectrum match scores exist, the field lacks statistical methods for estimating the false discovery rate (FDR) of these annotations. We present empirical Bayes and target-decoy-based methods to estimate the false discovery rate. Relying on these FDR estimates, we explore the effect of different spectrum-spectrum match criteria on the number and the nature of the molecules annotated, and show that the spectral matching settings need to be adjusted for each project. By adjusting the scoring parameters and thresholds, the number of annotations rose, on average, by +139% (ranging from −92% up to +5705%) compared to a default parameter set available at GNPS. The FDR estimation methods presented here will enable users to define scoring criteria for large-scale analysis of untargeted small-molecule data, a form of error control that has been essential to the advancement of large-scale proteomics, transcriptomics, and genomics.
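
The target-decoy idea translates directly into code: matches against a decoy library, which by construction contains no true answers, estimate how many target matches above a score threshold are false. A minimal R sketch of that estimator, with simulated match scores standing in for real spectrum-spectrum matches (the paper's decoy construction for spectral libraries is considerably more involved):

```r
# Target-decoy FDR: decoy hits above a threshold estimate false target hits,
# so FDR(t) ~= #decoy(score >= t) / #target(score >= t).
fdr_at <- function(target_scores, decoy_scores, t) {
  sum(decoy_scores >= t) / max(sum(target_scores >= t), 1)
}

set.seed(1)
decoy  <- rnorm(5000, mean = 0.30, sd = 0.10)                  # false matches only
target <- c(rnorm(4000, 0.30, 0.10), rnorm(1000, 0.75, 0.08))  # false and true mixed

# Most permissive score threshold with estimated FDR <= 1%.
grid   <- sort(unique(round(target, 3)))
t_star <- min(grid[sapply(grid, function(t) fdr_at(target, decoy, t)) <= 0.01])
c(threshold = t_star, annotations = sum(target >= t_star))
```

Loosening the threshold admits more annotations at a higher estimated FDR, which is exactly the project-specific trade-off the abstract describes.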


Author(s):  
Balthasar Bickel

Large-scale areal patterns point to ancient population history and form a well-known confound for language universals. Despite their importance, demonstrating such patterns remains a challenge. This chapter argues that large-scale areal hypotheses are better tested by modeling diachronic family biases than by controlling for genealogical relations in regression models. A case study of the Trans-Pacific area reveals that diachronic bias estimates do not depend much on the amount of phylogenetic information used when inferring them. After controlling for false discovery rates, about 39 variables in WALS and AUTOTYP show diachronic biases that differ significantly inside vs. outside the Trans-Pacific area. Nearly three times as many biases hold outside the Trans-Pacific area as inside it, indicating that the area is characterized not so much by the spread of biases as by the retention of earlier diversity, in line with earlier suggestions in the literature.


2020 ◽  
pp. 1027-1038
Author(s):  
Jonas Scherer ◽  
Marco Nolden ◽  
Jens Kleesiek ◽  
Jasmin Metzger ◽  
Klaus Kades ◽  
...  

PURPOSE: Image analysis is one of the most promising applications of artificial intelligence (AI) in health care, potentially improving the prediction, diagnosis, and treatment of diseases. Although scientific advances in this area depend critically on access to large-volume, high-quality data, sharing data between institutions faces ethical and legal constraints as well as organizational and technical obstacles.

METHODS: The Joint Imaging Platform (JIP) of the German Cancer Consortium (DKTK) addresses these issues by providing federated data analysis technology in a secure and compliant way. Using the JIP, medical image data remain in the originating institutions, while analysis and AI algorithms are shared and jointly used. Common standards and interfaces to local systems ensure permanent data sovereignty for participating institutions.

RESULTS: The JIP is established in the radiology and nuclear medicine departments of 10 university hospitals in Germany (DKTK partner sites). In multiple complementary use cases, we show that the platform fulfills all relevant requirements to serve as a foundation for multicenter medical imaging trials and research on large cohorts, including harmonization and integration of data, interactive analysis, automatic analysis, federated machine learning, and the extensibility and maintenance processes that are elementary for the sustainability of such a platform.

CONCLUSION: The results demonstrate the feasibility of the JIP as a federated data analytics platform in heterogeneous clinical information technology and software landscapes, removing an important bottleneck for the application of AI to large-scale clinical imaging data.
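
The core federated pattern described here (data stay put, algorithms travel) can be illustrated with the simplest federated computation: each site fits a model locally and ships only its fitted coefficients and sample size to a coordinator. The R sketch below is a toy illustration of that pattern, not the JIP's actual containerized workflow; all names and the weighting scheme are assumptions for the example.

```r
# Toy federated averaging: each site fits locally; the coordinator only
# ever sees coefficients and sample sizes, never the raw records.
fit_local <- function(site_data) {
  m <- lm(y ~ x1 + x2, data = site_data)
  list(coef = coef(m), n = nrow(site_data))
}

federated_average <- function(local_fits) {
  n <- vapply(local_fits, `[[`, numeric(1), "n")
  B <- sapply(local_fits, `[[`, "coef")  # one column of coefficients per site
  drop(B %*% (n / sum(n)))               # sample-size-weighted average
}

set.seed(1)
sites <- lapply(c(120, 300, 80), function(n) {
  d   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
  d
})
federated_average(lapply(sites, fit_local))
```

The same pattern extends to iterative federated learning, where the coordinator broadcasts the averaged parameters back to the sites for further local updates.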


2015 ◽  
Author(s):  
Xiaobei Zhou ◽  
Charity W Law ◽  
Mark D Robinson

benchmarkR is an R package designed to assess and visualize the performance of statistical methods for datasets that have an independent truth (e.g., simulations or datasets with large-scale validation), in particular for methods that claim to control false discovery rates (FDR). We augment some of the standard performance plots (e.g., receiver operating characteristic, or ROC, curves) with information about how well the methods are calibrated (i.e., whether they achieve their expected FDR control). For example, performance plots are extended with a point to highlight the power or FDR at a user-set threshold (e.g., at a method's estimated 5% FDR). The package contains general containers to store simulation results (SimResults) and methods to create graphical summaries, such as receiver operating characteristic curves (rocX), false discovery plots (fdX) and power-to-achieved FDR plots (powerFDR); each plot is augmented with some form of calibration information. We find these plots to be an improved way to interpret relative performance of statistical methods for genomic datasets where many hypothesis tests are performed. The strategies, however, are general and will find applications in other domains.
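
A hedged usage sketch of the intended workflow follows. SimResults, rocX, fdX, and powerFDR are the names given above, but the constructor's argument names (pval, padj, labels) are assumptions based on the package description rather than verified signatures.

```r
# Sketch of the intended workflow; argument names are assumed, not verified.
library(benchmarkR)

set.seed(1)
truth <- rep(c(0, 1), c(900, 100))     # independent truth: 100 true signals
p     <- runif(1000)                   # null p-values are uniform
p[truth == 1] <- rbeta(100, 0.2, 1)    # signals concentrate near zero
padj  <- p.adjust(p, method = "BH")

sr <- SimResults(pval = p, padj = padj, labels = truth)  # results container
rocX(sr)      # ROC curve with a calibration point at the 5% FDR threshold
fdX(sr)       # false discovery plot
powerFDR(sr)  # power versus achieved FDR
```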


2020 ◽  
Author(s):  
Dwight Kravitz ◽  
Stephen Mitroff

Large-scale replication failures have shaken confidence in the social sciences, psychology in particular. Most researchers acknowledge the problem, yet there is widespread debate about the causes and solutions. Using “big data,” the current project demonstrates that unintended consequences of three common questionable research practices (retaining pilot data, adding data after checking for significance, and not publishing null findings) can explain the lion’s share of the replication failures. A massive dataset was randomized to create a true null effect between two conditions, and then these three practices were applied. They produced false discovery rates far greater than the conventionally accepted 5%, and were strong enough to obscure, or even reverse, the direction of real effects. These demonstrations suggest that much of the replication crisis might be explained by simple, misguided experimental choices. This approach also yields empirically based corrections to account for these practices when they are unavoidable, providing a viable path forward.
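
The second of the three practices, adding data after checking for significance, is easy to reproduce. A minimal R simulation under a true null (parameters are illustrative, not the paper's): test after an initial batch and, whenever p ≥ .05, add more observations and test again.

```r
# Optional stopping under a true null: test, and if not significant,
# add more data and test again. Nominal alpha is 5% per experiment.
set.seed(1)
one_experiment <- function(n0 = 20, n_add = 10, max_looks = 5) {
  a <- rnorm(n0); b <- rnorm(n0)   # two conditions, true null effect
  for (look in seq_len(max_looks)) {
    if (t.test(a, b)$p.value < 0.05) return(TRUE)
    a <- c(a, rnorm(n_add)); b <- c(b, rnorm(n_add))
  }
  FALSE
}

mean(replicate(5000, one_experiment()))  # realized false positive rate
```

In this configuration the realized false positive rate comes out well above the nominal 5%, which is the basic mechanism the paper scales up with big data.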


Biometrika ◽  
2019 ◽  
Vol 106 (4) ◽  
pp. 841-856 ◽  
Author(s):  
Jelle J Goeman ◽  
Rosa J Meijer ◽  
Thijmen J P Krebs ◽  
Aldo Solari

Summary: Closed testing procedures are classically used for familywise error rate control, but they can also be used to obtain simultaneous confidence bounds for the false discovery proportion in all subsets of the hypotheses, allowing for inference robust to post hoc selection of subsets. In this paper we investigate the special case of closed testing with Simes local tests. We construct a novel fast and exact shortcut and use it to investigate the power of this approach when the number of hypotheses goes to infinity. We show that if a minimal level of signal is present, the average power to detect false hypotheses at any desired false discovery proportion does not vanish. Additionally, we show that the confidence bounds for false discovery proportion are consistent estimators for the true false discovery proportion for every nonvanishing subset. We also show close connections between Simes-based closed testing and the procedure of Benjamini and Hochberg.
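
The Simes local test at the heart of this construction is compact enough to state inline: the intersection hypothesis over a set of m p-values is rejected at level α when p_(i) ≤ iα/m for some i, equivalently when the Simes combination p-value min_i (m/i) p_(i) falls below α. A small R sketch of that local test follows; it implements only the local test, not the paper's fast shortcut through the full closed testing procedure (which, to our knowledge, the authors provide in the hommel R package).

```r
# Simes combination p-value for an intersection hypothesis:
# p_Simes = min_i (m / i) * p_(i), where p_(1) <= ... <= p_(m) are sorted.
simes_p <- function(p) {
  m <- length(p)
  min(m * sort(p) / seq_len(m))
}

set.seed(1)
p_null   <- runif(20)                        # all 20 hypotheses true
p_signal <- c(runif(15), rbeta(5, 0.1, 1))   # five false hypotheses

simes_p(p_null)    # typically large: intersection not rejected
simes_p(p_signal)  # typically small: some hypothesis in the set is false
```

Closed testing applies this local test to every subset of hypotheses; the paper's shortcut is what makes that exponential-looking sweep computationally feasible.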

