Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution

Mapping Intimacies ◽

10.1101/055772 ◽

2016 ◽

Cited By ~ 1

Author(s):

Maarten van Iterson ◽

Erik van Zwet ◽

P. Eline Slagboom ◽

Bastiaan T. Heijmans ◽

Keyword(s):

Association Studies ◽

Meta Analysis ◽

False Positive Rate ◽

Null Distribution ◽

Simulation Studies ◽

Test Statistics ◽

Empirical Null Distribution ◽

Novel Approach ◽

Level Data ◽

Positive Rate

ABSTRACTAssociation studies on omic-level data other then genotypes (GWAS) are becoming increasingly common, i.e., epigenome-and transcriptome-wide association studies (EWAS/TWAS). However, a tool box for the analysis of EWAS and TWAS studies is largely lacking and often approaches from GWAS are applied despite the fact that epigenome and transcriptome data have vedifferent characteristics than genotypes. Here, we show that EWASs and TWASs are prone not only to significant inflation but also bias of the test statistics and that these are not properly addressed by GWAS-based methodology (i.e. genomic control) and state-of-the-art approaches to control for unmeasured confounding (i.e. RUV, sva and cate). We developed a novel approach that is based on the estimation of the empirical null distribution using Bayesian statistics. Using simulation studies and empirical data, we demonstrate that our approach maximizes power while properly controlling the false positive rate. Finally, we illustrate the utility of our method in the application of meta-analysis by performing EWASs and TWASs on age and smoking which highlighted an overlap in differential methylation and expression of associated genes. An implementation of our new method is available from http://bioconductor.org/packages/bacon/.

Download Full-text

Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes

10.1101/100701 ◽

2017 ◽

Author(s):

Dat Duong ◽

Lisa Gai ◽

Sagi Snir ◽

Eun Yong Kang ◽

Buhm Han ◽

...

Keyword(s):

Gene Expression ◽

Association Studies ◽

Meta Analysis ◽

False Positive Rate ◽

Tissue Expression ◽

Great Effort ◽

Genome Wide Association Studies ◽

Biological Insight ◽

Positive Rate ◽

Computationally Intensive

AbstractDuring the last decade, with the advent of inexpensive microarray and RNA-seq technologies, there have been many expression quantitative trait loci (eQTL) studies for identifying genetic variants called eQTLs that regulate gene expression. Discovering eQTLs has been increasingly important as they may elucidate the functional consequence of non-coding variants identified from genome-wide association studies. Recently, several eQTL studies such as the Genotype-Tissue Expression (GTEx) consortium have made a great effort to obtain gene expression from multiple tissues. One advantage of these multi-tissue eQTL datasets is that they may allow one to identify more eQTLs by combining information across multiple tissues. Although a few methods have been proposed for multi-tissue eQTL studies, they are often computationally intensive and may not achieve optimal power because they do not consider a biological insight that a genetic variant regulates gene expression similarly in related tissues. In this paper, we propose an efficient meta-analysis approach for identifying eQTLs from large multi-tissue eQTL datasets. We name our method RECOV because it uses a random effects (RE) meta-analysis with an explicit covariance (COV) term to model the correlation of effect that eQTLs have across tissues. Our approach is faster than the previous approaches and properly controls the false-positive rate. We apply our approach to the real multi-tissue eQTL dataset from GTEx that contains 44 tissues, and show that our approach detects more eQTLs and eGenes than previous approaches.

Download Full-text

Identification of and Correction for Publication Bias: Comment

10.31222/osf.io/dh87m ◽

2019 ◽

Author(s):

Amanda Kvarven ◽

Eirik Strømland ◽

Magnus Johannesson

Keyword(s):

Publication Bias ◽

False Positive ◽

Large Scale ◽

Meta Analysis ◽

False Positive Rate ◽

Effect Sizes ◽

Replication Studies ◽

Moderate Reduction ◽

Positive Rate ◽

Meta Analyses

Andrews & Kasy (2019) propose an approach for adjusting effect sizes in meta-analysis for publication bias. We use the Andrews-Kasy estimator to adjust the result of 15 meta-analyses and compare the adjusted results to 15 large-scale multiple labs replication studies estimating the same effects. The pre-registered replications provide precisely estimated effect sizes, which do not suffer from publication bias. The Andrews-Kasy approach leads to a moderate reduction of the inflated effect sizes in the meta-analyses. However, the approach still overestimates effect sizes by a factor of about two or more and has an estimated false positive rate of between 57% and 100%.

Download Full-text

Estimating the statistical performance of different approaches to meta-analysis of data from animal studies in identifying the impact of aspects of study design

10.1101/256776 ◽

2018 ◽

Cited By ~ 4

Author(s):

Qianying Wang ◽

Jing Liao ◽

Kaitlyn Hair ◽

Alexandra Bannach-Brown ◽

Zsanett Bahor ◽

...

Keyword(s):

Effect Size ◽

False Positive ◽

Latent Variable ◽

Statistical Power ◽

Animal Studies ◽

Meta Analysis ◽

False Positive Rate ◽

Mean Difference ◽

Positive Rate ◽

Meta Regression

AbstractBackgroundMeta-analysis is increasingly used to summarise the findings identified in systematic reviews of animal studies modelling human disease. Such reviews typically identify a large number of individually small studies, testing efficacy under a variety of conditions. This leads to substantial heterogeneity, and identifying potential sources of this heterogeneity is an important function of such analyses. However, the statistical performance of different approaches (normalised compared with standardised mean difference estimates of effect size; stratified meta-analysis compared with meta-regression) is not known.MethodsUsing data from 3116 experiments in focal cerebral ischaemia to construct a linear model predicting observed improvement in outcome contingent on 25 independent variables. We used stochastic simulation to attribute these variables to simulated studies according to their prevalence. To ascertain the ability to detect an effect of a given variable we introduced in addition this “variable of interest” of given prevalence and effect. To establish any impact of a latent variable on the apparent influence of the variable of interest we also introduced a “latent confounding variable” with given prevalence and effect, and allowed the prevalence of the variable of interest to be different in the presence and absence of the latent variable.ResultsGenerally, the normalised mean difference (NMD) approach had higher statistical power than the standardised mean difference (SMD) approach. Even when the effect size and the number of studies contributing to the meta-analysis was small, there was good statistical power to detect the overall effect, with a low false positive rate. For detecting an effect of the variable of interest, stratified meta-analysis was associated with a substantial false positive rate with NMD estimates of effect size, while using an SMD estimate of effect size had very low statistical power. Univariate and multivariable meta-regression performed substantially better, with low false positive rate for both NMD and SMD approaches; power was higher for NMD than for SMD. The presence or absence of a latent confounding variables only introduced an apparent effect of the variable of interest when there was substantial asymmetry in the prevalence of the variable of interest in the presence or absence of the confounding variable.ConclusionsIn meta-analysis of data from animal studies, NMD estimates of effect size should be used in preference to SMD estimates, and meta-regression should, where possible, be chosen over stratified meta-analysis. The power to detect the influence of the variable of interest depends on the effect of the variable of interest and its prevalence, but unless effects are very large adequate power is only achieved once at least 100 experiments are included in the meta-analysis.

Download Full-text

A Novel Approach for Computer-Aided Diagnosis for Distinction Between Benign and Malignant of Lung Nodules Based on Machine Learning Techniques

Handbook of Research on Information Security in Biomedical Signal Processing - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-5225-5152-2.ch014 ◽

2018 ◽

pp. 281-290

Author(s):

Shashidhara Bola

Keyword(s):

False Positive Rate ◽

Lung Nodule ◽

Machine Learning Techniques ◽

Lung Nodules ◽

Data Set ◽

Novel Approach ◽

Learning Techniques ◽

Nodule Shape ◽

Positive Rate ◽

Aided Diagnosis

A new method is proposed to classify the lung nodules as benign and malignant. The method is based on analysis of lung nodule shape, contour, and texture for better classification. The data set consists of 39 lung nodules of 39 patients which contain 19 benign and 20 malignant nodules. Lung regions are segmented based on morphological operators and lung nodules are detected based on shape and area features. The proposed algorithm was tested on LIDC (lung image database consortium) datasets and the results were found to be satisfactory. The performance of the method for distinction between benign and malignant was evaluated by the use of receiver operating characteristic (ROC) analysis. The method achieved area under the ROC curve was 0.903 which reduces the false positive rate.

Download Full-text

Convolutional Neural Networks for the Segmentation of Microcalcification in Mammography Imaging

Journal of Healthcare Engineering ◽

10.1155/2019/9360941 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 8

Author(s):

Gabriele Valvano ◽

Gianmarco Santini ◽

Nicola Martini ◽

Andrea Ripoli ◽

Chiara Iacconi ◽

...

Keyword(s):

Breast Cancer ◽

Neural Networks ◽

Deep Learning ◽

Convolutional Neural Networks ◽

False Positive ◽

False Positive Rate ◽

Novel Approach ◽

Positive Rate ◽

Microcalcification Detection ◽

Microcalcification Clusters

Cluster of microcalcifications can be an early sign of breast cancer. In this paper, we propose a novel approach based on convolutional neural networks for the detection and segmentation of microcalcification clusters. In this work, we used 283 mammograms to train and validate our model, obtaining an accuracy of 99.99% on microcalcification detection and a false positive rate of 0.005%. Our results show how deep learning could be an effective tool to effectively support radiologists during mammograms examination.

Download Full-text

Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms

Bioinformatics ◽

10.1093/bioinformatics/btz447 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5146-5154 ◽

Cited By ~ 19

Author(s):

Joanna Zyla ◽

Michal Marczyk ◽

Teresa Domaszewska ◽

Stefan H E Kaufmann ◽

Joanna Polanska ◽

...

Keyword(s):

False Positive Rate ◽

R Package ◽

Supplementary Information ◽

Computational Time ◽

P Value ◽

Gene Set ◽

Related Data ◽

Novel Approach ◽

Positive Rate ◽

Real World Datasets

Abstract Motivation Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies. Results We evaluated eight established and one novel algorithm for reproducibility, sensitivity, prioritization, false positive rate and computational time. In addition to eight established algorithms, we also included Coincident Extreme Ranks in Numerical Observations (CERNO), a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to ranking metrics, as well as sample and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility. Availability and implementation tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html) and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are widely available in the KEGGdzPathwaysGEO, KEGGandMetacoreDzPathwaysGEO R package and GEO repository. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Novel approach to control false positive rate in fuzzy cluster analysis of fMRI

10.1117/12.535324 ◽

2004 ◽

Author(s):

Hesamoddin Jahanian ◽

Hamid Soltanian-Zadeh ◽

Gholam-Ali Hossein-Zadeh

Keyword(s):

Cluster Analysis ◽

False Positive ◽

False Positive Rate ◽

Fuzzy Cluster ◽

Fuzzy Cluster Analysis ◽

Novel Approach ◽

Positive Rate

Download Full-text

A fast mrMLM algorithm for multi-locus genome-wide association studies

10.1101/341784 ◽

2018 ◽

Cited By ~ 23

Author(s):

Cox Lwaka Tamba ◽

Yuan-Ming Zhang

Keyword(s):

False Positive ◽

Statistical Power ◽

Association Studies ◽

False Positive Rate ◽

Real Data ◽

High Accuracy ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Positive Rate

AbstractBackgroundRecent developments in technology result in the generation of big data. In genome-wide association studies (GWAS), we can get tens of million SNPs that need to be tested for association with a trait of interest. Indeed, this poses a great computational challenge. There is a need for developing fast algorithms in GWAS methodologies. These algorithms must ensure high power in QTN detection, high accuracy in QTN estimation and low false positive rate.ResultsHere, we accelerated mrMLM algorithm by using GEMMA idea, matrix transformations and identities. The target functions and derivatives in vector/matrix forms for each marker scanning are transformed into some simple forms that are easy and efficient to evaluate during each optimization step. All potentially associated QTNs with P-values ≤ 0.01 are evaluated in a multi-locus model by LARS algorithm and/or EM-Empirical Bayes. We call the algorithm FASTmrMLM. Numerical simulation studies and real data analysis validated the FASTmrMLM. FASTmrMLM reduces the running time in mrMLM by more than 50%. FASTmrMLM also shows high statistical power in QTN detection, high accuracy in QTN estimation and low false positive rate as compared to GEMMA, FarmCPU and mrMLM. Real data analysis shows that FASTmrMLM was able to detect more previously reported genes than all the other methods: GEMMA/EMMA, FarmCPU and mrMLM.ConclusionsFASTmrMLM is a fast and reliable algorithm in multi-locus GWAS and ensures high statistical power, high accuracy of estimates and low false positive rate.Author SummaryThe current developments in technology result in the generation of a vast amount of data. In genome-wide association studies, we can get tens of million markers that need to be tested for association with a trait of interest. Due to the computational challenge faced, we developed a fast algorithm for genome-wide association studies. Our approach is a two stage method. In the first step, we used matrix transformations and identities to quicken the testing of each random marker effect. The target functions and derivatives which are in vector/matrix forms for each marker scanning are transformed into some simple forms that are easy and efficient to evaluate during each optimization step. In the second step, we selected all potentially associated SNPs and evaluated them in a multi-locus model. From simulation studies, our algorithm significantly reduces the computing time. The new method also shows high statistical power in detecting significant markers, high accuracy in marker effect estimation and low false positive rate. We also used the new method to identify relevant genes in real data analysis. We recommend our approach as a fast and reliable method for carrying out a multi-locus genome-wide association study.

Download Full-text

Statistical testing and power analysis for brain-wide association study

10.1101/089870 ◽

2016 ◽

Cited By ~ 1

Author(s):

Weikang Gong ◽

Lin Wan ◽

Wenlian Lu ◽

Liang Ma ◽

Fan Cheng ◽

...

Keyword(s):

False Positive ◽

Power Analysis ◽

Statistical Power ◽

Spatial Information ◽

Association Studies ◽

False Positive Rate ◽

Gaussian Random Field ◽

Resting State Fmri ◽

Statistical Testing ◽

Positive Rate

AbstractThe identification of connexel-wise associations, which involves examining functional connectivities between pairwise voxels across the whole brain, is both statistically and computationally challenging. Although such a connexel-wise methodology has recently been adopted by brain-wide association studies (BWAS) to identify connectivity changes in several mental disorders, such as schizophrenia, autism and depression [Cheng et al., 2015a,b, 2016], the multiple correction and power analysis methods designed specifically for connexel-wise analysis are still lacking. Therefore, we herein report the development of a rigorous statistical framework for connexel-wise significance testing based on the Gaussian random field theory. It includes controlling the family-wise error rate (FWER) of multiple hypothesis testings using topological inference methods, and calculating power and sample size for a connexel-wise study. Our theoretical framework can control the false-positive rate accurately, as validated empirically using two resting-state fMRI datasets. Compared with Bonferroni correction and false discovery rate (FDR), it can reduce false-positive rate and increase statistical power by appropriately utilizing the spatial information of fMRI data. Importantly, our method considerably reduces the computational complexity of a permutation-or simulation-based approach, thus, it can efficiently tackle large datasets with ultra-high resolution images. The utility of our method is shown in a case-control study. Our approach can identify altered functional connectivities in a major depression disorder dataset, whereas existing methods failed. A software package is available at https://github.com/weikanggong/BWAS.

Download Full-text

DISEASES: Text mining and data integration of disease–gene associations

10.1101/008425 ◽

2014 ◽

Cited By ~ 7

Author(s):

Sune Pletscher-Frankild ◽

Albert Pallejà ◽

Kalliopi Tsafou ◽

Janos X Binder ◽

Lars Juhl Jensen

Keyword(s):

Text Mining ◽

Disease Gene ◽

Association Studies ◽

False Positive Rate ◽

Named Entity Recognition ◽

Entity Recognition ◽

Genome Wide Association Studies ◽

Human Genes ◽

Gene Associations ◽

Positive Rate

Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease–gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease–gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a user-friendly web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.

Download Full-text