simGWAS: a fast method for simulation of large scale case-control GWAS summary statistics

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratised the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some “truth” is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and Implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Contact [email protected] Supplementary Information Supplementary Information is appended.

simGWAS: a fast method for simulation of large scale case–control GWAS summary statistics

Bioinformatics ◽

10.1093/bioinformatics/bty898 ◽

2018 ◽

Vol 35 (11) ◽

pp. 1901-1906 ◽

Cited By ~ 4

Author(s):

Mary D Fortune ◽

Chris Wallace

Keyword(s):

Large Scale ◽

Simulated Data ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Supplementary Information ◽

Intermediate Step ◽

Fast Method ◽

Summary Statistics ◽

Causal Variants

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Supplementary information Supplementary data are available at Bioinformatics online.
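
The idea of "simulating random variates about these expected values" can be illustrated with a minimal Python sketch. It is not the simGWAS R package or its API. It assumes, for illustration only, that the expected Z-scores are already known and that the noise around them is multivariate normal with the SNP correlation (LD) matrix as covariance; the abstract does not state the distribution, and the derivation of the expected values from haplotype frequencies is not reproduced here. The names `cholesky`, `simulate_z_scores`, `expected_z` and `ld_matrix`, and all numeric inputs, are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulate_z_scores(expected_z, ld_matrix, n_reps, seed=0):
    """Draw n_reps Z-score vectors with mean expected_z and covariance ld_matrix."""
    rng = random.Random(seed)
    lower = cholesky(ld_matrix)
    n = len(expected_z)
    draws = []
    for _ in range(n_reps):
        # Independent standard normal noise, then correlate it with L.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append([
            expected_z[i] + sum(lower[i][k] * noise[k] for k in range(i + 1))
            for i in range(n)
        ])
    return draws


# Hypothetical inputs: three SNPs, the first two in strong LD.
expected_z = [4.0, 3.2, 0.5]
ld_matrix = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

sims = simulate_z_scores(expected_z, ld_matrix, n_reps=2000, seed=1)
mean_z = [sum(s[i] for s in sims) / len(sims) for i in range(len(expected_z))]
```

The cost of each replicate in this sketch depends on the number of SNPs, not on the number of cases and controls, which is the property the abstract highlights.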

DECO: decompose heterogeneous population cohorts for patient stratification and discovery of sample biomarkers using omic data profiling

Bioinformatics ◽

10.1093/bioinformatics/btz148 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3651-3662 ◽

Cited By ~ 1

Author(s):

F J Campos-Laborie ◽

A Risueño ◽

M Ortiz-Estévez ◽

B Rosón-Burgo ◽

C Droste ◽

...

Keyword(s):

Correspondence Analysis ◽

Large Scale ◽

Simulated Data ◽

R Package ◽

Heterogeneous Data ◽

Supplementary Information ◽

Patient Stratification ◽

Differential Analysis ◽

Data Profiling ◽

Omic Data

Abstract Motivation Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During the last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypical factors are present. Here, we propose a method to analyze and understand heterogeneous data while avoiding classical normalization approaches that reduce or remove variation. Results DEcomposing heterogeneous Cohorts using Omic data profiling (DECO) is a method to find significant associations between biological features (biomarkers) and samples (individuals) by analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO integrates both omic data dispersion and the predictor–response relationship from non-symmetrical correspondence analysis in a single statistic (called the h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, and by comparison with seven other methods. We show that DECO greatly enhances the discovery and subtle identification of biomarkers, making it especially suited for deep and accurate patient stratification. Availability and implementation DECO is freely available as an R package (including a practical vignette) from the Bioconductor repository (http://bioconductor.org/packages/deco/). Supplementary information Supplementary data are available at Bioinformatics online.

mCSEA: detecting subtle differentially methylated regions

Bioinformatics ◽

10.1093/bioinformatics/btz096 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3257-3262 ◽

Cited By ~ 4

Author(s):

Jordi Martorell-Marugán ◽

Víctor González-Rumayor ◽

Pedro Carmona-Sáez

Keyword(s):

Gene Expression ◽

Expression Patterns ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Supplementary Information ◽

Differentially Methylated Regions ◽

Sibling Pairs ◽

Complex Disorders ◽

Obesity And Diabetes

Abstract Motivation The identification of differentially methylated regions (DMRs) among phenotypes is one of the main goals of epigenetic analysis. Although several methods have been developed to detect DMRs, most of them focus on detecting relatively large differences in methylation levels and fail to detect moderate, but consistent, methylation changes that might be associated with complex disorders. Results We present mCSEA, an R package that implements a Gene Set Enrichment Analysis method to identify DMRs from Illumina450K and EPIC array data. It is especially useful for detecting subtle, but consistent, methylation differences in complex phenotypes. mCSEA also implements functions to integrate gene expression data and to detect genes with significant correlations between methylation and gene expression patterns. Using simulated datasets we show that mCSEA outperforms other tools in detecting DMRs. In addition, we applied mCSEA to a previously published dataset of sibling pairs discordant for intrauterine hyperglycemia exposure. We found several differentially methylated promoters in genes related to metabolic disorders like obesity and diabetes, demonstrating the potential of mCSEA to identify DMRs not detected by other methods. Availability and implementation mCSEA is freely available from the Bioconductor repository. Supplementary information Supplementary data are available at Bioinformatics online.
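
To make the enrichment idea concrete, here is a minimal Python sketch of the unweighted running-sum statistic that underlies Gene Set Enrichment Analysis, applied to a ranked list of CpG probes and one candidate region. It is not mCSEA's implementation: the function name `enrichment_score`, the probe identifiers and the rankings are hypothetical, and the weighting and permutation testing used in practice are omitted. The point it illustrates is that many probes in one region, each only moderately shifted but consistently ranked near the top, yield a large score.

```python
def enrichment_score(ranked_ids, region_ids):
    """Unweighted GSEA-style running-sum score for one region.

    ranked_ids: probe IDs ordered from most to least differentially methylated.
    region_ids: probe IDs that belong to the candidate region.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    region = set(region_ids)
    n_hits = sum(1 for probe in ranked_ids if probe in region)
    n_miss = len(ranked_ids) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("region must contain some, but not all, ranked probes")

    running = 0.0
    best = 0.0
    for probe in ranked_ids:
        # Step up on a region probe, down otherwise; the walk ends at zero.
        running += 1.0 / n_hits if probe in region else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking of six probes.
ranked = ["cg01", "cg02", "cg03", "cg04", "cg05", "cg06"]

top_heavy = enrichment_score(ranked, ["cg01", "cg02", "cg04"])   # region near the top
bottom_heavy = enrichment_score(ranked, ["cg05", "cg06"])        # region at the bottom
```

A positive score indicates that the region's probes cluster at the top of the ranking, a negative score that they cluster at the bottom.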

Gene set enrichment analysis for genome-wide DNA methylation data

Genome Biology ◽

10.1186/s13059-021-02388-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jovana Maksimovic ◽

Alicia Oshlack ◽

Belinda Phipson

Keyword(s):

Dna Methylation ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Methylation Array ◽

Gene Set ◽

Genome Wide ◽

Genome Methylation ◽

Unbiased Gene ◽

Gene Set Testing

Abstract DNA methylation is one of the most commonly studied epigenetic marks, due to its role in disease and development. Illumina methylation arrays have been extensively used to measure methylation across the human genome. Methylation array analysis has primarily focused on preprocessing, normalization, and identification of differentially methylated CpGs and regions. GOmeth and GOregion are new methods for performing unbiased gene set testing following differential methylation analysis. Benchmarking analyses demonstrate GOmeth outperforms other approaches, and GOregion is the first method for gene set testing of differentially methylated regions. Both methods are publicly available in the missMethyl Bioconductor R package.

TFEA.ChIP: a tool kit for transcription factor binding site enrichment analysis capitalizing on ChIP-seq datasets

Bioinformatics ◽

10.1093/bioinformatics/btz573 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5339-5340 ◽

Cited By ~ 8

Author(s):

Laura Puente-Santamaria ◽

Wyeth W Wasserman ◽

Luis del Peso

Keyword(s):

Genomic Analysis ◽

Enrichment Analysis ◽

R Package ◽

Supplementary Information ◽

Web Based ◽

Factor Binding Site ◽

Gene Sets ◽

Transcription Regulators ◽

Computational Identification ◽

On Chip

Abstract Summary The computational identification of the transcription factors (TFs) [more generally, transcription regulators (TRs)] responsible for the co-regulation of a specific set of genes is a common problem in genomic analysis. Herein, we describe TFEA.ChIP, a tool that makes use of ChIP-seq datasets to estimate and visualize TR enrichment in gene lists representing transcriptional profiles. We validated TFEA.ChIP using a wide variety of gene sets representing signatures of genetic and chemical perturbations as input and found that the relevant TR was correctly identified in 126 of a total of 174 analyzed. Comparison with other TR enrichment tools demonstrates that TFEA.ChIP is a highly customizable package with outstanding performance. Availability and implementation TFEA.ChIP is implemented as an R package available at Bioconductor https://www.bioconductor.org/packages/devel/bioc/html/TFEA.ChIP.html and github https://github.com/LauraPS1/TFEA.ChIP_downloads. A web-based GUI to the package is also available at https://www.iib.uam.es/TFEA.ChIP/. Supplementary information Supplementary data are available at Bioinformatics online.
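
One common way to quantify enrichment of a regulator's targets in a gene list is a one-sided hypergeometric (Fisher-type) test. The Python sketch below shows that generic calculation. Whether TFEA.ChIP uses this exact test is not stated in the abstract, so treat it as an assumption; the function name `hypergeom_upper_tail` and the counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_universe, n_bound, n_list, n_overlap):
    """P(X >= n_overlap) when n_list genes are drawn without replacement
    from n_universe genes, of which n_bound are bound by the regulator."""
    total = comb(n_universe, n_list)
    upper = min(n_bound, n_list)
    tail = sum(
        comb(n_bound, k) * comb(n_universe - n_bound, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes in total, 5 bound by the regulator,
# a gene list of 4, of which 3 are bound.
p_value = hypergeom_upper_tail(n_universe=20, n_bound=5, n_list=4, n_overlap=3)
```

A small tail probability means the gene list contains more bound genes than random sampling from the universe would typically produce.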

Tumor Expression Profile Analysis Developed and Validated a Prognostic Model Based on Immune-Related Genes in Bladder Cancer

Frontiers in Genetics ◽

10.3389/fgene.2021.696912 ◽

2021 ◽

Vol 12 ◽

Author(s):

Bingqi Dong ◽

Jiaming Liang ◽

Ding Li ◽

Wenping Song ◽

Shiming Zhao ◽

...

Keyword(s):

Bladder Cancer ◽

Regression Analysis ◽

Malignant Tumors ◽

Immune Therapy ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Lasso Regression ◽

Potential Biomarker ◽

Immune Related Genes

Background: Bladder cancer (BLCA) ranks 10th in incidence among malignant tumors and 6th in incidence among malignant tumors in males. With the application of immune therapy, the overall survival (OS) rate of BLCA patients has greatly improved, but the 5-year survival rate of BLCA patients is still low. Furthermore, not every BLCA patient benefits from immunotherapy, and there are a limited number of biomarkers for predicting the immunotherapy response. Therefore, novel biomarkers for predicting the immunotherapy response and prognosis of BLCA are urgently needed. Methods: The RNA sequencing (RNA-seq) data, clinical data and gene annotation files for The Cancer Genome Atlas (TCGA) BLCA cohort were extracted from the University of California, Santa Cruz (UCSC) Xena Browser. The BLCA datasets GSE31684 and GSE32894 from the Gene Expression Omnibus (GEO) database were extracted for external validation. Immune-related genes were extracted from InnateDB. Significant differentially expressed genes (DEGs) were identified using the R package “limma,” and Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis for the DEGs were performed using the R package “clusterProfiler.” Least absolute shrinkage and selection operator (LASSO) regression analysis was used to construct the signature model. The infiltration level of each immune cell type was estimated using the single-sample gene set enrichment analysis (ssGSEA) algorithm. The performance of the model was evaluated with receiver operating characteristic (ROC) curves and calibration curves. Results: In total, 1,040 immune-related DEGs were identified, and eight signature genes were selected to construct a model using LASSO regression analysis. The risk score of BLCA patients based on the signature model was negatively correlated with OS and the immunotherapy response. The ROC curve for OS revealed that the model had good accuracy. The calibration curve showed good agreement between the predictions and actual observations. Conclusions: Herein, we constructed an immune-related eight-gene signature that could be a potential biomarker to predict the immunotherapy response and prognosis of BLCA patients.

Lipid Mini-On: mining and ontology tool for enrichment analysis of lipidomic data

Bioinformatics ◽

10.1093/bioinformatics/btz250 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4507-4508 ◽

Cited By ~ 9

Author(s):

Geremy Clair ◽

Sarah Reehl ◽

Kelly G Stratton ◽

Matthew E Monroe ◽

Malak M Tfaily ◽

...

Keyword(s):

Peat Soil ◽

Enrichment Analysis ◽

R Package ◽

Lipid Classes ◽

Supplementary Information ◽

Mass Spec ◽

Shiny App ◽

Lung Endothelial Cells ◽

Lipid Enrichment ◽

The Individual

Abstract Summary Here we introduce Lipid Mini-On, an open-source tool that performs lipid enrichment analyses and visualizations of lipidomics data. Lipid Mini-On uses a text-mining process to bin individual lipid names into multiple lipid ontology groups based on the classification (e.g. LipidMaps) and other characteristics, such as chain length. Lipid Mini-On provides users with the capability to conduct enrichment analysis of the lipid ontology terms using a Shiny app with options of five statistical approaches. Lipid classes can be added to customize the user’s database and remain updated as new lipid classes are discovered. Visualization of results is available for all classification options (e.g. lipid subclass and individual fatty acid chains). Results are also visualized through an editable network of relationships between the individual lipids and their associated lipid ontology terms. The utility of the tool is demonstrated using biological (e.g. human lung endothelial cells) and environmental (e.g. peat soil) samples. Availability and implementation Rodin (R package: https://github.com/PNNL-Comp-Mass-Spec/Rodin), Lipid Mini-On Shiny app (https://github.com/PNNL-Comp-Mass-Spec/LipidMiniOn) and Lipid Mini-On online tool (https://omicstools.pnnl.gov/shiny/lipid-mini-on/). Supplementary information Supplementary data are available at Bioinformatics online.

The G Protein-Coupled Estrogen Receptor (GPER) Expression Correlates with Pro-Metastatic Pathways in ER-Negative Breast Cancer: A Bioinformatics Analysis

Cells ◽

10.3390/cells9030622 ◽

2020 ◽

Vol 9 (3) ◽

pp. 622 ◽

Cited By ~ 4

Author(s):

Marianna Talia ◽

Ernestina De Francesco ◽

Damiano Rigiracciolo ◽

Maria Muoio ◽

Lucia Muglia ◽

...

Keyword(s):

Breast Cancer ◽

Estrogen Receptor ◽

Signaling Pathways ◽

G Protein ◽

Bioinformatics Analysis ◽

Enrichment Analysis ◽

R Package ◽

Gene Set Enrichment Analysis ◽

Pathway Enrichment Analysis ◽

G Protein Coupled

The G protein-coupled estrogen receptor (GPER, formerly known as GPR30) is a seven-transmembrane receptor that mediates estrogen signals in both normal and malignant cells. In particular, GPER has been implicated in the activation of diverse signaling pathways toward transcriptional and biological responses that characterize the progression of breast cancer (BC). In this context, a correlation between GPER expression and worse clinical-pathological features of BC has been suggested, although controversial data have also been reported. In order to better assess the biological significance of GPER in the aggressive estrogen receptor (ER)-negative BC, we performed a bioinformatics analysis using the information provided by The Invasive Breast Cancer Cohort of The Cancer Genome Atlas (TCGA) project and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets. Gene expression correlation and the statistical analysis were carried out with R studio base functions and the tidyverse package. Pathway enrichment analysis was evaluated with Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway on the Database for Annotation, Visualization and Integrated Discovery (DAVID) website, whereas gene set enrichment analysis (GSEA) was performed with the R package phenoTest. The survival analysis was performed with the R package survivALL. Analyzing the expression data of more than 2500 primary BC, we ascertained that GPER levels are associated with pro-migratory and metastatic genes belonging to cell adhesion molecules (CAMs), extracellular matrix (ECM)-receptor interaction, and focal adhesion (FA) signaling pathways. Thereafter, evaluating the disease-free interval (DFI) in ER-negative BC patients, we found that the subjects expressing high GPER levels exhibited a shorter DFI than those exhibiting low GPER levels. Overall, our results may pave the way to further dissect the network triggered by GPER in the breast malignancies lacking ER toward a better assessment of its prognostic significance and the action elicited in mediating the aggressive features of the aforementioned BC subtype.

RMTL: an R library for multi-task learning

Bioinformatics ◽

10.1093/bioinformatics/bty831 ◽

2018 ◽

Vol 35 (10) ◽

pp. 1797-1798 ◽

Cited By ~ 2

Author(s):

Han Cao ◽

Jiayu Zhou ◽

Emanuel Schwarz

Keyword(s):

Biological Networks ◽

Simulated Data ◽

R Package ◽

Low Rank ◽

Supplementary Information ◽

Supplementary Data ◽

Software Environment ◽

Machine Learning Technique ◽

Task Learning ◽

Learning Technique

Abstract Motivation Multi-task learning (MTL) is a machine learning technique for simultaneous learning of multiple related classification or regression tasks. Despite its increasing popularity, MTL algorithms are currently not available in the widely used software environment R, creating a bottleneck for their application in biomedical research. Results We developed an efficient, easy-to-use R library for MTL (www.r-project.org) comprising 10 algorithms applicable for regression, classification, joint predictor selection, task clustering, low-rank learning and incorporation of biological networks. We demonstrate the utility of the algorithms using simulated data. Availability and implementation The RMTL package is an open source R package and is freely available at https://github.com/transbioZI/RMTL. RMTL will also be available on cran.r-project.org. Supplementary information Supplementary data are available at Bioinformatics online.

Detecting and correcting misclassified sequences in the large-scale public databases

Bioinformatics ◽

10.1093/bioinformatics/btaa586 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4699-4705

Author(s):

Hamid Bagheri ◽

Andrew J Severin ◽

Hridesh Rajan

Keyword(s):

Large Scale ◽

Sequence Similarity ◽

Heuristic Method ◽

Simulated Data ◽

Supplementary Information ◽

Small Subset ◽

Taxonomic Assignment ◽

User Input ◽

Public Repositories ◽

Taxonomic Assignments

Abstract Motivation As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity. Results We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability and implementation Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr. Supplementary information Supplementary data are available at Bioinformatics online.
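
For readers less familiar with the two reported metrics, the short Python sketch below shows how precision and recall are computed from counts of true positives, false positives and false negatives. The counts are hypothetical, chosen only so that the result rounds to the reported 97% and 87%; they are not taken from the paper. The function name `precision_recall` is likewise illustrative.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a binary detection task."""
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)  # flagged proteins that are truly misclassified
    recall = true_pos / (true_pos + false_neg)     # truly misclassified proteins that were flagged
    return precision, recall


# Hypothetical counts: 1000 simulated misclassified proteins, 870 detected,
# plus 27 correctly classified proteins flagged by mistake.
precision, recall = precision_recall(true_pos=870, false_pos=27, false_neg=130)
```

High precision with somewhat lower recall means that flagged proteins are very likely to be genuine misclassifications, while a minority of misclassified proteins go undetected.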
