pathwayPCA: an R package for integrative pathway analysis with modern PCA methodology and gene selection

2019 ◽  
Author(s):  
Gabriel J. Odom ◽  
Yuguang Ban ◽  
Lizhong Liu ◽  
Xiaodian Sun ◽  
Alexander R. Pico ◽  
...  

ABSTRACT With the advance in high-throughput technology for molecular assays, multi-omics datasets have become increasingly available. However, most currently available pathway analysis software provides little or no functionality for analyzing multiple types of -omics data simultaneously. In addition, most tools do not provide sample-specific estimates of pathway activities, which are important for precision medicine. To address these challenges, we present pathwayPCA, a unique R package for integrative pathway analysis that utilizes modern statistical methodology including supervised PCA and adaptive elastic-net PCA for principal component analysis. pathwayPCA can analyze continuous, binary, and survival outcomes in studies with multiple covariate and/or interaction effects. We provide three case studies to illustrate pathway analysis with gene selection, integrative analysis of multi-omics datasets to identify driver genes, estimating and visualizing sample-specific pathway activities in ovarian cancer, and identifying sex-specific pathway effects in kidney cancer. pathwayPCA is an open source R package, freely available to the research community. We expect pathwayPCA to be a useful tool for empowering the wider scientific community in the analysis and interpretation of the wealth of multi-omics data recently made available by TCGA, CPTAC and other large consortiums.

Author(s):  
Xi Chen

Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
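The two-stage strategy and the permutation shortcut described in this abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the abstract: plain PCA computed by power iteration stands in for the adaptive elastic-net sparse PCA, the outcome is continuous, and association is measured by absolute Pearson correlation rather than a full regression model. The function names `first_pc_scores`, `abs_corr`, and `permutation_pvalue` are hypothetical and do not come from any package.

```python
import random


def first_pc_scores(rows, iters=200):
    """Stage 1: per-sample scores on the first principal component.

    rows: list of samples, each a list of expression values for the
    genes in one pathway. Plain PCA by power iteration is used here as
    an assumed stand-in for sparse PCA; the outcome is never consulted.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / p ** 0.5] * p  # uniform starting direction (assumption)
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        # Multiply by X^T to complete one application of X^T X.
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_pvalue(scores, outcome, n_perm=1000, seed=0):
    """Stage 2: permutation test of association with the outcome.

    The latent scores are passed in already computed, so each
    permutation only shuffles the outcome labels; the PCA step is
    not repeated.
    """
    rng = random.Random(seed)
    observed = abs_corr(scores, outcome)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs_corr(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Calling `first_pc_scores` once per pathway and then passing its result to `permutation_pvalue` mirrors the efficiency argument: because the latent variable is estimated without the outcome, only the cheap correlation step is repeated for each permutation resample.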


2016 ◽  
Vol 113 (51) ◽  
pp. 14662-14667 ◽  
Author(s):  
Zhixiang Lin ◽  
Can Yang ◽  
Ying Zhu ◽  
John Duchi ◽  
Yao Fu ◽  
...  

Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.


Author(s):  
Martin Pirkl ◽  
Niko Beerenwinkel

Abstract Motivation: Cancer is one of the most prevalent diseases in the world. Tumors arise due to important genes changing their activity, e.g. when inhibited or over-expressed. But these gene perturbations are difficult to observe directly. Molecular profiles of tumors can provide indirect evidence of gene perturbations. However, inferring perturbation profiles from molecular alterations is challenging due to error-prone molecular measurements and incomplete coverage of all possible molecular causes of gene perturbations. Results: We have developed a novel mathematical method to analyze cancer driver genes and their patient-specific perturbation profiles. We combine genetic aberrations with gene expression data in a causal network derived across patients to infer unobserved perturbations. We show that our method can predict perturbations in simulations, CRISPR perturbation screens and breast cancer samples from The Cancer Genome Atlas. Availability and implementation: The method is available as the R-package nempi at https://github.com/cbg-ethz/nempi and http://bioconductor.org/packages/nempi. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Kota Fujisawa ◽  
Mamoru Shimo ◽  
Y.-H. Taguchi ◽  
Shinya Ikematsu ◽  
Ryota Miyata

Abstract Coronavirus disease 2019 (COVID-19) is raging worldwide. This potentially fatal infectious disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, the complete mechanism of COVID-19 is not well understood. Therefore, we analyzed gene expression profiles of COVID-19 patients to identify disease-related genes through an innovative machine learning method that enables a data-driven strategy for gene selection from a data set with a small number of samples and many candidates. Principal-component-analysis-based unsupervised feature extraction (PCAUFE) was applied to the RNA expression profiles of 16 COVID-19 patients and 18 healthy control subjects. The results identified 123 genes as critical for COVID-19 progression from 60,683 candidate probes, including immune-related genes. The 123 genes were enriched in binding sites for transcription factors NFKB1 and RELA, which are involved in various biological phenomena such as immune response and cell survival: the primary mediator of canonical nuclear factor-kappa B (NF-κB) activity is the heterodimer RelA-p50. The genes were also enriched in histone modification H3K36me3, and they largely overlapped the target genes of NFKB1 and RELA. We found that the overlapping genes were downregulated in COVID-19 patients. These results suggest that canonical NF-κB activity was suppressed by H3K36me3 in COVID-19 patient blood.


2018 ◽  
Vol 17 ◽  
pp. 117693511877108 ◽  
Author(s):  
Min Wang ◽  
Steven M Kornblau ◽  
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
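The first challenge named in this abstract, deciding how many principal components are significant, can be made concrete with a small Python sketch. It implements the broken-stick rule, a classical heuristic chosen here only for illustration; it is not the automated graphical Bayesian method the abstract describes. The function name `broken_stick_count` is hypothetical.

```python
def broken_stick_count(eigenvalues):
    """Count leading components that pass the broken-stick rule.

    eigenvalues: variances explained by each principal component.
    Component k (1-based, out of p) is retained while its share of the
    total variance exceeds the expected share of the k-th longest piece
    of a stick broken at random into p pieces:
        b_k = (1 / p) * sum_{i=k}^{p} 1 / i
    Counting stops at the first component that fails.
    """
    p = len(eigenvalues)
    total = sum(eigenvalues)
    if p == 0 or total <= 0:
        return 0
    ordered = sorted(eigenvalues, reverse=True)
    count = 0
    for k in range(1, p + 1):
        expected = sum(1.0 / i for i in range(k, p + 1)) / p
        if ordered[k - 1] / total > expected:
            count += 1
        else:
            break
    return count
```

For example, eigenvalues of 5.0, 3.0, 0.5, 0.3 and 0.2 give variance shares of about 0.56 and 0.33 for the first two components, above the broken-stick thresholds of about 0.46 and 0.26, while the third share of about 0.06 falls below its threshold of about 0.16, so two components are retained.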


2011 ◽  
Author(s):  
Stephen C. Benz ◽  
Charles Vaske ◽  
Sam Ng ◽  
John Zachary Sanborn ◽  
Jing Zhu ◽  
...  

2020 ◽  
Author(s):  
Kumari Sonal Choudhary ◽  
Eoin Fahy ◽  
Kevin Coakley ◽  
Manish Sud ◽  
Mano R Maurya ◽  
...  

ABSTRACT With the advent of high throughput mass spectrometric methods, metabolomics has emerged as an essential area of research in biomedicine with the potential to provide deep biological insights into normal and diseased functions in physiology. However, to achieve the potential offered by metabolomics measures, there is a need for biologist-friendly integrative analysis tools that can transform data into mechanisms that relate to phenotypes. Here, we describe MetENP, an R package, and a user-friendly web application deployed at the Metabolomics Workbench site extending the metabolomics enrichment analysis to include species-specific pathway analysis, pathway enrichment scores, gene-enzyme information, and enzymatic activities of the significantly altered metabolites. MetENP provides a highly customizable workflow through various user-specified options and includes support for all metabolite species with available KEGG pathways. MetENPweb is a web application for calculating metabolite and pathway enrichment analysis. Availability and Implementation: The MetENP package is freely available from Metabolomics Workbench GitHub (https://github.com/metabolomicsworkbench/MetENP); the web application is freely available at https://www.metabolomicsworkbench.org/data/analyze.php.


2019 ◽  
Vol 47 (13) ◽  
pp. 6642-6655 ◽  
Author(s):  
Nadav Brandes ◽  
Nathan Linial ◽  
Michal Linial

Abstract Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10 000 human cancerous samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.

